Item 29: Minimize lock windows



Item 29: Minimize lock windows

Readers of Effective Java [Bloch] will undoubtedly remember this advice from Item 49: "As a rule, you should do as little work as possible inside synchronized regions." This same principle holds true for enterprise systems, where locks in shared resources create contention, restricting the ultimate scalability of the system.

To start with, even in the absence of any kind of contention, the act of acquiring an object monitor and later releasing it is not trivial. In fact, given the performance improvements in recent JVMs regarding object creation, where we used to say that "the most expensive thing you can do in Java is to create an object" (leading to all sorts of pooling mechanisms that are now largely unnecessary—see Item 72), it's now fair to say that "the most expensive thing you can do in Java is to acquire an object monitor."

Consider this trivialized example, in which one thread calls a nonsynchronized method 35 million times and after that an exactly equivalent synchronized method another 35 million times, then compares the resultant timings. We're going to write three different versions of this trivial bit of code—one that uses no synchronization at all, one that uses a synchronized modifier on the method call itself, and one that uses a more coarse-grained synchronization approach. Execution will be in two parts—the first part will execute a method 35 million times from a single thread, then we'll spin off a command-line number of threads and let them each execute 35 million times as well. First, the unsynchronized version:






// Driver.java

//

// UnsyncDriver: just let 'em rip

//

public class UnsyncDriver

{

  private static int callCount = 0;



  public static void call()

  {

    // Do something to avoid dead code removal optimizations

    //

    callCount++;

  }



  static class CallRunner implements Runnable

  {

    public void run()

    {

      long start = System.currentTimeMillis();

      for (int i=0; i<35000000; i++)

        call();

      long end = System.currentTimeMillis();

      System.out.println(Thread.currentThread().getName() +

        " Call() Start: " + start + "ms " +

        "End: " + end + "ms Diff = " +

        (end - start) + "ms");

    }

  }



  public static void main(String[] args)

  {

    // Warm up the JIT

    for (int i=0; i<100000; i++)

    {

      call();

    }



    long start = System.currentTimeMillis();

    for (int i=0; i<35000000; i++)

      call();

    long end = System.currentTimeMillis();

    System.out.println("Call() Start: " + start + "ms " +

                       "End: " + end + " ms " +

                       "Diff = " + (end - start) + " ms");



    if (args.length > 0)

    {

      int threads = Integer.parseInt(args[0]);

      System.out.println("----------- Executing " + threads +

                         " threads. -----------");



      Thread[] threadArray = new Thread[threads];

      for (int i=0; i<threads; i++)

        threadArray[i] =

          new Thread(new CallRunner(), "Thread " + i);



      for (int i=0; i<threads; i++)

        threadArray[i].start();

    }

  }

}


The synchronized version is almost identical except that the call method is decorated with the synchronized keyword, and the coarse-grained version uses a private Object lock to lock against in the body of the method itself:






// FineSyncDriver.java

//

// Just like UnsyncDriver, except for the following

public class FineSyncDriver

{

  public static synchronized void syncCall()

  {

    // Do something to avoid dead code removal optimizations

    //

    syncCallCount++;

  }

}



// CoarseSyncDriver.java

//

// Just like UnsyncDriver, except for the following

public class CoarseSyncDriver

{

 private static int callCount = 0;

 private static Object lock = new Object();

 public static void call()

 {

  // Do something so as to avoid dead code removal

  // optimizations

  //

  syncronized(lock)

  {

   callCount++;

  }

 }

}


Running the three versions with 10 threads yields some interesting insights, as shown in the following output:






C:\Projects\EEJ\code> java UnsyncDriver 10

Call() Start: 1076330441441ms End: 1076330441651 ms

                                   Diff = 210 ms

public class CoarseSyncDriver

{

----------- Executing 10 threads. -----------

Thread 1 Call() Start: 1076330441721ms

                End: 1076330442402ms Diff = 681ms

Thread 3 Call() Start: 1076330441841ms

                End: 1076330442512ms Diff = 671ms

Thread 0 Call() Start: 1076330441661ms

                End: 1076330442572ms Diff = 911ms

Thread 2 Call() Start: 1076330441781ms

                End: 1076330442572ms Diff = 791ms

Thread 4 Call() Start: 1076330442142ms

                End: 1076330443043ms Diff = 901ms

Thread 5 Call() Start: 1076330442202ms

                End: 1076330443343ms Diff = 1141ms

Thread 7 Call() Start: 1076330442803ms

                End: 1076330443403ms Diff = 600ms

Thread 8 Call() Start: 1076330442753ms

                End: 1076330443403ms Diff = 650ms

Thread 9 Call() Start: 1076330442702ms

                End: 1076330443413ms Diff = 711ms

Thread 6 Call() Start: 1076330442262ms

                End: 1076330443413ms Diff = 1151ms



C:\Projects\EEJ\code> java FineSyncDriver 10

FineCall() Start: 1076330565970ms

           End: 1076330566731 ms Diff = 761 ms

------------ Executing 10 threads. ------------

Thread 0 FineSyncCall() Start: 1076330566731ms

                        End: 1076331238256ms Diff = 671525ms

Thread 8 FineSyncCall() Start: 1076330566791ms

                        End: 1076331310000ms Diff = 743209ms

Thread 4 FineSyncCall() Start: 1076330566791ms

                        End: 1076331310010ms Diff = 743219ms

Thread 9 FineSyncCall() Start: 1076330566791ms

                        End: 1076331310010ms Diff = 743219ms

Thread 2 FineSyncCall() Start: 1076330566791ms

                        End: 1076331310010ms Diff = 743219ms

Thread 7 FineSyncCall() Start: 1076330566791ms

                        End: 1076331310010ms Diff = 743219ms

Thread 1 FineSyncCall() Start: 1076330566791ms

                        End: 1076331310010ms Diff = 743219ms

Thread 3 FineSyncCall() Start: 1076330566791ms

                        End: 1076331310010ms Diff = 743219ms

Thread 6 FineSyncCall() Start: 1076330566791ms

                        End: 1076331310010ms Diff = 743219ms

Thread 5 FineSyncCall() Start: 1076330566791ms

                        End: 1076331310010ms Diff = 743219ms



C:\Projects\EEJ\code> java CoarseSyncDriver 10

CoarseCall() Start: 1076331255291ms

             End: 1076331255361 ms Diff = 70 ms

------------ Executing 10 threads. ------------

Thread 0 Call() Start: 1076331255371ms

                End: 1076331255641ms Diff = 270ms

Thread 1 Call() Start: 1076331255431ms

                End: 1076331255942ms Diff = 511ms

Thread 2 Call() Start: 1076331255431ms

                End: 1076331256252ms Diff = 821ms

Thread 3 Call() Start: 1076331255431ms

                End: 1076331256553ms Diff = 1122ms

Thread 4 Call() Start: 1076331255431ms

                End: 1076331256863ms Diff = 1432ms

Thread 5 Call() Start: 1076331255431ms

                End: 1076331257164ms Diff = 1733ms

Thread 6 Call() Start: 1076331255431ms

                End: 1076331257474ms Diff = 2043ms

Thread 7 Call() Start: 1076331255431ms

                End: 1076331257774ms Diff = 2343ms

Thread 8 Call() Start: 1076331255431ms

                End: 1076331258075ms Diff = 2644ms

Thread 9 Call() Start: 1076331255491ms

                End: 1076331258265ms Diff = 2774ms



C:\Projects\EEJ\code>


First, we can clearly see that the unsynchronized version is the fastest, taking anywhere from 600 to 1200 milliseconds to execute; the differences are probably due to the thread scheduler and which threads happen to get launched first (and therefore get the longest run before having to yield the CPU). By the way, take careful note that the 10 threads executing the unsynchronized method could easily be corrupting its value entirely, something I'm completely ignoring at the moment. You don't have that luxury, so don't expect that you can improve the performance and scalability of your systems simply by removing synchronized blocks. Your code might run faster, but the late nights debugging the production server just aren't worth it.

Second, we can also clearly see that the coarse-grained synchronization version takes a lot less time than the fine-grained synchronization version, but a curious thing kicks in: it's as if the time spent executing is linear in proportion to when the thread was kicked off. In fact, this is precisely what's happening—the first thread scheduled grabs the lock and holds it for the entire 35-million-iteration loop, never yielding it, so the other threads have to wait until that lock gets released before they can contend for it. By the time the last thread gets the lock, that thread has been waiting a while.

Third, the fine-grained mechanism, where the monitor is acquired and released on each and every call, yields the most even numbers, yet the worst: roughly 740 seconds, or a full 12 minutes before execution completes for each thread. And this is just for 10 threads; imagine what 20 or 50 or 100 threads will require.

Don't draw too much from a single run, but clearly synchronization has its costs, long before we consider the amount of time spent inside the synchronized region.

To put it in simple one-sentence terms, contention is the enemy of scalability. The more contention in the system, the more time we spend waiting, which in turn bloats the latency of the system as a whole. We can't scale a system that suffers from contention problems because, remember, the point of scalability is to be able to add users to the system simply by throwing more hardware at the problem—all the hardware in the world won't help us if the latency is caused by everyone having to wait to share a single lock.

Toward that end, keep two remedies in mind. First, avoid taking locks when you can help it—even if nobody else ends up contesting you for the lock, it still takes up precious time just to acquire the monitor and yield it back again. This means don't synchronize the entire doGet or doPost method of a servlet "just to be safe." Instead, spend the time analyzing the code to determine where the possible hot spots are and synchronize only those hot spots.

Second, keep the time spent inside the synchronized region as small as possible; I'll leave it as an exercise to the unconvinced reader to put a Thread.sleep call in the call method to simulate holding the lock for some length of time, experimenting with varying numbers of threads. Suffice it to say, the longer the call window, and the higher the number of threads, the worse (exponentially so) the times become. (Do yourself a favor, though, if you plan to run the tests yourself: cut the number of iterations down from 35 million to something a bit more reasonable, or you'll spend a lot of time waiting for the tests to complete.)

Of course, other remedies exist, too. One approach is to prefer to work with immutable resources (objects, data, whatever) that need no synchronization because they can't change, as described in Item 38. Another is to offload the resource processing into a queue (per Item 20) where it can be handled in asynchronous-yet-serialized fashion, much as a batch process does. The point is still the same: lock as little as possible for as short a time as possible.