Oct. 15, 2009, 3:21 p.m.
posted by oxy
Item 29: Minimize lock windowsReaders of Effective Java [Bloch] will undoubtedly remember this advice from Item 49: "As a rule, you should do as little work as possible inside synchronized regions." This same principle holds true for enterprise systems, where locks in shared resources create contention, restricting the ultimate scalability of the system. To start with, even in the absence of any kind of contention, the act of acquiring an object monitor and later releasing it is not trivial. In fact, given the performance improvements in recent JVMs regarding object creation, where we used to say that "the most expensive thing you can do in Java is to create an object" (leading to all sorts of pooling mechanisms that are now largely unnecessary—see Item 72), it's now fair to say that "the most expensive thing you can do in Java is to acquire an object monitor." Consider this trivialized example, in which one thread calls a nonsynchronized method 35 million times and after that an exactly equivalent synchronized method another 35 million times, then compares the resultant timings. We're going to write three different versions of this trivial bit of code—one that uses no synchronization at all, one that uses a synchronized modifier on the method call itself, and one that uses a more coarse-grained synchronization approach. Execution will be in two parts—the first part will execute a method 35 million times from a single thread, then we'll spin off a command-line number of threads and let them each execute 35 million times as well. First, the unsynchronized version:
// Driver.java
//
// UnsyncDriver: just let 'em rip
//
public class UnsyncDriver
{
private static int callCount = 0;
public static void call()
{
// Do something to avoid dead code removal optimizations
//
callCount++;
}
static class CallRunner implements Runnable
{
public void run()
{
long start = System.currentTimeMillis();
for (int i=0; i<35000000; i++)
call();
long end = System.currentTimeMillis();
System.out.println(Thread.currentThread().getName() +
" Call() Start: " + start + "ms " +
"End: " + end + "ms Diff = " +
(end - start) + "ms");
}
}
public static void main(String[] args)
{
// Warm up the JIT
for (int i=0; i<100000; i++)
{
call();
}
long start = System.currentTimeMillis();
for (int i=0; i<35000000; i++)
call();
long end = System.currentTimeMillis();
System.out.println("Call() Start: " + start + "ms " +
"End: " + end + " ms " +
"Diff = " + (end - start) + " ms");
if (args.length > 0)
{
int threads = Integer.parseInt(args[0]);
System.out.println("----------- Executing " + threads +
" threads. -----------");
Thread[] threadArray = new Thread[threads];
for (int i=0; i<threads; i++)
threadArray[i] =
new Thread(new CallRunner(), "Thread " + i);
for (int i=0; i<threads; i++)
threadArray[i].start();
}
}
}
The synchronized version is almost identical except that the call method is decorated with the synchronized keyword, and the coarse-grained version uses a private Object lock to lock against in the body of the method itself:
// FineSyncDriver.java
//
// Just like UnsyncDriver, except for the following
public class FineSyncDriver
{
public static synchronized void syncCall()
{
// Do something to avoid dead code removal optimizations
//
syncCallCount++;
}
}
// CoarseSyncDriver.java
//
// Just like UnsyncDriver, except for the following
public class CoarseSyncDriver
{
private static int callCount = 0;
private static Object lock = new Object();
public static void call()
{
// Do something so as to avoid dead code removal
// optimizations
//
syncronized(lock)
{
callCount++;
}
}
}
Running the three versions with 10 threads yields some interesting insights, as shown in the following output:
C:\Projects\EEJ\code> java UnsyncDriver 10
Call() Start: 1076330441441ms End: 1076330441651 ms
Diff = 210 ms
public class CoarseSyncDriver
{
----------- Executing 10 threads. -----------
Thread 1 Call() Start: 1076330441721ms
End: 1076330442402ms Diff = 681ms
Thread 3 Call() Start: 1076330441841ms
End: 1076330442512ms Diff = 671ms
Thread 0 Call() Start: 1076330441661ms
End: 1076330442572ms Diff = 911ms
Thread 2 Call() Start: 1076330441781ms
End: 1076330442572ms Diff = 791ms
Thread 4 Call() Start: 1076330442142ms
End: 1076330443043ms Diff = 901ms
Thread 5 Call() Start: 1076330442202ms
End: 1076330443343ms Diff = 1141ms
Thread 7 Call() Start: 1076330442803ms
End: 1076330443403ms Diff = 600ms
Thread 8 Call() Start: 1076330442753ms
End: 1076330443403ms Diff = 650ms
Thread 9 Call() Start: 1076330442702ms
End: 1076330443413ms Diff = 711ms
Thread 6 Call() Start: 1076330442262ms
End: 1076330443413ms Diff = 1151ms
C:\Projects\EEJ\code> java FineSyncDriver 10
FineCall() Start: 1076330565970ms
End: 1076330566731 ms Diff = 761 ms
------------ Executing 10 threads. ------------
Thread 0 FineSyncCall() Start: 1076330566731ms
End: 1076331238256ms Diff = 671525ms
Thread 8 FineSyncCall() Start: 1076330566791ms
End: 1076331310000ms Diff = 743209ms
Thread 4 FineSyncCall() Start: 1076330566791ms
End: 1076331310010ms Diff = 743219ms
Thread 9 FineSyncCall() Start: 1076330566791ms
End: 1076331310010ms Diff = 743219ms
Thread 2 FineSyncCall() Start: 1076330566791ms
End: 1076331310010ms Diff = 743219ms
Thread 7 FineSyncCall() Start: 1076330566791ms
End: 1076331310010ms Diff = 743219ms
Thread 1 FineSyncCall() Start: 1076330566791ms
End: 1076331310010ms Diff = 743219ms
Thread 3 FineSyncCall() Start: 1076330566791ms
End: 1076331310010ms Diff = 743219ms
Thread 6 FineSyncCall() Start: 1076330566791ms
End: 1076331310010ms Diff = 743219ms
Thread 5 FineSyncCall() Start: 1076330566791ms
End: 1076331310010ms Diff = 743219ms
C:\Projects\EEJ\code> java CoarseSyncDriver 10
CoarseCall() Start: 1076331255291ms
End: 1076331255361 ms Diff = 70 ms
------------ Executing 10 threads. ------------
Thread 0 Call() Start: 1076331255371ms
End: 1076331255641ms Diff = 270ms
Thread 1 Call() Start: 1076331255431ms
End: 1076331255942ms Diff = 511ms
Thread 2 Call() Start: 1076331255431ms
End: 1076331256252ms Diff = 821ms
Thread 3 Call() Start: 1076331255431ms
End: 1076331256553ms Diff = 1122ms
Thread 4 Call() Start: 1076331255431ms
End: 1076331256863ms Diff = 1432ms
Thread 5 Call() Start: 1076331255431ms
End: 1076331257164ms Diff = 1733ms
Thread 6 Call() Start: 1076331255431ms
End: 1076331257474ms Diff = 2043ms
Thread 7 Call() Start: 1076331255431ms
End: 1076331257774ms Diff = 2343ms
Thread 8 Call() Start: 1076331255431ms
End: 1076331258075ms Diff = 2644ms
Thread 9 Call() Start: 1076331255491ms
End: 1076331258265ms Diff = 2774ms
C:\Projects\EEJ\code>
First, we can clearly see that the unsynchronized version is the fastest, taking anywhere from 600 to 1200 milliseconds to execute; the differences are probably due to the thread scheduler and which threads happen to get launched first (and therefore get the longest run before having to yield the CPU). By the way, take careful note that the 10 threads executing the unsynchronized method could easily be corrupting its value entirely, something I'm completely ignoring at the moment. You don't have that luxury, so don't expect that you can improve the performance and scalability of your systems simply by removing synchronized blocks. Your code might run faster, but the late nights debugging the production server just aren't worth it. Second, we can also clearly see that the coarse-grained synchronization version takes a lot less time than the fine-grained synchronization version, but a curious thing kicks in: it's as if the time spent executing is linear in proportion to when the thread was kicked off. In fact, this is precisely what's happening—the first thread scheduled grabs the lock and holds it for the entire 35-million-iteration loop, never yielding it, so the other threads have to wait until that lock gets released before they can contend for it. By the time the last thread gets the lock, that thread has been waiting a while. Third, the fine-grained mechanism, where the monitor is acquired and released on each and every call, yields the most even numbers, yet the worst: roughly 740 seconds, or a full 12 minutes before execution completes for each thread. And this is just for 10 threads; imagine what 20 or 50 or 100 threads will require. Don't draw too much from a single run, but clearly synchronization has its costs, long before we consider the amount of time spent inside the synchronized region. To put it in simple one-sentence terms, contention is the enemy of scalability. The more contention in the system, the more time we spend waiting, which in turn bloats the latency of the system as a whole. We can't scale a system that suffers from contention problems because, remember, the point of scalability is to be able to add users to the system simply by throwing more hardware at the problem—all the hardware in the world won't help us if the latency is caused by everyone having to wait to share a single lock. Toward that end, keep two remedies in mind. First, avoid taking locks when you can help it—even if nobody else ends up contesting you for the lock, it still takes up precious time just to acquire the monitor and yield it back again. This means don't synchronize the entire doGet or doPost method of a servlet "just to be safe." Instead, spend the time analyzing the code to determine where the possible hot spots are and synchronize only those hot spots. Second, keep the time spent inside the synchronized region as small as possible; I'll leave it as an exercise to the unconvinced reader to put a Thread.sleep call in the call method to simulate holding the lock for some length of time, experimenting with varying numbers of threads. Suffice it to say, the longer the call window, and the higher the number of threads, the worse (exponentially so) the times become. (Do yourself a favor, though, if you plan to run the tests yourself: cut the number of iterations down from 35 million to something a bit more reasonable, or you'll spend a lot of time waiting for the tests to complete.) Of course, other remedies exist, too. One approach is to prefer to work with immutable resources (objects, data, whatever) that need no synchronization because they can't change, as described in Item 38. Another is to offload the resource processing into a queue (per Item 20) where it can be handled in asynchronous-yet-serialized fashion, much as a batch process does. The point is still the same: lock as little as possible for as short a time as possible. |
- Comment