Feb. 6, 2007, 10:42 p.m.
posted by oxy
Item 17: Recognize the cost of network access

Although this usually doesn't come as much of a surprise to developers when they stop to think about it, it costs a great deal of time and effort (measured in CPU cycles) to move data across the network. What they often don't realize is how much more expensive it is than a local call: usually about three orders of magnitude (i.e., 1,000 times), if not more.

Prove it to yourself. Let's say we design a simple API interface that will be implemented in three ways: once inline (i.e., putting the code directly in the caller rather than making a function call, to test the JVM's efficiency in making method calls), once as a standard in-memory object, and once as an RMI-exported object. We'll host the registry in the same process to keep things simple, and even run it all on the same machine to reduce wire transmit time to zero; in other words, the optimal situation we can create for remote objects. The driver behind this test is listed here; IApi is our simple interface, ApiImpl is our RMI-exported implementation, and Driver is, as its name implies, the driver behind the test.
// IApi interface
//
public interface IApi extends java.rmi.Remote {
    public int function(int k, int i)
        throws java.rmi.RemoteException;
}
//
// ApiImpl not shown here for brevity
//
// Driver code
//
public class Driver
{
    public static final int J_LOOP = 7;
    public static final int I_LOOP = 5000000;

    public static void main(String[] args)
    {
        init();
        noFunctionCall();
        functionCall();
        noFunctionCall();
        functionCall();
        rmiCallOnLocalHost();
        System.exit(0);
    }

    private static void init()
    {
        int k = 0;
        // Warm everything up (ClassLoading, JIT, etc.)
        //
        System.currentTimeMillis();
        for (int i = 0; i < I_LOOP / 1000; ++i)
        {
            for (int j = 0; j < J_LOOP; ++j)
            {
                // Do something to avoid removal
                if (j < i)
                {
                    ++k;
                }
                else
                {
                    --k;
                }
            }
        }
        System.currentTimeMillis();
        ApiImpl.startServer();
        try
        {
            IApi api = (IApi) java.rmi.Naming.lookup("API");
            for (int i = 0; i < I_LOOP / 100000; ++i)
            {
                api.function(k, i);
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
            System.exit(1);
        }
    }

    public static void noFunctionCall()
    {
        int k = 0;
        // Now do real timings. This is used to remove all
        // computing time from other tests
        long start = System.currentTimeMillis();
        for (int i = 0; i < I_LOOP; ++i)
        {
            for (int j = 0; j < J_LOOP; ++j)
            {
                // Do something to avoid removal
                if (j < i)
                {
                    ++k;
                }
                else
                {
                    --k;
                }
            }
        }
        long end = System.currentTimeMillis();
        displayResults("No Function Call", k, start, end);
    }

    private static void rmiCallOnLocalHost()
    {
        int k = 0;
        try
        {
            IApi api = (IApi) java.rmi.Naming.lookup("API");
            long start = System.currentTimeMillis();
            for (int i = 0; i < I_LOOP; ++i)
            {
                k = api.function(k, i);
            }
            long end = System.currentTimeMillis();
            displayResults("API Call", k, start, end);
        }
        catch (Exception e)
        {
            e.printStackTrace();
            System.exit(1);
        }
    }

    public static void functionCall()
    {
        int k = 0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < I_LOOP; ++i)
        {
            k = function(k, i);
        }
        long end = System.currentTimeMillis();
        displayResults("Function Call", k, start, end);
    }

    private static int function(int k, int i)
    {
        for (int j = 0; j < J_LOOP; ++j)
        {
            // Do something to avoid removal
            if (j < i)
            {
                ++k;
            }
            else
            {
                --k;
            }
        }
        return k;
    }

    private static void displayResults(String desc, int k,
                                       long start, long end)
    {
        System.out.println("k = " + k + ", " + desc + ": " +
            (end - start) + "ms");
    }
}
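Since ApiImpl is omitted from the listing, readers who want to run the driver need some implementation of it. The following is only a guess at what such a class could look like, not the author's code. It assumes that function() repeats the same loop as Driver.function(), that startServer() creates an in-process registry on the default port 1099 and binds the object under the name "API" (the name the driver looks up), and it adds a hypothetical stopServer() helper that the driver never calls. IApi is repeated, without the public modifier, only so the sketch compiles as a single file.

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Same contract as the IApi interface above, repeated so this file stands alone.
interface IApi extends Remote {
    int function(int k, int i) throws RemoteException;
}

// Hypothetical stand-in for the ApiImpl class that the listing leaves out.
class ApiImpl extends UnicastRemoteObject implements IApi {
    private static final int J_LOOP = 7; // assumed to mirror Driver.J_LOOP
    private static Registry registry;
    private static ApiImpl instance;

    ApiImpl() throws RemoteException {
        super(); // exports this object on an anonymous port
    }

    // Assumption: the remote method does the same work as Driver.function().
    @Override
    public int function(int k, int i) throws RemoteException {
        for (int j = 0; j < J_LOOP; ++j) {
            if (j < i) {
                ++k;
            } else {
                --k;
            }
        }
        return k;
    }

    // Hosts the registry in this process and binds the object as "API".
    static synchronized void startServer() {
        try {
            // Advertise the loopback address so stubs always connect locally.
            System.setProperty("java.rmi.server.hostname", "127.0.0.1");
            registry = LocateRegistry.createRegistry(Registry.REGISTRY_PORT);
            instance = new ApiImpl();
            Naming.rebind("API", instance);
        } catch (Exception e) {
            throw new IllegalStateException("Could not start RMI server", e);
        }
    }

    // Hypothetical helper (not used by the driver): unexports everything
    // so the JVM can exit without calling System.exit().
    static synchronized void stopServer() {
        try {
            Naming.unbind("API");
            UnicastRemoteObject.unexportObject(instance, true);
            UnicastRemoteObject.unexportObject(registry, true);
        } catch (Exception e) {
            throw new IllegalStateException("Could not stop RMI server", e);
        }
    }
}
```

Keep in mind that the timings reported next come from the original, unshown ApiImpl, not from this sketch, so numbers obtained with it may differ.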
When executed on my laptop, this code returned the following results.

C:\Prg\Projects\Publications\Books\EEJ\code>java Driver
k = 34999944, No Function Call: 340ms
k = 34999944, Function Call: 271ms
k = 34999944, No Function Call: 190ms
k = 34999944, Function Call: 351ms
k = 34999944, API Call: 437559ms

As you can see, there's a significant difference between the No Function Call and Function Call times on one side and the remote API Call time on the other: more than seven minutes versus roughly a third of a second. And worse, this experiment doesn't even involve the wire in any way. We're just measuring the cost of marshaling the call and moving it down to the loopback adapter in the TCP/IP stack.

(Curiously, the Function Call time was worse the second time around, whereas the inline No Function Call time got better, probably due to JIT hot-spot inlining. I can't honestly explain why the Function Call times got worse; perhaps JIT compilation of the function call code actually hurt more than it helped, possibly due to interpreted-to-native transition boundaries. Fortunately, we're talking about a difference of 0.08 seconds over 5 million calls (35 million loop iterations), so it's probably not something to worry about.)

To put this into perspective, imagine for a moment that you're hungry: you want a sandwich. You go to the refrigerator, and you discover that you're completely out of everything you need to make a sandwich: no bread, no mustard, no ham, no lettuce, no tomato, nothing. So you figure you'll head down to the grocery store (which just happens to be our baseline "local method call") and get the stuff. It takes you twenty minutes to get your keys, grab some cash, drive to the store, park the car, go inside, shop, go back outside, unlock the car, drive home, go inside, unpack the stuff, and spread it out on the table, ready to go. Now imagine that instead of going to the grocery store down the block, you hear of a really good deli your buddy uses (the "remote method call").
If it takes you twenty minutes to go to the local grocery store, and a remote call takes three orders of magnitude (or more) longer to execute, you're spending roughly the same amount of time in transit as it takes you to travel to Pluto. (That had better be a really, really good deli.) And Heaven help us if the deli wants to sell us only one item at a time or we forget our list of ingredients and have to keep going back for each item, one at a time (see Item 18).

Where does all the cost come from? A variety of things, as it turns out: marshaling the parameters into wire-friendly format from their in-memory representation (which is pretty minimal, in this case, since all we're really passing across the boundary is isomorphic types like int, which require no work to transmit), as well as passing the marshaled data down the TCP/IP stack to the localhost loopback part of the TCP/IP adapter (which then turns around and immediately passes it back up the stack), and vice versa for the response on the other side. Where exactly the time is lost is not significant; what matters is that, in the aggregate, it's hideously time-expensive to move data across the network.

This, along with the "identity breeds contention" problem (see Item 5), is what kills most distributed object systems: good object-oriented design encourages small, atomic objects that focus on doing one thing well and defer all other requests to other objects via method calls, which in distributed object systems usually means network access. Imagine, for a moment, what happens to your performance if you combine the Visitor pattern [GOF, 331] with remote method calls. With each method call between the objects traveling across the wire, and each traversal step requiring at least two method calls, you're looking at a significant amount of time spent just going back and forth between remote objects. Ouch.
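To make the Visitor scenario concrete, here is a back-of-the-envelope calculation based on the timings reported above. It divides the measured totals by the 5,000,000 calls made in each loop, giving roughly 70 nanoseconds per local call and roughly 87.5 microseconds per loopback RMI call, a ratio of about 1,250. The traversal size (10,000 objects, two calls each) is an assumption chosen purely for illustration, as are the class and method names.

```java
// Rough per-call cost estimate derived from the timings reported in the text.
class RemoteCallEstimate {
    static final long CALLS = 5_000_000L;   // I_LOOP in the Driver
    static final long LOCAL_MS = 351L;      // second "Function Call" run
    static final long REMOTE_MS = 437_559L; // "API Call" run

    // Average cost of one call, in nanoseconds.
    static double nanosPerCall(long totalMillis, long calls) {
        return totalMillis * 1_000_000.0 / calls;
    }

    // How many times slower the remote call was than the local one.
    static double slowdown() {
        return (double) REMOTE_MS / LOCAL_MS;
    }

    // Estimated milliseconds to visit `nodes` objects at `callsPerNode` calls each.
    static double traversalMillis(long nodes, int callsPerNode, double nanosPerCall) {
        return nodes * callsPerNode * nanosPerCall / 1_000_000.0;
    }

    public static void main(String[] args) {
        double local = nanosPerCall(LOCAL_MS, CALLS);
        double remote = nanosPerCall(REMOTE_MS, CALLS);
        System.out.printf("local call:  %.1f ns%n", local);
        System.out.printf("remote call: %.1f ns%n", remote);
        System.out.printf("slowdown:    %.0fx%n", slowdown());

        // Assumed workload: 10,000 visited objects, two calls per object.
        System.out.printf("local traversal:  %.2f ms%n", traversalMillis(10_000, 2, local));
        System.out.printf("remote traversal: %.2f ms%n", traversalMillis(10_000, 2, remote));
    }
}
```

Under those assumptions the same traversal goes from about a millisecond and a half in memory to about a second and three quarters over loopback RMI, before any real network latency is added.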
This isn't just an RPC thing, by the way. Any technology that involves moving data across machines (or even just across process boundaries) has to go through the same kinds of gyrations that the RPC toolkits do. JDBC, JMS, HTTP: all of them spend a certain amount of time turning objects into 1s and 0s, and all of them are still limited by the speed (or lack thereof) of the underlying wire.

For these reasons, make each trip to Pluto count. Make sure you pass data in bulk when you can (see Item 23), consider moving data closer to the processors or vice versa (see Item 4), and even think about taking the time to write smarter RPC proxies (see Item 24). Anything you can do to minimize the amount of time spent on the wire, as long as it doesn't interfere with the overall goals of middleware in general, will pay off in better performance.
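As a small illustration of "pass data in bulk," the sketch below contrasts a chatty interface with a bulk one by counting round trips. Everything in it is hypothetical: Deli is an in-process stand-in for a remote object, and fetch(), fetchAll(), and roundTrips() are invented names. No network is involved; the counter simply marks where a real remote call would pay the per-trip cost discussed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a remote object; each method call models one round trip.
class Deli {
    private int roundTrips = 0;

    // Chatty style: one trip per ingredient.
    String fetch(String item) {
        roundTrips++;
        return "1 x " + item;
    }

    // Bulk style: the whole shopping list travels in a single trip.
    List<String> fetchAll(List<String> items) {
        roundTrips++;
        List<String> bought = new ArrayList<>();
        for (String item : items) {
            bought.add("1 x " + item);
        }
        return bought;
    }

    int roundTrips() {
        return roundTrips;
    }
}

class BulkVersusChatty {
    public static void main(String[] args) {
        List<String> list = Arrays.asList("bread", "mustard", "ham", "lettuce", "tomato");

        Deli chatty = new Deli();
        for (String item : list) {
            chatty.fetch(item); // five separate trips
        }

        Deli bulk = new Deli();
        bulk.fetchAll(list); // one trip for the same five items

        System.out.println("chatty round trips: " + chatty.roundTrips());
        System.out.println("bulk round trips:   " + bulk.roundTrips());
    }
}
```

Both styles return the same five items, but the chatty version pays the trip cost five times. With a real remote object, that multiplier applies to every marshaling step and every pass through the network stack.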