Item 10: Never optimize without profiling first



Item 10: Never optimize without profiling first

The 80/20 rule, officially known as the Pareto Principle and one of the most famous (and most consistent) rules in computer science, states in its most common variation that 80% of a system's resources are used by about 20% of the code: 80% of the runtime is spent in 20% of the code, 80% of memory is used by some 20% of the code, and so on. The rule can be recursively applied to other aspects of the software development lifecycle as well—that 80% of the requirements document will cover only 20% of the necessary work, for example—but here the concern is focused on the 80/20 rule's application to software execution and engineering. This 80/20 rule has been repeatedly verified through examinations of countless machines, operating systems, applications, and enterprise systems; it's more than just a catchy phrase. The 80/20 rule is one of the few truisms about system performance that has both wide applicability and a solid empirical grounding.

Note that the numbers here are somewhat flexible, depending on your situation: your particular system may follow a split more like 90/10 or 70/30 for a given category than an 80/20 split. The point here isn't to focus on the actual numbers; the point is that the overall performance of your system is dependent on a much smaller percentage of the code than you might imagine at first.

As an enterprise system developer, striving to best enhance your system's performance and scalability, the 80/20 rule both simplifies and complicates your life. On the one hand, life is much simpler now—you can ignore most of the things you'd be tempted to do "to improve performance." Loop unrolling, using an array instead of a Collection class, manually inlining the code for a method call, all of these are optimizations that will yield benefits that are miniscule in nature; chances are 4 in 5 that this code isn't part of that core 20%. While it may feel strange to take that attitude, it will have the unintended benefit of keeping code somewhat clearer—more unreadable code is written in the name of performance than any other excuse. (After all, how many of us write crappy, illegible code just because we want to?)

Some people argue that the virtual machine is in a far better position to do that kind of optimization than the developer is because the JVM has the ability to perform method inlining, generate entirely different machine instructions based on usage patterns, and so on, but this is entirely JVM-specific and probably shouldn't be relied on if vendor neutrality is important to you and/or your organization (see Item 11).

The downside to the 80/20 rule, however, is that now it's up to us to find that 20% when performance problems do arise. You have to isolate and diagnose the 20% of the code that's being called 80% of the time, before you can tackle how to streamline it. What's the problem with this scenario? Most developers rely on their intuition (also known as "experience" or "the article I read the other day") to tell them where the bottlenecks and inefficiencies are. Developer after developer, when presented with a performance (or scalability) problem, will page randomly through screen after screen of code, then turn and with a solemn expression declare that the problem exists due to network delays, the JVM, the lack of aggressive optimization in the Java compiler, the operating system, or some bonehead manager's decision not to use EJB for this project. Quite frequently, the suggested solution centers around buying faster hardware, upgrading to the latest JDK, or rewriting the entire thing from scratch.

The problem with these pronouncements is that most developers' intuition sucks. Blind, random guesses often produce results similar to "experienced" developers' intuition. It's unfortunate, but the complex layers of interaction between the developer and the raw hardware mean that frequently the problem isn't where the developers' intuition leads them to believe it is. So if we can't trust our intuition, what can we trust?

Trust the system itself, of course; "Let the system tell you." Loosely translated, that means "run a profiler over the code."

Unfortunately, within the J2EE arena, this is easier said than done—Java profilers are barely mature, only really coming into usefulness in 2004. Worse yet, J2EE-based profiling is a difficult task, particularly since there are usually several machines involved. It's not quite impossible, just hard, to set up one Java profiler instance to run on the servlet container, another on the application server (if present) or any other Java-based middleware, and combine this with output from the database's own profiler tools and some network analysis tools to produce some good notions of where bottlenecks are occurring. With time, the J2EE profiler tools market will get stronger, and all-in-one profiling solutions will emerge to make this task simpler.

In the meantime, other tools offer much the same kind of data, and those tools are available right now.

First, and at the lowest level of the tools available, are the performance-monitoring tools available to your operating system, such as Performance Monitor (PerfMon) for Windows, or the /proc filesystem under Linux. While they can't tell you exactly how many Java objects are being created or how many classloads your JVM is being forced to undertake at startup, they can tell you equally important statistics like total memory footprint, the number of page swaps the process is going through as it uses the virtual memory mechanisms available to it, and the number of threads being spawned.

Always Check The Profiler First

A VisualWorks (Smalltalk) customer was very concerned about the performance of their system. They were concerned that all the horror stories they heard about Smalltalk being too slow for real systems were true. It seemed that the fundamental section of their system was taking minutes to do anything, and they needed it to take seconds. They couldn't figure out what was taking so long no matter how hard they tried.

In a last-ditch effort to figure out whether their work could be salvaged, Dave Leibs, Smalltalk performance guru from ParcPlace Systems, Inc., was called in for a day of consulting. The customer quickly described an overview of what they were trying to do. Nothing stood out to Dave as anything that would bring Smalltalk to its knees. So, he said,

"Where did the profiler indicate most of the time was being spent?"

"What's the profiler?"

So he spent the next 30 minutes or so explaining what the profiler was and helping them install it.

He then ran the profiler on the section of code in question and found it was spending more than 98% of its time in Date>>printOn:. Dave thought for a second what might be so slow about Date>>printOn: and felt it should not cause so great a problem. So, he looked a little higher up the profile tree and examined the containing method.

He noticed the expression 'Date today printString' inside of a loop. So he chose an Experiment and pulled the expression out of the loop, storing the value in a temporary variable that he then used inside the loop. Ten minutes after running the profiler for the first time, he ran the code again. This critical section of code that was central to the client's system now took less than a second.

Dave then turned to his hosts and said, "So, what are we going to do for the rest of the day?"[5]


[5] Quoted from an article written by Ken Auer of RoleModel Software and Kent Beck of Three Rivers Consulting; see http://www.rolemodelsoft.com/patterns/lazyopt.htm.

For example, under PerfMon you can create counters that track not only the number of threads being spawned by the process but also the number of context switches per thread and the amount of user, privileged (kernel), and processor time. This can be useful in tracking down threads blocked in native code inside the operating system, if your system happens to call into native code for some reason. The PerfMon Process counter allows you to track handle counts (a handle is roughly equivalent to a file descriptor in UNIX), thereby giving you a window into how many files and other kernel objects are being opened by the application server—a good way to track any "leaking" resources that aren't being finalized in a timely manner. The Process counter also tracks process Working Set and Working Set Peak sizes, giving you an operating system perspective as to how much memory the process consumes. This can be invaluable when choosing parameters to start the virtual machine itself (-Xms and –Xmx, for starters; see Item 68 for details). And the "Page faults/sec" statistic in the Process counter is also a good item to keep an eye on, since excessive numbers of page faults imply that your process is swapping to virtual memory a great deal and could be hurting performance terribly. (Determining what constitutes "excessive" numbers of page faults is always going to be a relative decision, however, and requires you to be vigilant about gathering a large statistical base before making any educated guesses about what's excessive and what's normal.)

PerfMon and most other operating system profilers also allow you to track these statistics over a long period of time, taking snapshots every x number of seconds, minutes, or hours, thus giving you the ability to examine the data over a long period of execution time. At the very least, set up these counters on your QA or development servers as soon as the application starts to get more routinely tested, so that by the time the application prepares for a production release, you can get a good idea of what's "normal" for this application. In addition, most profiling tools give you the ability to set alerts, so that if the application reaches beyond a certain threshold, system administrators can receive some kind of heads-up about a potential problem. For example, you can configure PerfMon to send a network message to a system administrator's machine whenever the thread count on the application server process or servlet container process broaches some magical value you set—typically a sign of either a malconfigured thread pool setting or an attempted denial-of-service attack that had better be dealt with pretty quickly. Best of all, this monitoring doesn't require any additional work on your part to use (see Item 12).

Don't ignore other operating system–specific tools, either. As much as Java and the JVM try to abstract away the underlying operating system, there are times when having a good view of what's going on underneath the JVM can be a powerful debugging and diagnostic tool. For example, a classic question that pops up on mailing lists is "When I run the servlet container as a standard process from the command line, everything works fine, but when I try to run it as a service under NT, nothing works. Now what?" The problem here is that most services, when installed, run as a lower-privileged account than you might expect (and this is a good thing—see Item 60), and as such don't have access to network file shares or may not have security rights to files on certain parts of the file system. Using ProcessExplorer or Filemon from Sysinternals.com can be a huge help here in diagnosing these kinds of problems. Similar tools exist in the Linux and other UNIX-based operating system spaces, either from the vendors themselves or from the open-source community—get to know them well before you need them.

Operating system profilers aren't your only option here, also—remember that in most enterprise IT applications, the other major player is the database itself, and keeping an eye on what's happening there can be just as instructive, if not more so, than watching the operating system process. Every major database ships with some kind of query analyzer or similar profiling and/or diagnostic tool, and it's definitely in your best interest to keep an eye on the database's execution behavior. Again, this falls into the "better to be comfortable with the tool before your project's life depends on it" category—far better to have profile traces running constantly on the production server before you get crushed with a burst load, so that any bottlenecks that might arise in such a situation can be identified quickly and dealt with.

The advantage of the database profiler tools is simple: they know only what the database knows and make no assumptions about what client tools are currently running. They can provide a powerful and useful view into what's being executed against your database—anything and everything ranging from SQL injection attack attempts (see Item 61) to badly generated CMP entity bean and/or JDO queries (see Item 50) will show up, along with the corresponding time required to execute each of those statements. It also serves as an important statistic when evaluating your application's execution performance—for example, if a given user request takes a full 60 seconds to execute in user time (from click to finished display results), knowing whether the database takes 5 seconds or 55 seconds makes a huge difference in where you start looking to optimize. More importantly, you can see exactly what's being executed against the database as part of that one user request—after all, it might not be any one particular query that's creating the bottleneck but the fact that you're running fifteen different "fast" queries, each from a different part of the code, that's creating the slowdown. And if you don't know what a full table scan is, somebody else on your team should; if not, it's time to crack open a good database profiling/optimization book.

Of course, nothing beats the ability to have a peek inside the JVM itself, and while this has historically been difficult due to a lack of a standard API across JVMs to provide profiling information—the Java Virtual Machine Profiler Interface (JVMPI), introduced in JDK 1.2.2, was never officially promoted beyond experimental status—two mitigating factors arise. First, JSR 174 (Monitoring and Management Specification for the JVM) will provide a number of Java Management Extensions (JMX) MBeans and SNMP MIBs over elements of the JVM itself, such as thread information, JIT compilation time, memory statistics, and other important JVM elements. Second, despite the current lack of a standardized API, a number of available commercial profilers provide most of the same information. Be careful with these commercial profilers, however—some of them require executing your Java code against their Sun-licensed-and-modified JVM implementation, which means that your execution could vary wildly on your production JVM—for example, the Sun 1.4.x JVM has seen some significant changes and enhancements just between 1.4.0 and 1.4.1. Other profilers gather their statistics by instrumenting your code (i.e., inserting additional bytecode into your compiled classes that gathers the profiling information desired), which can also throw off the profiling, in some cases significantly. A single-line method could be and usually is inlined by the JIT, but instrumentation inserted into that method could easily push the method out of the JIT compiler's heuristics for inlining, thus changing the performance of that method drastically.

Note that as part of JDK 1.5, Sun introduced a replacement for JVMPI and the debug equivalent—JVMDI—called the Java Virtual Machine Tools Interface (JVMTI), and this is expected to be the standard API for both profilers and debuggers. JDK 1.5 isn't scheduled to ship until late 2004, however, and even then, if it follows the same adoption patterns as its predecessors, it will be several years before it achieves widespread use.

Sun also makes a fairly low-level profiler tool available, called hprof, that will gather simple call-execution statistics over the lifetime of the JVM and dump them into a text file for you. While better than nothing, hprof output can be voluminous for J2EE applications, and it can be extremely difficult to sort out your application's execution path from that of the application server itself (assuming the application server is, as most are, written in Java). Still, it's better than making an "educated guess" or "intuitive decision."

You can always write your own "mini-profiler" against the JVMPI APIs directly, assuming you're comfortable with C++ and shared object libraries on your operating system. The JVMPI API is essentially a callback-driven approach, where on startup you have the opportunity to register for certain events, such as object creation and/or destruction, classload and/or unload, thread start, stop, and wait, and so on. (This is not a path recommended for the faint of heart, however—be prepared to invest substantial time if you do this. It does permit you to be as specific and unobtrusive as possible about capturing only the data you're interested in, which for time-critical measurements could be an issue, but again, tread this road carefully.)

Regardless of the tools used, make sure that the profiler has a fair chance to tell you what's going on within the system—that means ensuring that the profiler isn't being misled by the data being fed through the system. When profiling, it's absolutely critical to run the system under normal load conditions, if possible, even if those conditions are artificially induced. Tools can be used to mock up client Web-browsing sessions. If your application uses a rich front end (see Item 51), focus on creating mock sessions against the middleware by instantiating client-side proxies directly and calling against the middleware. Remember that a profiler cannot tell you how your program behaves in general—it can only tell you how the program behaved on a given set of runs. Make sure the data the profiler sees is representative so it can tell you what parts are being executed 80% of the time, leaving you to focus on how to optimize those parts. And once you've identified the parts that need optimization, use the hook points (see Item 6) to bypass normal processing to put in the necessary optimizations.