Item 8: Define your performance and scalability goals




An ancient proverb holds that the journey of a thousand miles begins with a single step. That's not precisely true. The journey of a thousand miles begins with a single step and a destination; otherwise, it becomes a journey of two or three thousand miles, assuming it ever actually ends.

Developers are fastidious about nailing down exact requirements with customers and analysts for the features and functionality of a given application. Field validation, use cases, class structure diagrams: all of these and more are painstakingly ironed out into a document that stretches hundreds of pages, yet nobody ever stops to ask, "Exactly how fast should this thing be?" That omission leads to one of the classic user complaints about a painstakingly architected application: "It's too slow." This of course brings back the immediate response, "Well, buy faster hardware."

The conversation deteriorates pretty quickly at that point.

The problem is that we developers need to know precisely what target we're aiming at if we're expected to hit it with any degree of success. Just as we need to know what the features must look like, how the pages must flow, and how the shopping cart must behave in the event of a VIP customer placing a $150 order over a holiday weekend, we need to know how fast the application has to be if we're to meet user expectations.

Unfortunately, this is easier said than done. Users speak in plain, simple terms: "It needs to be faster. It's too slow. It takes forever." As sympathetic as we may be to these sorts of reactions (c'mon, we've all felt the same way, sitting there impatiently at the Web browser, waiting for Amazon.com or some other Web site to finally respond), they don't give us much to go on. What we need is something quantifiable, rather than just gut-level intuition that, more often than not, doesn't exactly match up with what our users say they want.

What makes it worse is that the very things we need to identify and quantify are notoriously difficult to nail down. Terms like performance, scalability, and capacity get tossed around together in the marketing bowl to produce a stew that's wonderfully attractive but hideously difficult to identify. Popular myth holds that the terms performance and scalability, if not precisely synonymous, are close enough that improving one will quite naturally improve the other. What's good for performance must be good for scalability as well, right?

Not only do performance and scalability mean two very different things, but improvements to one often hurt the other. Other books may use different terminology, but for the purposes of this book, we'll define these and related terms as follows.

  • Performance: Performance, to put it simply, is how quickly the system can respond to a given logical operation from a given individual user. If it takes so long for each page of the application to load when a user is ordering a product online that he or she gets frustrated and purchases the product from a competitor's site instead, obviously the performance of the application is less than optimal. The goal of a well-performing architecture is to achieve lower latency (the amount of time the system requires to respond to a user's request) so as to keep usability and the user's interest in the application high.

  • Scalability: If performance measures the responsiveness of a system for a single user scenario, a "vertical" measurement, then scalability represents its polar opposite: the responsiveness of the system as more and more users enter the system concurrently. The goal of a scalable architecture is to achieve higher throughput (the number of logical operations that the system can process within a specific period of time: operations per second, if you will) as client demand grows, simply by taking advantage of additional hardware without redesign.

  • Response time: This is a measure of the amount of time the system consumes while processing a user request. While frequently applied to the time consumed to respond to a user interface action (clicking a button, selecting a menu item, opening a window, and so forth), response time can also be used in a more granular fashion, as the amount of time consumed by a particular API call or system action (such as processing a SQL call). Response time is made up of three things: latency, wait time, and service time.

  • Latency: This is the amount of time spent processing overhead just to get to the point of carrying out a business action—it is the overhead associated with the system as a whole. Systems with high latency frequently fail because too much time is spent processing overhead and not enough processing actual work.

  • Wait time: The time spent waiting for the server, or, once the server is executing, the time spent waiting for resources.

  • Service time: The time needed to process the request when no waiting is involved.

  • Throughput: This term describes a measurement of how much work can be done for a given period of time, such as transactions per second or bytes per second. Measuring throughput is typically a business-domain action because the concept of a business transaction can vary wildly among systems; most often, we discuss it in more abstract terms, simply as "you can achieve higher throughput by minimizing contention."

  • Load: The current volume of work on the system, its load, can be measured in a variety of units, from coarse-grained (number of users currently using the system) to fine (bytes of memory consumed, CPU cycles used per second, and so on). Typically, load isn't measured in a vacuum; it is used as a baseline against which to measure some other statistic, such as throughput or response time: "When we see the server CPU running at anything less than 80% load, we get a response time of around 2 or 3 seconds; when it gets higher than that, however, the response time degrades pretty rapidly, to around 30 or 40 seconds at 90% load."

  • Concurrent load: This term describes the load at any given moment. Load, for example, can be the total number of users a given system can support, whereas concurrent load describes the number of users that can be supported at a given moment. A servlet engine, for example, may have enough memory to hold 10,000 user sessions in memory before crumpling (load) but may accept only up to 2,000 simultaneous network connections (concurrent load).

  • Capacity: This is the total load and/or concurrent load a given machine/system can handle before being "maxed out."
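The relationship among these terms can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name, method names, and all of the sample numbers are hypothetical and are not measurements from any real system. It shows response time as the sum of latency, wait time, and service time, and throughput as completed operations per second.

```java
import java.time.Duration;

// Hypothetical illustration: the names and numbers below are assumptions.
final class ResponseTimeSketch {

    /** Response time is made up of latency, wait time, and service time. */
    static Duration responseTime(Duration latency, Duration waitTime, Duration serviceTime) {
        return latency.plus(waitTime).plus(serviceTime);
    }

    /** Throughput: operations completed per second over an observation window. */
    static double throughputPerSecond(long completedOperations, Duration window) {
        double seconds = window.toNanos() / 1_000_000_000.0;
        return completedOperations / seconds;
    }

    public static void main(String[] args) {
        Duration total = responseTime(
                Duration.ofMillis(20),   // latency: overhead before business work starts
                Duration.ofMillis(130),  // wait time: queued for a thread, lock, or connection
                Duration.ofMillis(50));  // service time: processing with no waiting involved
        System.out.println("response time: " + total.toMillis() + " ms");
        System.out.println("throughput: "
                + throughputPerSecond(1_200, Duration.ofSeconds(60)) + " ops/s");
    }
}
```

In this made-up breakdown, most of the response time is wait time rather than service time, which is the pattern the rest of this item describes as a scalability problem in disguise.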

Frequently, optimizing a system for either performance or scalability will in turn help optimize for the other. For example, minimizing contention for resources will not only improve the application's scalability but also lower its latency (since now the system spends less time waiting for locks to be released). However, it's also possible to add enhancements to the system that benefit performance at the expense of scalability or vice versa.

For example, a developer may look at the current performance of a system and decide that it's running too slowly. He decides that the system is making too many trips to the database and that the solution is to cache data in the application server so as to eliminate network I/O if he can. (His first mistake is making this decision without consulting a profiler first; see Item 10.) So he writes a generalized caching mechanism, spends a month fine-tuning the cache algorithm, and starts caching data in the application server. Cue credits, we all go home happy, right?

Unfortunately, no; the story's not that simple. Because many people access the system all at once, the cache needs to be somewhat "global" in order to hold data for all of these users. And if the data has changed, we need to hold all the users at bay while we update the cache with the new data. The developer has perhaps improved performance by adding the cache but has also introduced a new contention point (access to the cache for data) and therefore hurt scalability. In some extreme cases, depending on usage and the synchronization policy of the cache lock, the time saved by removing the I/O trip is more than offset by the time lost waiting on the cache's synchronization lock. Even worse, this cache-based implementation doesn't scale well; if the system later moves to run on clustered application servers, the cache must now be replicated across all of the nodes in the cluster, meaning that the system now takes both the latency hit of an additional I/O across the network (to check against the global cache) and the contention hit of the global cache lock.

Even if the cache has quite an efficient locking mechanism, such that the cost of checking the cache is zero (which will never happen, by the way), scalability is still affected: the cache now occupies memory on the machine, which in turn reduces the total number of concurrent clients that machine can process. The larger the cache, the fewer clients we can support; the smaller the cache, the less effective it becomes. (Sounds like the cache size should probably be hot-configurable; see Item 13.)
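To make the contention point concrete, here is a minimal sketch of such a shared cache in Java. It is not the implementation the developer in the story wrote; the class name `GlobalCache`, the `loader` function standing in for the database round trip, and the choice of a read/write lock are all assumptions made for illustration. Other synchronization policies would shift the cost around but not eliminate it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

// Hypothetical sketch of an application-server cache shared by all users.
final class GlobalCache<K, V> {
    private final Map<K, V> entries = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Function<K, V> loader; // assumed stand-in for the database trip

    GlobalCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        // Readers share the lock, but every reader still has to acquire it.
        lock.readLock().lock();
        try {
            V cached = entries.get(key);
            if (cached != null) {
                return cached;
            }
        } finally {
            lock.readLock().unlock();
        }
        // Cache miss: the write lock holds every other user at bay
        // until the (slow) load has finished.
        lock.writeLock().lock();
        try {
            // Re-check under the write lock; another thread may have loaded it.
            return entries.computeIfAbsent(key, loader);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Called when the underlying data changes; also blocks all readers. */
    void invalidate(K key) {
        lock.writeLock().lock();
        try {
            entries.remove(key);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        GlobalCache<String, String> cache = new GlobalCache<>(key -> "row for " + key);
        System.out.println(cache.get("customer-42")); // first call pays for the load
        System.out.println(cache.get("customer-42")); // second call is served from memory
    }
}
```

A single user sees faster responses after the first load, yet every miss and every invalidation serializes all users behind one lock, and the `entries` map keeps growing in memory. Both effects are exactly the scalability costs described above.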

To be quite honest, many of the performance problems in enterprise systems aren't, in fact, performance problems; they're scalability problems. The system performs poorly because it's blocked waiting for access to some shared resource that everybody else needs, too. If the system does application-level audit logging to track users' actions in the system, and the developer simply uses the default isolation level (see Item 35 for details) when adding rows to the audit-log table, the database is taking out reader/writer locks on the table, meaning other writers are held at the gate, waiting for the one writer to finish before any others can get started. If we remove the unnecessary locking, the database can add rows to the table much faster, thereby reducing the overall latency of the application.

The tradeoff in the caching example was one of performance for scalability. If the goal of the system is to minimize the response time for each particular user, even if that in turn reduces the capacity of the system, this was a successful step. If, however, the system needs to support as many concurrent users as possible, regardless of the response time for a particular user,[4] this was a terrible step to take and hurt the application's overall ability to meet its goals. So, in the final accounting, was this a successful optimization? That all depends on what the goals of the system are, and if those goals are never stated outright, as developers we'll never know which decisions to make.

[4] As with many things, most enterprise systems find themselves somewhere between these two extremes.

At a practical level, this means a couple of things. First, make sure to optimize optimally (see Item 10) by ensuring you know which 20% of the operations and/or use cases your users are executing 80% of the time. Look for ways to ensure that the users' perception of the time the computer spends working is as short as possible. Note the peculiar phrasing of that last sentence; "the users' perception" is the key here: use techniques and tricks to reduce the amount of time the user is physically unable to move on to the next task. Use multiple threads, use direct database access approaches, use whatever seems appropriate to minimize the user's sense of waiting on the system. Find the hook points (see Item 6) that can provide the necessary optimizations. Or, in some cases, simply build a layer that moves the user out of the physical blocking call: spin off a thread, or post a message to a JMS Queue or Topic for further, asynchronous processing. Frequently, the solutions used here may not actually have an impact on real performance, but as long as the user feels the system is fast enough, that's often good enough.
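As one possible illustration of moving the user out of the blocking call, the sketch below hands slow work to a background thread pool and returns to the caller immediately. The class name `AsyncOrderSketch`, the method names, the pool size, and the simulated 200 ms delay are all assumptions; a JMS Queue or Topic would play the same role in a messaging-based design, but it is not shown here because it needs a provider outside the standard library.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: names, pool size, and delay are illustrative assumptions.
final class AsyncOrderSketch {
    private final ExecutorService background = Executors.newFixedThreadPool(4);

    /** Returns right away; the slow work continues on the background pool. */
    CompletableFuture<String> submitOrder(String orderId) {
        return CompletableFuture.supplyAsync(() -> process(orderId), background);
    }

    /** Stands in for a slow step such as payment or inventory processing. */
    private String process(String orderId) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "processed " + orderId;
    }

    void shutdown() {
        background.shutdown();
    }

    public static void main(String[] args) {
        AsyncOrderSketch orders = new AsyncOrderSketch();
        CompletableFuture<String> pending = orders.submitOrder("A-1001");
        // The user is unblocked here, before the order has been processed.
        System.out.println("order accepted; you can keep shopping");
        // Demo only: wait for the result so the program can exit cleanly.
        System.out.println(pending.join());
        orders.shutdown();
    }
}
```

The total work done is unchanged, and may even grow slightly with the thread handoff. What shrinks is the time the user spends unable to move on to the next task.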

Performance and scalability are two obvious elements of importance in any enterprise system. Treat them as you would any other client feature request: document precisely what's meant by goals like "fast" and "acceptable performance" either through hard numbers or (more likely) some kind of reference point mutually acceptable to both developers and users: "It should have a user interface at least as fast as the system we're replacing." While that takes care of performance, nail down in concrete terms the expected loads and concurrent loads on the system; how many users are expected to use it during a 24-hour period, during a 1-hour period, and so on? Knowing these details at the start of the project will enable you to make better decisions about when to make the inevitable performance-against-scalability tradeoffs. Most importantly, though, having these goals explicitly stated tells you when you can stop performance-tuning, and that's just as important as knowing when to start.
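One way to keep such goals from staying vague is to record them as data and check measurements against them. The following is a sketch only: the class `PerformanceGoals`, its fields, the use of a 95th-percentile response time, and every number in it are assumptions chosen for illustration, not targets recommended by this item.

```java
import java.util.Arrays;

// Hypothetical sketch: field names and thresholds are illustrative assumptions.
final class PerformanceGoals {
    final long maxP95ResponseMillis; // "fast" stated as a hard number
    final int peakConcurrentUsers;   // expected concurrent load
    final int usersPerDay;           // expected load over a 24-hour period

    PerformanceGoals(long maxP95ResponseMillis, int peakConcurrentUsers, int usersPerDay) {
        this.maxP95ResponseMillis = maxP95ResponseMillis;
        this.peakConcurrentUsers = peakConcurrentUsers;
        this.usersPerDay = usersPerDay;
    }

    /** Nearest-rank percentile of the observed response times. */
    static long percentile(long[] samplesMillis, double percentile) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile / 100.0 * sorted.length);
        // Clamp the rank into [1, length] before converting to an index.
        int index = Math.min(sorted.length, Math.max(1, rank)) - 1;
        return sorted[index];
    }

    /** True when measured response times satisfy the stated goal. */
    boolean responseTimeGoalMet(long[] samplesMillis) {
        return percentile(samplesMillis, 95.0) <= maxP95ResponseMillis;
    }

    public static void main(String[] args) {
        PerformanceGoals goals = new PerformanceGoals(3_000, 2_000, 10_000);
        long[] measured = {900, 1_200, 1_500, 2_100, 2_800};
        System.out.println("response time goal met: " + goals.responseTimeGoalMet(measured));
    }
}
```

A check like this gives the team an explicit stopping condition: once the measurements satisfy the agreed numbers under the agreed load, further tuning is optional rather than open-ended.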