Item 37: Replicate resources when possible to avoid lock regions




Imagine, for just a moment, that you are a kindergarten teacher. Twenty kids, all five years old or so, are under your care for what feels like forever but in fact is just four hours a day. It comes time for recess, and you take them out to the playground to burn off some energy. Little Johnny grabs the jump rope and starts skipping. Naturally, children being what they are, nineteen other voices cry "No fair!" and all want to start using the lone jump rope, all at the same time. Your first inclination is to tell each of them to take turns. After all, that's what you're supposed to do as a teacher: teach children to be well-behaved citizens of the world, starting with lessons on how to share.

Whoever suggested this idea has obviously never had to try to keep nineteen five-year-olds in a single line for anything longer than two minutes.

Here's the problem: assume that each child gets to jump rope for one minute, then has to hand the jump rope over to the next child in line and get back to the end of the line. This means that for nineteen minutes (assuming each one is perfectly willing to surrender the rope at the end of his or her turn, not likely with five-year-olds), each child is doing absolutely nothing. This translates into one minute of activity for every twenty minutes, and that doesn't translate into much energy being burned off.

Solutions? One approach would be to force each child to jump faster, restricting each one to thirty seconds per turn, but all this does is speed up the churn in the line—it's still a ratio of one minute of activity per twenty-minute segment. You might instead choose to count the child's turn by the number of jumps, but this penalizes the fast jumpers, since the slow jumpers thus get more time with the rope and create even more inactivity for the fast jumpers.

Ask any kindergarten teacher how to solve this problem, and they'll tell you: buy more jump ropes.

The central problem with the single resource is that frequently these resources create points of contention that have to be managed using synchronization constructs (either Java object monitors or database locks or whatever). A single point of contention creates a system that won't scale because additional hardware will just introduce additional clients that are competing for that single point of contention; in other words, the lone jump rope situation only gets worse when we try to scale the kindergarten classroom up from twenty students to two hundred or two thousand.

So, remembering Item 21, we start looking for ways to partition the resource up, this time to reduce load on the system by removing the point of contention.

In the simplest cases, this is just a matter of choosing to create duplicate Java objects, rather than trying to force all processing through a single one; for example, imagine a servlet/JSP that has to do some date formatting. It might be tempting to pool your object (which is itself an Inherently Bad Idea, as described in Item 72) and create just one SimpleDateFormat instance used by all callers against the servlet, but you'll quickly discover that SimpleDateFormat isn't thread-safe and must be synchronized externally.

Stop right here and take a step back. In this case, whatever you save by keeping a single resource (the SimpleDateFormat instance) will be far outstripped by the cost of having to acquire and release an object monitor on every call. Replicating this, by creating a SimpleDateFormat on each incoming HTTP request, will allow you to eliminate the point of contention and let each request thread continue onward at its own pace. In particular, this is an easy win, since the SimpleDateFormat itself has no identity (see Item 5), per se, just the state given to it at construction to indicate what format it should have.
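To make the difference concrete, here is a minimal sketch of the two approaches side by side. The class name, method names, and date pattern are hypothetical, chosen only for illustration; the fixed locale and time zone are there simply to keep the output predictable.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class RequestDateFormatting {
    // Hypothetical pattern, chosen only for illustration.
    private static final String PATTERN = "yyyy-MM-dd HH:mm:ss";

    // Single shared instance: every caller has to queue on its monitor.
    private static final SimpleDateFormat SHARED = newFormat();

    static String formatShared(Date date) {
        synchronized (SHARED) {
            return SHARED.format(date);
        }
    }

    // Replicated: each call (think: each HTTP request) builds its own
    // formatter, so there is nothing to lock.
    static String formatReplicated(Date date) {
        return newFormat().format(date);
    }

    private static SimpleDateFormat newFormat() {
        SimpleDateFormat format = new SimpleDateFormat(PATTERN, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // fixed zone keeps the output predictable
        return format;
    }
}
```

Both methods return the same string for the same Date; the only difference is that formatShared makes every thread wait its turn for the one jump rope, while formatReplicated hands each caller its own. If constructing a formatter per call ever shows up in a profile, a per-thread copy held in a ThreadLocal is one possible middle ground.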

We start running into complications with replicated resources when we start thinking about replicating identity-bound resources, most notably the database (or extensions thereof, like entity beans or persistent data objects as in JDO). Once again we run into the problem of update propagation, introduced in Item 21: if we update the copy of the Person record held on machine A, we need to make sure that every copy of that same Person record gets updated on every other machine in the cluster, or else clients reading this same Person record could conceivably get different results, thus ruining consistency. The problem is that synchronizing this update propagation effectively means blocking not only all of the threads running in this JVM but also all of the threads running in any JVM on any machine, since the lock will have to be cluster-wide. It will be as if there's just one jump rope for the entire school, not just one per classroom.

Returning to prior art for a moment, consider again the problem of DNS. We don't want to have to go back to a single server for every DNS record we need to look up, so DNS clients routinely cache local copies of the DNS records they've retrieved. These are obviously identity-bound elements: if I change the IP address of neward.net, there's exactly one "real" record that holds the original fact, namely, the one on my DNS server. If clients cache the record, how will they know when to retrieve the new data?

In this particular case, DNS effectively states that it can afford a certain latency and assigns each DNS record a time-to-live (TTL) value. This value indicates how long a cached copy of the DNS record may be treated as good, and it's the clients' collective responsibility to keep track of this value and "go back to the source" when this time is up to ensure they're working with the latest-and-greatest values. In this situation, the latency is acceptable: I can probably keep the old server running in parallel to the new server (presumably keeping the two servers identical in content and functionality) until I know the DNS records have propagated throughout the world's collection of DNS caches.

Ironically, given our concern here with synchronization and lock windows, since DNS data is typically read-only or read-mostly, we probably could treat it as immutable for all intents and purposes (see Item 38), which again helps facilitate its replication. In fact, there's almost no reason why read-only and read-mostly data of all sorts can't be replicated across the network because if there's no need for updates, there's no update propagation problem and therefore no reason not to replicate.

We can extend the DNS approach to a more generalized idea of leasing data from the centralized database. In essence, we'll "borrow" data from the central database by taking a TTL value along with the data itself and storing this value with the data when we copy it off into a localized copy of the schema. Then, when working with the data, if a comparison of the TTL value and the current time indicates the data has expired, we go back to the centralized database for a refresh. Again, it's not as useful for absolutely-must-be-accurate data like bank account balances and such, but within many systems it turns out that a fair amount of data can afford this degree of latency.
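The leasing idea can be sketched in a few lines. Everything named here is an assumption made for illustration: LeasedCache is a hypothetical class, the centralStore function stands in for a query against the central database, and an in-memory map stands in for the localized copy of the schema. For simplicity the sketch applies one fixed TTL to every entry, whereas the approach described above takes the TTL from the central database along with the data.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

final class LeasedCache<K, V> {
    // A local copy of a value plus the moment its lease runs out.
    private static final class Lease<V> {
        final V value;
        final long expiresAtMillis;

        Lease(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Lease<V>> localCopies = new ConcurrentHashMap<>();
    private final Function<K, V> centralStore; // stand-in for the central database
    private final long ttlMillis;              // assumption: one TTL for all entries
    private final LongSupplier clock;          // injectable so expiry can be exercised in tests

    LeasedCache(Function<K, V> centralStore, long ttlMillis, LongSupplier clock) {
        this.centralStore = centralStore;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    V get(K key) {
        long now = clock.getAsLong();
        Lease<V> lease = localCopies.get(key);
        if (lease == null || now >= lease.expiresAtMillis) {
            // Lease missing or expired: go back to the source and start a new lease.
            lease = new Lease<>(centralStore.apply(key), now + ttlMillis);
            localCopies.put(key, lease);
        }
        return lease.value;
    }
}
```

In production code the clock would typically be System::currentTimeMillis. Note that two threads that notice an expired lease at the same moment will both refresh it; that costs a redundant trip to the central store but needs no lock, which is in keeping with the point of this item.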

In fact, in some cases, you can combine the local database with optimistic concurrency (see Item 33) to run your application entirely off of the local database, only pushing data to and from the centralized database at well-known synchronization points (nightly, hourly, whenever the network is detected, and so on). Again, however, you need to make sure that the application can handle the inevitability of a concurrency problem, such as the price of an item having been changed in the central database on Monday, yet the order was placed on Tuesday with Sunday's prices because we hadn't yet synchronized. (What happens in this case is a business decision, by the way, not a technical one—don't make any assumptions about what should happen, or you'll likely end up with some very unhappy users and/or customers.)
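One common way to detect this kind of conflict at a synchronization point is a version counter on each row. The sketch below assumes that scheme; CentralCatalog, Price, and SyncResult are hypothetical names, and an in-memory map stands in for the central database. Notice that the code only reports the stale price and deliberately does not decide what to do about it.

```java
import java.util.HashMap;
import java.util.Map;

final class CentralCatalog {
    enum SyncResult { ACCEPTED, STALE_PRICE }

    // A price plus the version counter used for the optimistic check.
    static final class Price {
        final long cents;
        final int version;

        Price(long cents, int version) {
            this.cents = cents;
            this.version = version;
        }
    }

    private final Map<String, Price> prices = new HashMap<>(); // stand-in for the central database

    // A central price change bumps the version counter.
    synchronized void setPrice(String sku, long cents) {
        Price old = prices.get(sku);
        prices.put(sku, new Price(cents, old == null ? 1 : old.version + 1));
    }

    // What a local database copies down at a synchronization point.
    synchronized Price snapshot(String sku) {
        return prices.get(sku);
    }

    // Called when a locally placed order is pushed up at the next
    // synchronization point. A version mismatch means the order was
    // priced against data that has since changed.
    synchronized SyncResult submitOrder(String sku, Price priceSeenLocally) {
        Price current = prices.get(sku);
        if (current == null || current.version != priceSeenLocally.version) {
            return SyncResult.STALE_PRICE;
        }
        return SyncResult.ACCEPTED;
    }
}
```

An order placed against Sunday's snapshot comes back as STALE_PRICE if the central price was changed on Monday. Whether the caller then honors the old price, reprices the order, or rejects it is exactly the business decision mentioned above.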

Replication isn't going to solve every one of your scalability problems, but when carefully applied, it can release a great deal of tension against a single resource. The key to successful replication, in the end, is to make sure that the resources you replicate are identity-less because as soon as unique objects are replicated, we get into consistency issues that have no good all-purpose solutions.