Sept. 21, 2009, 9:03 p.m.
posted by oxy
Item 7: Be robust in the face of failure

Every developer must squarely face the brutal fact that "stuff happens": not only will code have bugs, but databases will run out of disk space, routers will go down, the power will go out (and the UPS will expire before it can come back on), servers will be hacked, and "absolutely safe" operating system patches will turn out to be anything but. Notice the very carefully chosen verbiage here: not "can" fail, but "will" fail.

In some circles, the term defensive programming comes to mind: the idea of never assuming that callers will in fact call your methods correctly, so you assert on every parameter, validate every return value, and so on, assuming that every caller of your method is pathological and wants to break the code. I don't necessarily subscribe to that particular mind-set; I believe that most of the time, code within the component can be trusted to "follow the rules," but any call coming from outside the component (including user input in the case of a servlet/JSP) definitely needs to be validated six ways from Sunday before being passed on and processed.

But it's more than just asserting every parameter. At a microcosmic level, this means that when writing code, you need to think about all possible failure scenarios: what happens if the call to the EJB container throws a RemoteException, meaning the RMI plumbing had a problem satisfying the request, or the database throws a SQLException, or the parameters passed in to your session bean aren't within acceptable bounds? It's not just a matter of catching the exception and putting up a "something went wrong" message to the user; some kind of reasonable failure-recovery policy must be in place. For example, if the database throws a SQLException, is it because the SQL was malformed, or because the database didn't respond?
If it's the former, it's probably OK to just tell the user that something went wrong and to try again; if it's the latter, it might be better to put the system into a kind of panic mode until the database can be reached again. At the very least, the system administrators need to be notified that the database was out for some period of time.

Part of thinking about failure in code means handling Java exceptions correctly. First problem: we all know that any remote method called from an exported, remote object in Java RMI is capable of throwing a java.rmi.RemoteException; it's one of those things that RMI developers over the years have come to despise about RMI. It's actually a pity, because a tremendous amount of information comes bundled in a RemoteException, most of which is completely ignored when you write code like this:
try
{
    remoteObject.someRemoteMethod();
}
catch (RemoteException remEx)
{
    System.err.println("Error in calling RMI method!");
    // Beats the heck out of me what went wrong, but that's
    // OK, I logged it to the console, right? Besides, this
    // is just to keep the compiler happy, it's not like a
    // remote call will ever fail or anything...
}
When the RemoteException is thrown, all of the diagnostics carried as part of the exception are completely ignored. Was it a problem on the client, on the server, or in between? Was the object you tried to call suddenly inaccessible? Was it a problem in marshaling, indicating that somehow the stubs and skeletons are out of sync? Or is it perhaps that the server specified in the lookup doesn't exist or can't be found, probably a failure of the TCP/IP stack?

For example, a ConnectException indicates that the client could not connect to the remote host to reach the servant object (the connection was refused). Assuming the underlying TCP/IP stack is still good (you can ping the other machine), this usually indicates the servant object is no longer available, usually because of a crash. This is not to be confused with the NoSuchObjectException, which is most often thrown when the client holds an old reference to a servant object that no longer exists despite the stub's insistence that it should.

Catching the proper RMI exception can offer up a world of diagnostic information to the system administrator and/or support staff (that's often you, by the way) about exactly what went wrong and where. Take a harder look at the RMI exceptions defined in the various java.rmi.* packages next time you've got a few moments between marathon coding sessions, and write catch handlers that react appropriately to each kind of RMI exception. (Bear in mind, too, that if you're working with a vendor-proprietary protocol stack, like BEA's T3 protocol, there may be new and/or different exceptions thrown there, too, and you'll want, if not need, to take a look at those as well.)

It's not just RMI that suffers from this programmer sloppiness syndrome of "catch the base class exception," either. JDBC frequently sees developers catching just SQLException, ignoring the exception itself, and doing some kind of super-generic error handling, like writing to a log file.
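Going back to the RMI case for a moment: as a rough sketch of catch handlers that react to each kind of failure, the following sorts a few java.rmi exception types into coarse diagnoses. The RemoteCall interface, the Diagnosis categories, and the mapping between them are illustrative assumptions, not anything prescribed by RMI; only the exception classes themselves come from the JDK.

```java
import java.rmi.ConnectException;
import java.rmi.ConnectIOException;
import java.rmi.MarshalException;
import java.rmi.NoSuchObjectException;
import java.rmi.RemoteException;
import java.rmi.UnknownHostException;
import java.rmi.UnmarshalException;

/** Hypothetical helper that turns an RMI failure into a coarse diagnosis. */
final class RmiFailureTriage {

    /** Illustrative categories only; a real system would pick its own. */
    enum Diagnosis {
        OK, SERVER_UNREACHABLE, HOST_NOT_FOUND, STALE_REFERENCE, MARSHALING_PROBLEM, UNKNOWN
    }

    /** Hypothetical stand-in for "some remote method call". */
    @FunctionalInterface
    interface RemoteCall {
        void invoke() throws RemoteException;
    }

    static Diagnosis attempt(RemoteCall call) {
        try {
            call.invoke();
            return Diagnosis.OK;
        } catch (ConnectException | ConnectIOException e) {
            // Could not establish a connection: server down or not listening.
            return Diagnosis.SERVER_UNREACHABLE;
        } catch (UnknownHostException e) {
            // The host named in the remote reference could not be resolved.
            return Diagnosis.HOST_NOT_FOUND;
        } catch (NoSuchObjectException e) {
            // The stub refers to an object that is no longer exported.
            return Diagnosis.STALE_REFERENCE;
        } catch (MarshalException | UnmarshalException e) {
            // Arguments or results could not be (un)marshaled.
            return Diagnosis.MARSHALING_PROBLEM;
        } catch (RemoteException e) {
            // Last resort: still a remote failure, but nothing more specific.
            return Diagnosis.UNKNOWN;
        }
    }
}
```

A caller might then retry on SERVER_UNREACHABLE, look the object up again on STALE_REFERENCE, and page somebody on MARSHALING_PROBLEM; which reaction fits which diagnosis is a policy decision for the application.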
Once again, while the JDBC Specification itself doesn't define a large taxonomy of possible exception types like RMI does, it does note that vendors are encouraged to do so for their own purposes, and in fact many do. Alternatively, the SQLException type defines a place for vendor-specific product error codes, which can in turn offer up much greater detail about what just went wrong. In addition, the SQLException type defines a "next exception" property, allowing SQLException instances to chain on top of one another as the need arises. When's the last time you actually reported this information to anybody but a log file? Considering that many system administrators also know something about database products, it might not be a bad idea to have specific error-handling logic for dealing with the common problems expressed by vendor product error codes.

Oh, and by the way, when's the last time you checked for SQLWarning instances on a Connection, Statement, or ResultSet? Or do you, like 99.9% of the other JDBC programmers out there, simply ignore their existence?

Inside servlets and JSPs in particular, exception-handling policies become especially important, since the last thing you want is for your end users to see a stack trace when some unexpected error gets tossed out of your JSP. (This is important not only from a public-relations perspective but also from a security perspective: it's amazing how much information a single stack trace conveys about the architecture and general structure of a system, information that an attacker can put to effective use.)
This means that every one of your JSP pages should have its errorPage directive set, pointing either to a specific error page for that particular part of the application or else to a generic page that presents a message like "We're not sure what just happened, but we logged it, e-mailed the support staff, and automatically logged a $5 discount coupon in your name, so please don't hold it against us and try again, OK?" This also means that every one of your top-level servlets (i.e., servlets that were directly invoked by user actions, as opposed to servlets to which you chained from a different servlet) and filters must be wrapped in try/catch blocks that handle all possible exception scenarios. That means catching Throwable, by the way, not just ServletException. (J2EE 1.4, i.e., Servlet 2.4 and JSP 2.0, provides some container-managed error-handling facilities that can mitigate some of this; use them when you can.)

EJB, too, requires some careful exception-handling consideration. This time, however, it's not what goes on inside the bean that requires such sensitivity but how clients should react to exceptions thrown out of the bean. For example, we know that throwing an exception out of a Required-marked transactional bean method means that the transaction is implicitly rolled back, but what happens to the bean itself? Is the bean still good? Can we make further method calls on the bean to ascertain what, exactly, went wrong with that last call? The EJB Specification draws a sharp distinction between application exceptions (those exceptions that are domain-specific and inherit from neither RuntimeException nor RemoteException) and system exceptions (those that represent errors at a level below that of the application domain itself, such as underlying problems connecting to the database). It also defines a new exception type, the EJBException class, which inherits from RuntimeException and stands as a kind of system exception wrapper.
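The application/system split just described can be sketched as a simple classification rule. This is only an illustration of the definition using JDK types: the EjbExceptionKinds class, the Kind enum, and the InsufficientFundsException domain exception are hypothetical names, and treating Error as a system-level failure is an assumption made here for completeness.

```java
import java.rmi.RemoteException;

/** Hypothetical illustration of the application vs. system exception split. */
final class EjbExceptionKinds {

    enum Kind { APPLICATION, SYSTEM }

    /** Hypothetical domain exception: checked, and not a RemoteException. */
    static class InsufficientFundsException extends Exception {
        InsufficientFundsException(String message) {
            super(message);
        }
    }

    static Kind classify(Throwable t) {
        // RuntimeException and RemoteException (and their subclasses) are the
        // system side of the line; Error is lumped in here as an assumption.
        if (t instanceof RuntimeException
                || t instanceof RemoteException
                || t instanceof Error) {
            return Kind.SYSTEM;
        }
        // Everything else is a checked, domain-specific exception.
        return Kind.APPLICATION;
    }
}
```

Under this rule a NullPointerException or a RemoteException lands on the system side, while InsufficientFundsException is an application exception. A low-level checked exception such as SQLException would also pass the "application" test by type alone, which is why beans are expected to wrap such failures in EJBException rather than let them escape as they are.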
When an exception is allowed to leave the call generated by an EJB interceptor, the EJB container takes some drastically different action based on the kind of exception thrown. When an application exception is thrown out of a transactional method, the container figures that the client needs to see the actual domain invariant that was violated and rethrows the exception back to the client. In this situation, the transaction itself is left alone, giving the client a chance to recover from the error scenario. In the case of a system exception, the container is not nearly so forgiving: since the state of the bean itself is no longer certain (after all, an unexpected NullPointerException does wonders to reduce the stability of your code), the container rolls back the transaction and marks the bean instance as bad, thus forcing the container to discard it entirely. Now, depending on whether the caller is the "root caller" of the causality, the client will see either a returned TransactionRolledbackException or TransactionRolledbackLocalException, or else a RemoteException or EJBException.

The situation gets even more interesting when we consider the Web Service endpoint behavior as described by EJB 2.1, since now we can't even pass an exception object back across the wire. Instead, the client's going to have to make do with a standard SOAP:Fault code, which can be pretty unspecific unless you know what's coming back and code your client accordingly. (By the way, some J2EE books mention that you need to force your application exception types to implement java.io.Serializable because these exceptions will be carried across the network; as it turns out, you don't need to worry about this. The base Throwable class, ultimate ancestor of anything throwable in general, already implements the Serializable interface.)

But it's more than just thinking about code.
Enterprise applications have much more stringent uptime requirements than just about any other form of software (with obvious exceptions, like embedded software controllers in airplanes and nuclear reactors, for example). At an architectural level, we need to think about failure scenarios at a more macrocosmic level: What happens if the database server goes down entirely? What happens if the EJB container does? The J2EE vendors, in fact, will tell you that you don't need to worry about this because their feature-filled, incredibly-expensive-but-worth-it container will provide all sorts of fault-tolerance and failover capabilities for you. As much as I, more than anybody, would love to buy into that, the brutal truth is that the vendors don't, and can't, provide a complete, covered-from-every-angle sort of failure-recovery policy.

For example, take a simple scenario. In a Web application's servlet controllers (see Item 53), frequently the servlet needs to access parameters submitted by the HTTP request in order to figure out what programmatic action to take, even what JSP to forward to for output. What happens if that parameter isn't there, or isn't what's expected? If you're using session state to hold user-transient state, what happens when a user bookmarks a page deep inside the page flow, expecting to be able to come back to it tomorrow?

We also have to face the fact that despite our best efforts, despite how much time we spend trying to predict and prepare for every failure scenario, failures we never anticipated will still happen. None of us are perfect, and the possibility of bugs, missed use cases, or weird combinations of user actions leading to unpredictable behavior exists.
Because we can't prevent those situations, we need to have a plan in mind for how to react to them: how to fix corrupt data, how to deploy a fix to the production environment (ideally without forcing a restart of the container or reboot of the server), even how to apply vendor patches to your servlet or EJB container to fix a vendor bug you've discovered.

As a corollary, however, once you read Item 60, you'll also realize that in addition to "failing robustly," you also need to "fail securely": in other words, don't accidentally hand out information (like the complete stack traces in a servlet-based front end) that attackers can use to gain entry to the system. Make sure that any failures are reported (and in fact verified) and that repeated failures are brought to an administrator's attention somehow; repeated failures are often a sign that an attack is taking place.

Any and/or all of these things will happen to you. To stick your head in the sand and pretend otherwise, or to believe for even a moment that "those are things the system administrators have to worry about, not me," is a recipe for a very long night, struggling to figure out how to do all of the things mentioned above but without the luxury of time or experimentation. Don't sign yourself up for this sort of abuse unless you really like the idea of interviewing a lot.
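One way to bring repeated failures to an administrator's attention, as urged above, is a small sliding-window counter. What follows is a minimal sketch under stated assumptions: the RepeatedFailureMonitor class, its threshold and window parameters, and the alert callback are all hypothetical, and how the alert actually reaches a person (e-mail, pager, monitoring console) is left to the callback.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

/** Hypothetical sketch: raises an alert when failures cluster in time. */
final class RepeatedFailureMonitor {
    private final int threshold;       // failures needed to trigger an alert
    private final long windowMillis;   // how far back failures still count
    private final Consumer<String> alert;
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    RepeatedFailureMonitor(int threshold, long windowMillis, Consumer<String> alert) {
        this.threshold = threshold;
        this.windowMillis = windowMillis;
        this.alert = alert;
    }

    /** Records one failure; the caller supplies the clock reading. */
    synchronized void recordFailure(String source, long nowMillis) {
        failureTimes.addLast(nowMillis);
        // Forget failures that have aged out of the window.
        while (!failureTimes.isEmpty()
                && nowMillis - failureTimes.peekFirst() > windowMillis) {
            failureTimes.removeFirst();
        }
        if (failureTimes.size() >= threshold) {
            // The message carries a count and a source, not a stack trace.
            alert.accept(failureTimes.size() + " failures within "
                    + windowMillis + " ms, latest from " + source);
            failureTimes.clear(); // start counting afresh after alerting
        }
    }
}
```

Passing the timestamp in, rather than reading the system clock inside the class, is a design choice that keeps the sketch easy to test; production code would more likely call System.currentTimeMillis() at the call site.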