Item 12: Build in monitoring support



Item 12: Build in monitoring support

So, now that the project's been deployed to production, how do you know it's still working?

Of course, I know, your code has no bugs, your code would never stop working, you've handled every possible exception, you've covered every possible error condition, and you've written code that's so robust that not even a power outage could keep it from working. Your SQL queries are triply redundant, so that an unexpected database schema change or upgrade will be flawlessly handled and gracefully accounted for. Your EJB calls are always wrapped in try/catch blocks, with code in place to gently deal with network outages. You've read Item 7 and memorized it, and your code reacts gracefully and recoverably to every possible error scenario. There's absolutely no way your code could break.

Unfortunately, you didn't write the whole thing yourself, and the college intern over there doesn't pay quite as much attention to code quality as you do.

Most of the time, the only way we know something's busted in production is when we get the phone call, either from the users themselves (assuming the application is an intranet one) or from upper management, who got the call from the support staff, who got the call from the user who was in the middle of placing the million-dollar order when suddenly they got a "404" from the browser and now can't get anything to work. This means that regardless of how quickly we move on the information, or how quickly we can move to fix whatever's broken, we're completely in a reactive mode. In this situation, we're always at the disadvantage; it's not like we can fix something that's not broken.

Once something goes wrong, however, it can be devilishly difficult to figure out exactly what part of the system is busted. Unless it's something horrendously obvious, like the entire EJB container has crashed, we're more or less reduced to examining whatever stack traces show up on the user screen to determine where the exception is coming from and why. Thankfully, Java exception traces tend to be pretty informative, but all it takes is one line of code somewhere in your application, like the following one:






try

{

  // Do work here

}

catch (Throwable t)

{ /* FIX LATER */ }


and suddenly, you don't even have a stack trace anymore—or worse yet, you have the wrong stack trace because now the exception is coming later in that method, because something else went wrong thanks to the first unhandled exception. (Of course, we both know that you would never write code like this, as Item 7 points out, but remember, that intern over in the corner hasn't learned his lessons the hard way yet.)

Rather than always starting the scenario at the disadvantage, instead, put some investment in at the beginning so that we can keep track of what's going on inside the system. Rather than having to rely on end users and management to tell us when something's wrong, build some hooks and switches into the application so that we can start to get an idea of what the system is doing and hopefully know before anyone else when something goes wrong.

For starters, take a page from the TCP/IP folks and come up with a simple test to make sure that everything's running the way it's supposed to, above and beyond any sort of business processing. In the TCP/IP world, of course, this utility is known as ping, and it is a lifesaver when diagnosing network problems because ping will tell you two things: (1) whether the target server is up (presumably, if it's not responding, it's down), and (2) how long it takes to get from your machine to the server and back (which gives you an idea whether perhaps some kind of intermittent network outage is occurring).

In enterprise systems, this ping, what's sometimes called the "happy bit," usually takes the form of a JSP page or servlet that, when hit with a request, simply checks to see whether all the expected servers and other paraphernalia are up and running. The Axis open-source Web Services package calls this JSP page happyaxis.jsp, and all it does is make sure that the Axis plumbing is installed and capable of executing correctly—it tries to load a few classes, and so on. This way, when a problem develops with an Axis-based Web Service endpoint, the first step can be to test happyaxis.jsp, to make sure that at least the Axis bits are "happy" and in place.

Within your own system, put in the necessary "happy bits" at every level feasible. Create a happysystem.jsp page (or equivalent) that will simply verify that necessary libraries can be loaded; do a Class.forName on a few to make sure everything's there. Build a "PingBean," a stateless session bean that offers up a single method, ping(), and in your happysystem.jsp page, do a JNDI lookup and method call on the PingBean to know that the link between the servlet container and the EJB container is alive and well. Put a single-row table, called PING_TBL (or whatever your database naming conventions demand that it be called), that contains exactly one row with well-known, expected data, as in the string "PONG",[6] so that your PingBean and your happysystem.jsp endpoints can both try accessing the database to make sure everything there is OK—and so on.

[6] The classic response to a "PING" request is to send back "PONG".

The point here is that when a problem occurs, you now have a tool in your toolchest that you can use to start eliminating potential problems. If, for example, the happysystem.jsp page doesn't come up at all, you know the problem at least starts with the servlet container or earlier—perhaps the server it's on is down, or the network between the server and the rest of the world is on the fritz, both of which will be discoverable via a ping request, and so on. If happysystem.jsp comes up but indicates that it can't execute the "happy table" SQL query, then it's time to look at the database. If everything on happysystem.jsp indicates success, it's time to start taking a hard look at your code.

Unfortunately, the "happy bit" test is still a reactive one: it waits for you to test it to see if everything is up but won't notify you in the event something goes down—you have to ping it. I suppose we could make the system administrators try to test the "happy bit" every 15 minutes or so to make sure the system is still up, but this has two problems: (1) your error-detection window is now 15 minutes wide, which means you could have an outage of 14 minutes and 59 seconds before you know a problem is up, and (2) you're really going to make your system administrators angry at you if you force them to check the system every 15 minutes. (Try it: doing anything, no matter how trivial, every 15 minutes is hideously hard, particularly if you're wrapped up in something else in the meantime.)

But this kind of repetitive work is ideal for the computer itself, so put the computer to work doing the notification for you. Set up a simple process that issues an HTTP request to the happysystem.jsp page every five minutes or so, or whatever your "acceptable outage" time window is, and uses some kind of HTML screen-scraping API to examine the results and make sure everything's up and running. (The open-source HttpUnit project turns out to be a really good tool for doing this.) In the event that this automated "heartbeat" discovers a problem, it can take one of several actions to notify somebody that something's "not happy": send an e-mail to a mailing list, send an e-mail to an alphanumeric pager, open an IRC or MSM or AIM or Jabber connection to a system administrator and/or developer, post an event to the Win32 Event Log or UNIX syslog, trigger an SNMP fault, and so on. Just as long as it's a channel where we have reasonable guarantee that somebody's listening, anything works.

So far, so good. We've got a tool that will tell us whether there are problems with the underlying plumbing, and we've managed to turn it into an alarm when things fail. We're just getting warmed up here, though. Not only do we want to know before things fail, but we'd also like to have some kind of window "inside" our code so that when things start to go bad, we can take a look at some reasonable guesses about why that might be.

The first part of getting that "window" into your code requires some effort, although not too much: we need to establish some kind of diagnostic log to which we can write debug and diagnostic messages. While the Servlet Specification provides a log associated with the ServletContext, it's alone among the J2EE Specifications in doing so—most notably, the EJB container lacks any such facility, despite the fact it's probably in the largest need of one. It's possible to use the servlet log to log servlet messages and some other mechanism to log EJB messages, but then we lose any sort of continuity in our messages that go from servlet to EJB and back again.[7]

[7] We'll only get this continuity in containers that are both servlet and EJB containers in one process, but since that's a large percentage of the installed J2EE container base, that's a safe assumption. Nevertheless, it bears saying that if the servlet and EJB containers are running in separate processes, it's (almost) impossible to get a single diagnostic log that logs messages from both environments without being hideously time-expensive.

A number of different logging implementations are available (the JDK 1.4 java.util.logging API and the Log4j open-source project are probably the two most popular); whichever you choose is really more a matter of personal choice than any technical rationale. Both offer the ability to "categorize" output messages, both in terms of verbosity level and what "subsystems" you're logging from or about. The upshot here, however, is that within your code, you make use of the log to track events occurring within the system:






// Using JDK 1.4 logging, just by way of example

//

public class LoginServlet extends HttpServlet

{

  static Logger logger =

    Logger.getLogger(LoginServlet.class.getName());



  public void doPost(HttpServletRequest req,

                     HttpServletResponse resp)

    throws IOException

  {

    // Note that we've entered doPost()

    //

    logger.entering("LoginServlet", "doPost");



    // Check our input data

    //

    if (request.getParameter("username") == null ||

        request.getParameter("username").equals(""))

    {

      logger.fatal("Cannot log in without username");

      throw new ServletException("Login failed");

    }



    // and so on



    logger.exiting("LoginServlet", "doPost");

  }

}


The granularity of the log messages can be as coarse or as fine as you like, but as a general rule of thumb, put as much into the log as you'd need to debug the servlet or EJB (or whatever) without a debugger running.

At first blush, that may seem excessive—after all, it's not like these logging calls are free—but the reason for having all that information there becomes apparent the first time you have to debug a problem on the production server that you cannot replicate on your machine, the development server, or the QA server. Having that information in the log suddenly makes it a lot easier to figure out what's going wrong than taking guesses (usually incorrect ones) based on a terse sentence or two.

If you're concerned about the performance of your system when writing out all these log messages (which, to be honest, you should be), there are a few things to remember.

  • Don't do a lot of string concatenation. Most of the logging implementations are pretty quick about determining whether the message even needs to be written to the log (based on some configuration properties), but they can't do anything about the string concatenation that takes place before the logging method gets called. Remember that, like most programming languages, in Java parameters must be evaluated before the method call can take place, and keep the parameters to your log messages simple; if all else fails, if you absolutely have to string a whole slew of information together into a single log message, test to see whether the log is turned on at that level (via the isLoggable method in the JDK 1.4 API, for example). For this reason, if you use toString to display the contents of an object's state, make sure that it doesn't go off and start a thousand-method-long call chain to walk through every owned field; either write toString to be logger-friendly, or else create a new method (e.g., toLogString) that is.

  • Log to someplace that's fast. Most of the time, diagnostic logs write out to the filesystem or the process console STDOUT or STDERR stream by default. Both the JDK 1.4 logging implementation and the Log4J implementation are extremely flexible, however, and allow for custom "log sinks" to be plugged in, so if the standard file logger implementations are spending too much time writing to disk for your taste, write a custom implementation that takes the message and hands it off to a separate thread to do the actual write. (If you do this, make sure you realize that you'll potentially lose log messages if the JVM process abnormally terminates for some reason.) One favorite trick is to write a handler that listens for socket connections on a well-known port number, and write log messages to any connected clients as they come in and throw away log messages when no clients are connected. (Log4J calls this the TelnetAppender, for example.)

  • Log to files for archival purposes. Even when you're not tracking down a problem, log messages to files so that in case something does suddenly go bad, you can go back to the files and look over what was logged in an attempt to figure out why. Again, if the file-writing code isn't as fast as you'd like, you can write your own custom handler doing file writes on a separate thread (with the caveat above). Alternatively, you can turn off logging to the file until a problem arises, but be aware that half the time, the problems are sporadic and impossible to predict in any meaningful way, which is why having the files capture all the logs is so useful in the first place.

  • Make the logging level a hot-configured data item. In other words, don't force a restart of the system to notice a change in the logging level; otherwise, turning up and turning down the logging level becomes more trouble than it's worth when trying to track down a problem, and the chances of it actually being used as a diagnostic utility decrease dramatically. See Item 13 for details on hot-configurable configuration data.

  • Keep the logging verbosity level low except when tracking down problems. Logging implementations use verbosity levels to allow you to mark certain log messages as higher priority than others; for example, in the earlier servlet code, the fact that username wasn't passed into the servlet when it was expected and required is a fatal error because the servlet can't continue processing without it. The fact that we entered (and later exited) the method is of much less importance by comparison. Normally, a production system will want to keep diagnostic logs turned up only high enough to capture errors (and sometimes warnings), in order to save both processing time and disk space if these logs are being stored to disk. Turn those levels back up to more verbose levels only when a problem occurs, and then turn them back down as soon as the problem has been corrected or eliminated.

Now, when the problem arises, you can check the logs for what might have gone wrong and have a leg up on finding and fixing the problem. (This also means that, as mentioned in Item 7, the bare minimum any exception catch handler should do is log the exception to a diagnostic log, even if no other error-handling heuristic comes to mind.)

The other half of our ideal monitoring situation is to keep track of how the system is performing, not only so that we can become aware of potential failures before they occur but also so that we can get a better idea of where the system might be bogging down due to a performance bottleneck or other constraint. While building such hooks is pretty much beyond the scope of this book, fortunately, the Java Community Process comes to the rescue once again with another specification, this time the JMX, a set of APIs for use by objects that wish to be managed by an external management tool. Several vendors, including several open-source projects like JBoss and Tomcat, already have built-in support for JMX, and the "J2EE Management" JSR brings JMX formally into the J2EE 1.4 environment. (For more details on JMX, see Java and JMX [Kreger/Harold/Williamson].)

All JMX technical details aside, the JMX APIs allow you to define managed objects, called MBeans, which can keep track of counters and also issue notifications in the event an MBean sees a performance statistic start to climb out of an acceptable range. For example, a servlet may use a "concurrent access" MBean to keep track of the number of concurrent requests executing within its doPost method, and when that number exceeds a reasonable number (5, 10, 20, whatever), issue a notification, since the calls aren't executing as quickly through the servlet as they should: it's possible that the servlet's calls to the database are taking far longer than they should, which probably indicates problems at the database (the disk is about to fill up, the database CPU is maxed out, and so on).

One of the best parts of JMX is that, thanks to the JMX vendors' GUIs, it allows system administrators not only to keep an eye on statistics and performance inside your code but also to gather statistics that are useful and important to upper management; for example, it would be trivial to create an MBean that tracked the number of successful transactions through the system against the number of failed transactions, partially completed transactions, and incorrectly formed transactions. (Managers just live on this stuff.) Another useful thing to track, keeping an eye on Item 58, is the number of failed logins; you can then issue an event notification when there are more failed login attempts than is reasonable, whether for a single user or across the system as a whole.

By the way, for the sanity of both you and your system administrators, make sure that these monitoring tools you use (the "happy bit" test, the "happy heartbeat" process, the diagnostic logs, and the JMX integration) are accessible remotely, so that an individual, namely you or your friendly system administrator, doesn't have to be standing over the machine to watch it. Most JMX providers are already hard at work providing remote access to the monitored resources; the "happy bit" tests are typically done from within the system, which already has remote capabilities built within it (HTTP, if nothing else); and the diagnostic logs can write either to a socket or to a file, both of which already have some decent remote support. So, in general, making these tools accessible remotely isn't a big deal. The key here is to avoid having your diagnostic logs write to System.out and/or System.err, since those are not exactly easy to access from a remote location.

If you're despairing over all the work implied in this discussion, remember that it isn't just about making managers happy with statistics or making system administrators happy by making it easier for them to determine when there's a problem in production, although those are noble goals and worth the work. Having all of the above in place will make your life easier, as well, since now you'll have a collection of tools you can use to help detect and diagnose a problem once it occurs.

This actually has greater implications than you might believe at first, but stop and think about it for a moment: If the database behind your system suddenly goes down or starts behaving unpredictably for some reason, who's going to get the blame when the Web site in front of the users suddenly starts throwing exceptions everywhere? Unless you have tools in place to help diagnose the root cause of a problem within your system, you'll be the one "under the gun," so to speak, until you can prove otherwise. And if you can make those tools available to system administrators, so that they can do the diagnosis instead of you, that's one less interrupted night on your part. And that, my friends, is more than enough justification for all of the work required to put those tools into your toolchest.