Item 50: Tune your SQL



Item 50: Tune your SQL

Quiz time: When are two logically equivalent SQL statements not equivalent? Consider the following pair of logically equal SQL statements:






SELECT * FROM Table WHERE column1='A' AND column2='B'



SELECT * FROM Table WHERE column2='B' AND column1='A'


Which would you say executes faster?[6]

[6] This example appears in SQL Performance Tuning [Gulutzan/Pelzer, 24].

Answer: You can't tell, and neither can the optimizer in most databases. But if you happen to know, for your particular database schema, this particular client, and/or this particular query, that the likelihood of column2 being B is a lot less likely, then the second SQL statement will execute faster than the first, even though the two are logically equivalent. (By the way, never do this for Oracle databases when using the cost-based optimizer—it will do exactly the wrong thing.)

Or, as another example, how about these two statements?






SELECT * FROM Table WHERE column1=5 AND

    NOT (column3=7 OR column1=column2)



SELECT * FROM Table WHERE column1=5 AND column3<>7 AND

    column2<>5


Answer: For five of eight popular databases, the second turns out to be faster.[7] Again, these are logically identical statements, so the actual results returned will be the same; it's just that the database optimizer treats the second one in a different fashion than the first, thereby yielding a better response time.

[7] This example appears in SQL Performance Tuning [Gulutzan/Pelzer, 16].

Tuning SQL remains the number one optimization you can make when working with a relational database. Despite what JDO and entity bean vendors have been trying to achieve for the last five years, we're still in an era where SQL matters. Yes, vendors have made huge strides in making their objects-first persistence models not suck as badly as they first did, but the fact remains that if the database itself can't tell the difference between the two statements, it's highly unlikely that your object-relational layer will be able to. This is why it's crucial for an object-relational system to offer you, at the least, some kind of hook point (see Item 6) to pass in raw SQL that you can tune and optimize as necessary for those queries executed most often.

By the way, bear in mind that for some databases, these two statements are considered to be entirely different and therefore require reparsing, replanning, and reexecution:






SELECT column1*4 FROM Table1 WHERE COLUMN1=COLUMN2 + 7



SELECT Column1 * 4 FROM Table1  WHERE column1=(column2 + 7)


That is to say, even though these statements are precisely identical in what they're doing, because the capitalization and whitespace used within them are different, the database treats them as separate and unrelated statements. With a tool generating your queries, we would hope that they're being generated in a consistent style, but can you be certain? (Ideally, you'd know whether this sort of thing bothers your database before worrying about it too much, but that means you're OK with a priori vendor awareness like that; see Item 11.)

In many cases, you need to know what SQL is actually being executed against your database before you can think about trying to tune it. For some object-relational systems this can be a difficult prospect if they don't expose the actual generated or dynamically constructed SQL code to you. Fortunately, an answer is only a short step away.

First, most database products offer some kind of database-based view of queries being executed against a given database instance, so it's usually pretty easy to fire up the database profiler tool and take a look at the actual SQL. (While you're at it, assuming your database provides query analysis as part of that same profiler, take the SQL queries and run them against the profiler to see how the database will attack executing this particular query; see Item 10.) This should tell you how your favorite object-relational tool is generating the SQL queries and thereby give you a good idea whether SQL optimization is in order.

If you're working with a database that doesn't provide such a tool (time to get a new database, in my opinion, but sometimes you have to work with the tools at hand), assuming your object-relational layer is still going through JDBC to talk to the database itself, you can play a small trick on the object-relational layer by handing it a JDBC driver that "leaks" the actual queries being fed through it. It's the P6Spy driver, available from http://www.p6spy.com, and its operation is deceptively simple: it exposes the same JDBC driver interface that every other JDBC driver on the planet provides, but it doesn't do any actual work, instead delegating that to a JDBC driver for which you provide configuration data. Instead, the P6Spy driver simply echoes the SQL queries to multiple sources, including Log4J logging streams.

For those situations where your object-relational layer isn't doing the optimal SQL thing and the object-relational layer doesn't provide some kind of hook point to pass in optimal SQL, you have a few options.

  • Switch tools. There are plenty of other object-relational tools out there, both commercial and open-source, that you shouldn't feel "stuck" with one that doesn't give you the necessary optimization hook points.

  • Follow the Fast Lane pattern [Alur/Crupi/Malks, 1st ed.], which advocates the idea of doing direct JDBC access against the database, thereby giving you direct control over the SQL being sent. In the case of EJB-based access, this means either doing direct JDBC calls from your session beans or rewriting your CMP entity beans as BMP beans. Unfortunately, this pattern has a number of direct drawbacks (which is probably why it's not present in the second edition), including the fact that if the object-relational layer is doing any sort of cache management, you'll be bypassing that cache management and any implicit Identity Map [Fowler, 195] that's being maintained. Perhaps you can live with the loss of the caching behavior; however, if the Identity Map [Fowler, 195] is being maintained in order to allow the object-relational layer to hold off sending any uncommitted changes to the data, you'll bypass that and create a nasty little concurrency problem for yourself.

  • Pull a bait-and-switch with your object-relational tool, taking a cue from the P6Spy driver (a classic Decorator [GOF, 175] if ever there was one). Write a JDBC driver look-alike—that is, a class that implements the JDBC driver interfaces—that listens for particular queries that match the queries you want to optimize, and replace those queries with hand-tuned, optimized SQL before handing the query off to the real JDBC driver you've wrapped around. You do take an extra layer of method calls as a hit to the driver, but assuming you've done your homework (see Item 10), the gains from the tuned SQL will more than offset the cost of the additional method call.

Whichever method you choose, before you start caching off JNDI lookup results as an optimization, take a hard look at the SQL being executing on your behalf—more often than not, optimizations in the SQL can yield a far greater return on your investment in time and energy than anything else you attempt. Even if your CMP entity bean layer is doing some kind of adaptive SQL generation scheme, you'll want to know about it and make sure that your database administrator is doing his or her part not to work against that.