April 8, 2010, 11:39 p.m.
posted by oxy
Item 4: Keep data and processors close togetherIt's a simple idea, but one that shows up time and again in a variety of guises and forms: when you want to work with something, keep it close to you so you don't have to go chasing all over the network to find it. This is why CPUs have on-chip cache, why video cards come with their own memory, and why performance experts suggest caching in your application as a way to speed up its processing. In many respects, this is just another expression of Item 17 since there's really no intrinsic problem with keeping data in its permanent home (in J2EE applications, typically the remote database), except that it costs so much to reach across the network to fetch that data every time we want to work with it. If round-trips were free, we wouldn't need to cache data. But they're not, so we do. We also need to make sure that caches don't get so big that they cause any sort of OutOfMemoryErrors (see Item 74). By the way, local and/or in-process databases (see Item 44) are also a great place to keep cached data. Sometimes, however, the data can't come to you. One frequent reason is that doing so would require locks against the database for far longer than is desirable. This is part of the problem behind accessing entity beans even from a Session Facade [Alur/Crupi/Malks, 341], since the entity bean lives in the EJB container. As Item 23 discusses in more detail, passing data in bulk takes care of moving the data from the client to the EJB layer in a relatively inexpensive fashion, but the data still needs to be migrated from the database layer to the EJB container and back. So, when the data cannot move to the processing, move the processing close to the data. If you think about it, this is much of the stated reason for using a middleware technology in the first place: by placing the business logic on the server (and not the client), the middleware system could keep data close to the processing associated with it. Granted, those justifications were originally used for TP Monitors, but much of the current EJB product literature implies it: "Take full advantage of our product's caching" and so on. Unfortunately, as we've already discussed, since the data still has to end up stored in the relational database behind the EJB container, we're still looking at additional round-trips to put it there, and again, those round-trips don't come for free. (Note that in the scenario where the EJB container offers it, developers can flip a vendor-specific "exclusivity" bit to tell the EJB system that it is the sole owner of the database, and the EJB system now can cache and locally hold data as much as it wishes, thus avoiding a few round-trips. You lose vendor neutrality—see Item 11—when you do this, but sometimes the gain is worth the cost.) You can keep data and processing close together in a number of other ways; one such way is to make use of every bit of the SQL language available to you. For example, if you need to calculate the average age of all people in the system whose last name is "Nelson," effectively you have two options: (1) either in a session bean or in your client code, pull all Person objects in the database with the last name equal to "Nelson" across to the client/session bean, do the calculation yourself, and return the results, or (2) let the relational database do the work by using the SQL-92 AVG function by executing SELECT AVG(age) FROM person WHERE last-name='Nelson'. This will return a single row containing a single column, a numeric value containing the average age of all the Nelsons in the database. In other words, all the calculations have been done on the server, where it was trivial to examine all of the rows in the database that met the predicate criteria, and no additional round-trips were required. As another example, it's not uncommon to see newsgroups and mailing lists littered with a frequently asked question, "How can I know the number of rows in a ResultSet?" The short answer is, "You can't—there's no getRowCount function on ResultSet." Usually another response comes in to suggest that if you use a JDBC 2.x driver that supports scrollable ResultSet objects (see Item 49)—"You can move to the end of the ResultSet and see which row you're on"—but this has the unfortunate side effect of either pulling all the data across the wire or forcing the database to navigate to the end of the ResultSet on its side and send back that window of data to you. A better answer is to use SELECT COUNT() on your WHERE predicate first, before doing the actual SELECT for the data:
Statement s = . . .; // get the Statement from the usual
// places
ResultSet rs_count =
s.executeQuery("SELECT COUNT(*) FROM person " +
"WHERE last_name='Nelson'");
if (rs_count.next() == false)
{
// Something went horribly wrong
}
int count = rs_count.getInt(1);
ResultSet rs =
s.executeQuery("SELECT * FROM person
WHERE last_name='Nelson'");
for (int i=0; i<count; i++)
{
// . . .
}
There is one small race condition—unless protected by a transaction, there is a small window of opportunity for another client to modify the contents of the Person table, thereby making the count incorrect. Either accept that the count could be off, or bracket the two statements with a transaction; either way, despite the transaction lock, this approach will yield better scalability than pulling all of the data across simply to count the number of rows, particularly if you end up not needing anything beyond the first set, such as in search results. Functions such as COUNT, SUM, and others are sometimes called aggregate functions because they operate across a subset of the data in the table, producing an aggregate result based on that subset. In addition, thanks to the power of SQL's closure (see Item 41), these aggregate functions can be used over any relational result, such as views or nested SELECT statements. SQL-92 and its later cousin, SQL-99 or SQL 3, require a number of simple aggregate functions, such as COUNT, SUM, AVG, MIN, and MAX, all of which do exactly as you would expect, and a number of database products also offer vendor-specific hooks to introduce your own aggregate function capability. Many database vendors go well beyond this; for example, Microsoft SQL Server offers the ability to create a table whose definition includes a column that is in fact the result of a user-defined function: -- SQL Server syntax; other databases provide something -- similar -- -- SQL Server syntax for creating a user-defined function -- CREATE FUNCTION ConcatName(@LHS nchar(80), @RHS nchar(80) ) RETURNS nchar(161) AS BEGIN RETURN ( @LHS + ' ' + @RHS ) END -- -- Create a table that defines a column that -- uses the above function -- CREATE TABLE Person ( ID int PRIMARY KEY, first_name nchar(80), last_name nchar(80), age int, full_name AS ( dbo.ConcatName(first_name, last_name) ) ) When retrieving values from this table using a SELECT statement, SQL Server will call the ConcatName function to create the columnar data returned for full_name; this means that we can get the full names of all persons in the database whose last name is "Nelson" simply by executing this code: SELECT full_name FROM person WHERE last_name='Nelson'; -- -- Assuming we have rows: -- Row 1: 1, "Steve", "Nelson", 40, (ConcatName()) -- Row 2: 2, "Lori", "Nelson", 25, (ConcatName()) -- Row 3: 3, "Kirsten", "Nelson", 12, (ConcatName()) -- Row 4: 4, "Marnie", "Nelson", 7, (ConcatName()) -- -- Then the query above returns -- "Steve Nelson","Lori Nelson","Kirsten Nelson","Marnie -- Nelson" Again, all of the processing is done on the server, so no additional work client-side is necessary. Contrast that with how the corresponding logic in the session bean would look, compound that with the fact that a session bean doing this concatenation on an entity bean would implicitly make round-trips to the database to obtain the first_name and last_name fields, and you start to see where this becomes so powerful. The drawback here, of course, is that this is horribly vendor-specific and thus a severe hindrance if you're concerned about portability across database products (see Item 11). In fact, the ultimate expression of Item 4 is to use database stored procedures to do the processing—since the data lives in the database, and processing frequently is more complex than what can be expressed via an aggregate function or vendor extension, a stored procedure offers the ability to do a full procedural, step-by-step massaging of data before returning the results or modifying the data in the database. Stored procedures have two major flaws associated with them, however: they're written in a language other than Java (thus forcing you to learn a new language), and they're not portable to other database systems. Fortunately, help may be on the way here—as part of the SQL/J Specification effort, the various database vendors hope to standardize the use of Java in stored procedures, much as many of the "big name" database vendors, such as Oracle and IBM, have already done. This would address both problems, but seeing that in practical form could be years off. (On the other hand, depending on how complex your stored procedures are, porting them from one vendor's dialect to another may not be all that onerous.) Ultimately, whether you cache data close to the processors or move the processing closer to the data, keep these two as closely bound as possible within your architecture, so as to minimize the amount of traffic that has to take place when processing occurs and to avoid the cost of moving that data across the network (see Item 44). |
- Comment