Item 23: Pass data in bulk



Consider, if you will, your run-of-the-mill entity bean (either container-managed or bean-managed, it makes no real difference) representing some detail about a person, perhaps from a U.S.-localized address book application. When used from the client, the entity bean looks something like this:






Person p = personHome.create(new SSNPK("555-12-9876"));
String fullName = p.getFirstName() + " " + p.getLastName();
String streetAddr = p.getStreet() + "\n" +
  p.getCity() + " " + p.getState() + " " + p.getZip();

// . . . Get some user input

p.setFirstName(firstName);
p.setLastName(lastName);
p.setStreet(street);
p.setCity(city);
p.setState(state);
p.setZip(zip);


Many readers familiar with EJB see a huge flaw with this code—it's too "chatty," making multiple method calls on the entity bean to set the necessary data on the bean instance. Each of these calls is a remote call, which in turn means you're making a round-trip per get and/or set call, thus violating Item 17 in a big, big way. Not only is this introducing latency into the application, it's also tanking scalability because this entity bean is being asked to do lots of "little" operations that don't justify the cost of making a round-trip.

The cost of this field-by-field approach isn't just in the networking layers; conceptually, the six set calls are all taking place as part of a single business operation and should probably be protected from modification between set calls by a transaction. In fact, however, all calls to entity beans are protected individually by their own distributed transaction, so instead of doing all six calls in a single transaction, each set call creates a transaction, enlists the database resource on the transaction, sets the data, runs through the two-phase commit protocol to commit the transaction, and tears it back down again. Six times we have to run through this nontrivial exercise.

Worse, we're still not protected completely from data corruption; because the transactions managed by the container are on a per-method basis, there are five windows, one between each of the set calls, where another client can come in and change the data on the bean. As a result, it's not difficult to imagine a scenario something like this:






Client 1 wants to set bean to:
Ted Neward, 1 Artesia Way, Davis, CA 95616
Client 2 wants to set bean to:
Ted Neward, 1 Microsoft Way, Redmond, WA 55512

Client 1 calls setStreet()
Client 2 calls setStreet()
Client 2 calls setCity()
Client 1 calls setCity()
Client 1 calls setState()
Client 2 calls setState()
Client 2 calls setZip()
Client 1 calls setZip()

Bean is now set to:
Ted Neward, 1 Microsoft Way, Davis, WA 95616


This is obviously not a great state of affairs. Even though transactions were used, even though full synchronization was used, we still got semantic data corruption—data that's syntactically legal yet still incorrect. Even if there is a city named "Davis" in Washington, it won't have the zip code of 95616—that's reserved for the city named Davis in the state of California. Moreover, I can guarantee there's no such address as "1 Microsoft Way" in Davis.
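The interleaving can be replayed deterministically in plain Java. What follows is only an illustrative sketch: `AddressRecord` is a hypothetical in-memory stand-in for the entity bean, not EJB code, and each `synchronized` setter plays the role of a call that is protected on its own. The bulk `setAddress` method is likewise an assumption, added to show the contrast.

```java
// Hypothetical in-memory stand-in for the Person entity bean (not real EJB code).
class AddressRecord {
    private String street;
    private String city;
    private String state;
    private String zip;

    // Each setter is atomic by itself, like a call wrapped in its own transaction.
    synchronized void setStreet(String street) { this.street = street; }
    synchronized void setCity(String city) { this.city = city; }
    synchronized void setState(String state) { this.state = state; }
    synchronized void setZip(String zip) { this.zip = zip; }

    // Bulk alternative: all four fields change inside one atomic step.
    synchronized void setAddress(String street, String city, String state, String zip) {
        this.street = street;
        this.city = city;
        this.state = state;
        this.zip = zip;
    }

    synchronized String describe() {
        return street + ", " + city + ", " + state + " " + zip;
    }
}

class InterleavingDemo {
    // Replays the field-by-field calls in the order given in the scenario.
    static String fieldByField() {
        AddressRecord bean = new AddressRecord();
        bean.setStreet("1 Artesia Way");    // Client 1
        bean.setStreet("1 Microsoft Way");  // Client 2
        bean.setCity("Redmond");            // Client 2
        bean.setCity("Davis");              // Client 1
        bean.setState("CA");                // Client 1
        bean.setState("WA");                // Client 2
        bean.setZip("55512");               // Client 2
        bean.setZip("95616");               // Client 1
        return bean.describe();
    }

    // With one bulk call per client, the last writer wins but the record stays consistent.
    static String bulk() {
        AddressRecord bean = new AddressRecord();
        bean.setAddress("1 Artesia Way", "Davis", "CA", "95616");      // Client 1
        bean.setAddress("1 Microsoft Way", "Redmond", "WA", "55512");  // Client 2
        return bean.describe();
    }

    public static void main(String[] args) {
        System.out.println(fieldByField()); // 1 Microsoft Way, Davis, WA 95616
        System.out.println(bulk());         // 1 Microsoft Way, Redmond, WA 55512
    }
}
```

Every individual setter is correctly synchronized, yet the field-by-field replay still ends with an address that neither client asked for. The unit of atomicity has to match the unit of business meaning.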

Faced with this, your first reaction might be to solve the problem by taking out the transaction on the client side, but as Item 29 shows you, client-side transactions are a slippery-slope decision that can quickly turn into an evil thing that no self-respecting J2EE programmer would ever claim to have authored. We want to execute all six set calls where we can take out a transaction without the commensurate cost.

The generalized solution, then, is to pass data in chunks large enough to justify the overhead of the remote call. In short, don't pass data one element at a time, but pass it in bulk: either in whole-object chunks, or even in sets of objects. (This is the basic idea behind the IBM/BEA proposal for Service Data Objects and is arguably just one step shy of a procedural-first persistence layer, as described in Item 42.) For those of you familiar with marshaling terminology, we want to pass-by-value, instead of the default pass-by-reference approach used by entity beans and/or other distributed object technologies.

One approach is to create Data Transfer Objects [Fowler, 401], objects whose data representations are exactly those of the entity beans or persistent objects they represent, or at least something close to them:






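public class PersonDO
  implements java.io.Serializable
{
  public String firstName;
  public String lastName;
  public String street;
  public String city;
  public String state;
  public String zip;
}

// A session bean is a class, not an interface, so it can implement SessionBean
public class PersonTransferBean implements javax.ejb.SessionBean
{
  // mandatory EJB methods left out for simplicity

  public PersonDO getData()
  {
    PersonDO data = new PersonDO();
    // copy the six fields from the underlying Person entity into data
    return data;
  }

  public void setData(PersonDO data)
  {
    // copy all six fields from data onto the underlying Person entity
  }
}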


Notice how PersonDO implements the Serializable interface: this ensures that when PersonDO is sent across the wire, it is marshaled by value instead of by reference. This means that rather than sending a kind of pointer (stub) to the recipient, we send a complete copy of the object, so that all of the data will remain local in the target JVM. Now, when it's time to update data on the PersonBean instance, we can pass all the data at once, instead of dribbling it over a small piece at a time. We lose some of the "object-orientedness" of the system because now we have to abandon the traditional get/set property idiom that has been a part of Java since its early beginnings, but at least this way we're avoiding massive performance issues. What's better, we can put the setData call under a transaction, thereby ensuring that all six field updates follow ACID transactional semantics.
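To make the pass-by-value behavior concrete, the following sketch pushes a transfer object through Java serialization in memory, which is roughly what the remoting layer does with a Serializable argument. It is an approximation under stated assumptions: `PersonDO` is restated here so the example compiles on its own, and `PassByValueDemo` and `copyAcrossTheWire` are hypothetical names; no EJB container or network is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Restated from the text so that the example compiles on its own.
class PersonDO implements Serializable {
    private static final long serialVersionUID = 1L;
    public String firstName;
    public String lastName;
    public String street;
    public String city;
    public String state;
    public String zip;
}

class PassByValueDemo {
    // Serializes and deserializes the argument, approximating marshaling by value.
    static PersonDO copyAcrossTheWire(PersonDO original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return (PersonDO) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round-trip failed", e);
        }
    }

    public static void main(String[] args) {
        PersonDO sent = new PersonDO();
        sent.firstName = "Ted";
        sent.lastName = "Neward";
        sent.city = "Davis";

        PersonDO received = copyAcrossTheWire(sent);
        received.city = "Redmond"; // changes only the receiver's copy

        System.out.println(sent.city);     // Davis
        System.out.println(received.city); // Redmond
    }
}
```

The receiver holds a complete, independent copy, so reading any of its six fields costs nothing further on the network. The flip side is that changes to the copy go nowhere until it is explicitly sent back, for example through setData.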

If you're not keen on writing a Serializable version of each entity bean class (and I can't say that I would blame you if you're not), remember the goal here is simply to pass all the data across in one bulk network call, not to partition the data in any sort of meaningful way. So any sort of bulk pass-by-value approach will work, including passing the data in the standard Java collection classes, such as HashMap or ArrayList, which are themselves Serializable, or in a combination of the two: an ArrayList of HashMap objects is not an uncommon approach.

A drawback to passing data this way is the lack of type safety. With a transfer-object-class-per-entity-bean approach, where each of the fields is strongly typed (as in any Java object definition), the compiler can catch a typo when it creeps in and will complain if you try to access the lsatName field instead of lastName. If the field name is a key into a Map, on the other hand, the compiler can't validate it for you, and you won't notice the problem until runtime (hopefully as part of the unit tests that verify the code, or during the QA process).
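The trade-off shows up in a few lines of code. This is a minimal sketch; `MapTransferDemo`, `buildPayload`, and the key names are hypothetical and chosen only to mirror the fields used earlier.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;

class MapTransferDemo {
    // Builds the bulk payload: one HashMap per person, keyed by field name.
    static ArrayList<HashMap<String, String>> buildPayload() {
        HashMap<String, String> ted = new HashMap<>();
        ted.put("firstName", "Ted");
        ted.put("lastName", "Neward");
        ted.put("city", "Davis");

        ArrayList<HashMap<String, String>> people = new ArrayList<>();
        people.add(ted);
        return people;
    }

    public static void main(String[] args) {
        ArrayList<HashMap<String, String>> people = buildPayload();

        // Compiles because ArrayList (like HashMap) implements Serializable,
        // so the whole structure can travel in a single call.
        Serializable wireReady = people;

        HashMap<String, String> first = people.get(0);
        System.out.println(first.get("lastName")); // Neward
        System.out.println(first.get("lsatName")); // null: the typo compiles and fails silently
    }
}
```

No transfer class had to be written, but the misspelled key is accepted by the compiler and simply yields null at runtime. One common mitigation, offered here only as a suggestion, is to define the keys once as shared String constants.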

Alternatively, you can use a RowSet to pass data across, since RowSet objects are Serializable objects themselves. One of the interesting properties of RowSet implementations is that they typically operate in a disconnected manner. Unlike a ResultSet, the RowSet doesn't hold an active JDBC Connection back to the database to retrieve data in chunks; instead, all the data in the ResultSet is copied to the RowSet and held locally, within the RowSet, so that when the RowSet is serialized, all the data is serialized with it. What's more, JSR 114 and J2SE 1.5 are defining a set of standardized RowSet implementations with a variety of interesting features, including the ability to hold more than one set of tuples (result sets) as part of the RowSet.

It's fair to ask at this point whether there's anything special about the Serializable format, and of course, the answer is "not really": it's just a well-known binary format. Thus, as mentioned in Item 15, another acceptable serialization format is XML, since other platforms, most notably .NET, can consume XML much more easily than they can a binary format not intrinsically known to them. Toward this end, Sun created the WebRowSet. Although not formally part of J2EE, it has been available from the Java Developer Connection in beta form since 2000 and is now in the process of being ratified under JSR 114, along with several other RowSet implementations. The WebRowSet is an implementation of RowSet that extends CachedRowSet but adds two new methods, writeXml and readXml, each of which does exactly as its name implies: writeXml converts the RowSet to an XML Infoset instance, and readXml converts it back again. It's a proprietary XML format, to be certain, but at least it's in XML, which is still better than a proprietary binary format when interoperating with other platforms (see Item 22 for more on interoperability).

Be aware, by the way, that when using entity bean CMP implementations, it's still entirely possible that multiple round-trips are being executed between the container and the database—for example, depending on how the transactional affinity is marked on the session bean "fronting" the entity bean, you could be making separate trips due to the container's need to start and end transactions on each call (see Item 31). In some cases, the EJB container has no choice but to generate some truly brain-dead code for CMP-marked entity beans, such as a standalone SELECT on each entity bean get call and UPDATE on each set call, due to the lack of any standard hints to the container about this bean, such as a read-only flag or a dirty bit that can be manually inspected and used. Vendors frequently offer such optimizations, but taking advantage of them renders your code vendor-specific (see Item 11). For this reason, running a "spy" JDBC driver or using your database's monitoring tools to look at the SQL (see Item 10) is crucial to understanding exactly how many round-trips that CMP entity is generating.

Passing data in bulk has its limitations, too—passing large amounts of data across the network repetitively is no better than making multiple round-trips, since now you're soaking up network bandwidth and forcing the communications layer to expend significant effort to marshal, transmit, receive, and unmarshal that data. Exercise your own judgment in whether a particular large data item is better sent across by reference or by value, based on how clients will (or won't) use it; pay particular attention to collections and Serializable objects, since thanks to rules of Serialization it's not uncommon for just one reference to bring along a whole slew of other objects you never would have thought should be serialized (see Item 71). Remember, the goal is to avoid spending excessive time on the network, not to be dogmatic about one approach or another.
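One way to see how a single reference can inflate a transfer is to measure the serialized size directly. The sketch below is illustrative only: `OrderHistory`, `HeavySummary`, `LeanSummary`, and `SerializedSizeDemo` are hypothetical names, and marking the field `transient` is just one of several ways to keep it off the wire.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical bulky object that a client of the summary never needs.
class OrderHistory implements Serializable {
    private static final long serialVersionUID = 1L;
    byte[] archive = new byte[100_000];
}

// One innocent-looking reference drags the whole history across the wire.
class HeavySummary implements Serializable {
    private static final long serialVersionUID = 1L;
    String name = "Ted Neward";
    OrderHistory history = new OrderHistory();
}

// Marking the reference transient leaves it out of the serialized form.
class LeanSummary implements Serializable {
    private static final long serialVersionUID = 1L;
    String name = "Ted Neward";
    transient OrderHistory history = new OrderHistory();
}

class SerializedSizeDemo {
    // Returns the number of bytes Java serialization produces for the value.
    static int serializedSize(Serializable value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(value);
            }
            return buffer.size();
        } catch (IOException e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("heavy: " + serializedSize(new HeavySummary()) + " bytes");
        System.out.println("lean:  " + serializedSize(new LeanSummary()) + " bytes");
    }
}
```

The heavy version carries the full 100,000-byte archive on every call, while the lean one sends little more than the name. Keep in mind that a transient field arrives as null on the receiving side, so this only suits data the client genuinely does not use.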