July 22, 2011, 2:43 a.m.
posted by oxy
Item 43: Recognize the object-hierarchical impedance mismatch
XML is everywhere, including in your persistence plans. Once we'd finally gotten around to realizing that XML was all about data and not a language for doing markup itself as HTML was, industry pundits and writers started talking about XML as the logical way to represent objects in data form. Shortly thereafter, the thought of using XML to marshal data across the network was introduced, and SOAP and its accompanying follow-up Web Service specifications were born.
The problem is that XML is intrinsically a hierarchical way to represent data. Look at the XML Infoset Specification, which requires that data be well formed, meaning the elements in an XML document must form a nice tree of elements: each element can have child elements nested within it, and each element has a single parent in which it's nested, with the sole exception of the single "root" node that brackets the entire document. This means that XML is great for representing hierarchical data (hence the title of this item), and assuming your objects form a neat hierarchy, XML is a natural way to represent that data (hence the natural assumption that XML and objects go hand in hand). But what happens when objects don't form nice, natural trees?
Hierarchical data models are not new; in fact, they're quite old. The relational data model was an attempt to find something easier to work with than the database systems of the day, which were similar in concept, if not form, to the hierarchical model we see in XML today. The problem with the hierarchical model at the time was that finding data within it was difficult. Users had to navigate the elements of the tree manually, which left them figuring out how to get to the data rather than focusing on what data they were interested in.
With the emergence of XML (and the growing interest in "XML databases," despite the inherent ambiguity in that term), it would seem that hierarchical data models are becoming popular once again. While a full discussion of the implications of a hierarchical data model is beyond the scope of this book, it's important to discuss two things here: when we're likely to use a hierarchical data model in J2EE, and what implications that will have for Java programmers.
While the industry currently doesn't recognize it, mapping objects to XML (the most common hierarchical storage model today) is not a simple thing, leading us to wonder whether an object-hierarchical impedance mismatch (in other words, a mismatch between the free-form object model we're all used to and the strictly hierarchical model the XML Infoset imposes) is just around the corner.[3] In fact, given that we now have vendors offering libraries to map objects to XML for us, as well as the more recent Java Architecture for XML Binding (JAXB) standard to help unify the various implementations that do so, it may be fair to infer that mapping objects to XML and back again isn't as simple as it seems. Granted, simple object models map to XML pretty easily, but then again, simple object models map pretty easily to relational tables, too, and we all know how "easy" it is to do object-relational mapping.
Much of the problem with mapping objects to a hierarchical model is the same problem that occurs when mapping objects to a relational model: preserving object identity. To understand what I mean, let's go back for a moment to the same Person object we've used in previous items:
public class Person
{
    // Fields public just for simplicity
    //
    public String firstName;
    public String lastName;
    public int age;
    public Person(String fn, String ln, int a)
    { firstName = fn; lastName = ln; age = a; }
}
Again, simple and straightforward, and it's not overly difficult to imagine what an XML representation of this object would look like:
<person>
  <firstName>Ron</firstName>
  <lastName>Reynolds</lastName>
  <age>30</age>
</person>
So far, so good. But now, let's add something that's completely reasonable to expect within an object-oriented model but completely shatters a hierarchical one, namely cyclic references:
public class Person
{
    public String firstName;
    public String lastName;
    public int age;
    public Person spouse;
    public Person(String fn, String ln, int a)
    { firstName = fn; lastName = ln; age = a; }
}
How do you represent the following set of objects?
Person ron = new Person("Ron", "Reynolds", 31);
Person lisa = new Person("Lisa", "Reynolds", 25);
ron.spouse = lisa;
lisa.spouse = ron;
A not-unreasonable approach to serializing ron out to XML would be to simply traverse the fields, recursively following each object as necessary and traversing its fields in turn, and so on. This quickly runs into problems, however, as shown here:
<person>
  <firstName>Ron</firstName>
  <lastName>Reynolds</lastName>
  <age>31</age>
  <spouse>
    <person>
      <firstName>Lisa</firstName>
      <lastName>Reynolds</lastName>
      <age>25</age>
      <spouse>
        <person>
          <firstName>Ron</firstName>
          <lastName>Reynolds</lastName>
          <age>31</age>
          <spouse>
            <!-- Uh, oh . . . -->
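The field-by-field traversal just described can be sketched in a few lines of Java. This is only an illustration under stated assumptions: NaiveXml, write, and terminates are hypothetical names, the Person class is repeated so the example stands alone, and the caught StackOverflowError is simply the most visible symptom of a traversal that never bottoms out.

```java
// Person as defined in the text, repeated so the sketch is self-contained.
class Person {
    String firstName;
    String lastName;
    int age;
    Person spouse;

    Person(String fn, String ln, int a) { firstName = fn; lastName = ln; age = a; }
}

// Hypothetical naive serializer: it follows every reference it finds and
// keeps no record of which objects it has already written.
class NaiveXml {
    static void write(Person p, StringBuilder out) {
        out.append("<person>")
           .append("<firstName>").append(p.firstName).append("</firstName>")
           .append("<lastName>").append(p.lastName).append("</lastName>")
           .append("<age>").append(p.age).append("</age>");
        if (p.spouse != null) {
            out.append("<spouse>");
            write(p.spouse, out);   // recurses blindly into the referenced object
            out.append("</spouse>");
        }
        out.append("</person>");
    }

    // Returns true if serialization completed, false if the recursion
    // exhausted the call stack before finishing.
    static boolean terminates(Person p) {
        try {
            write(p, new StringBuilder());
            return true;
        } catch (StackOverflowError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Person ron = new Person("Ron", "Reynolds", 31);
        Person lisa = new Person("Lisa", "Reynolds", 25);
        System.out.println(terminates(ron));   // true: no spouse yet, so a plain tree

        ron.spouse = lisa;
        lisa.spouse = ron;
        System.out.println(terminates(ron));   // false: ron -> lisa -> ron -> ...
    }
}
```

The same code that handles the tree-shaped case without complaint fails as soon as a single back reference is added, which is the behavior the truncated listing above illustrates.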
As you can see, an infinite recursion develops here because the two objects are circularly referencing one another. We could fix this problem the same way that Java Object Serialization does (see Item 71), by keeping track of which items have been serialized and which haven't, but then we're into a bigger problem: Even if we keep track of identity within a given XML hierarchy, how do we do so across hierarchies? That is, if we serialize both the ron and lisa objects into two separate streams (perhaps as part of a JAX-RPC method call), how do we make the deserialization logic aware of the fact that the data referred to in the spouse field of ron is the same data referred to in the spouse field of lisa?
String param1 = ron.toXML(); // Serialize to XML
String param2 = lisa.toXML(); // Serialize to XML
sendXMLMessage("<parameters>" + param1 + param2 +
"</parameters>");
/* Produces:
param1 =
<person id="id1">
  <firstName>Ron</firstName>
  <lastName>Reynolds</lastName>
  <age>31</age>
  <spouse>
    <person id="id2">
      <firstName>Lisa</firstName>
      <lastName>Reynolds</lastName>
      <age>25</age>
      <spouse><person href="id1" /></spouse>
    </person>
  </spouse>
</person>
param2 =
<person id="id1">
  <firstName>Lisa</firstName>
  <lastName>Reynolds</lastName>
  <age>25</age>
  <spouse>
    <person id="id2">
      <firstName>Ron</firstName>
      <lastName>Reynolds</lastName>
      <age>31</age>
      <spouse><person href="id1" /></spouse>
    </person>
  </spouse>
</person>
*/
// . . . On recipient's side, how will we get
// the spouses correct again?
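The per-stream bookkeeping behind that output (remember which objects have been written, and emit a reference the second time one is reached) can be sketched roughly as follows. Everything here is an assumption made for illustration: IdTrackingXml and its methods are hypothetical names, the id values are simple counters, and the href form follows the listing above, not any particular SOAP toolkit.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Person as defined in the text, repeated so the sketch is self-contained.
class Person {
    String firstName;
    String lastName;
    int age;
    Person spouse;

    Person(String fn, String ln, int a) { firstName = fn; lastName = ln; age = a; }
}

// Hypothetical serializer: one instance per output stream. The ids it hands
// out are only meaningful inside the stream that instance produces.
class IdTrackingXml {
    // IdentityHashMap compares keys with ==, so two distinct Person objects
    // with equal field values are still tracked as different objects.
    private final Map<Person, String> seen = new IdentityHashMap<>();

    String toXml(Person p) {
        StringBuilder out = new StringBuilder();
        write(p, out);
        return out.toString();
    }

    private void write(Person p, StringBuilder out) {
        String id = seen.get(p);
        if (id != null) {
            // Already written in this stream: emit a reference, not a copy.
            out.append("<person href=\"").append(id).append("\"/>");
            return;
        }
        id = "id" + (seen.size() + 1);
        seen.put(p, id);
        out.append("<person id=\"").append(id).append("\">")
           .append("<firstName>").append(p.firstName).append("</firstName>")
           .append("<lastName>").append(p.lastName).append("</lastName>")
           .append("<age>").append(p.age).append("</age>");
        if (p.spouse != null) {
            out.append("<spouse>");
            write(p.spouse, out);
            out.append("</spouse>");
        }
        out.append("</person>");
    }

    public static void main(String[] args) {
        Person ron = new Person("Ron", "Reynolds", 31);
        Person lisa = new Person("Lisa", "Reynolds", 25);
        ron.spouse = lisa;
        lisa.spouse = ron;

        // Two separate streams, each with its own identity table.
        System.out.println(new IdTrackingXml().toXml(ron));
        System.out.println(new IdTrackingXml().toXml(lisa));
    }
}
```

Each stream terminates, but each one also numbers its objects from scratch: "id1" is Ron in the first stream and Lisa in the second, so nothing in the combined message says that the two streams describe the same pair of objects.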
(By the way, this trick of using id and href to track object identity is not new. It's formally described in Section 5 of the SOAP 1.1 Specification, and as a result, it's commonly called SOAP Section 5 encoding or, more simply, SOAP encoding.)
We're managing to keep the object references straight within each individual stream, but when we collapse the streams into a larger document, the two streams have no awareness of one another, and the whole object-identity scheme fails.
So how do we fix this? The short but brutal answer is, we can't, not without relying on mechanisms outside of the XML Infoset Specification, which means that schema and DTD validation won't pick up any malformed data. In fact, the whole idea of object identity preserved by SOAP Section 5 encoding is entirely outside the Schema and/or DTD validator's capabilities, and the latest SOAP Specification (1.2) moves the encoding out of the core specification into an optional adjunct. Cyclic references, which are actually much more common in object systems than you might think, will break a hierarchical data format every time.
Some will point out that we can solve the problem by introducing a new construct into the stream that "captures" the two independent objects, as in the following code:
<marriage>
  <person>
    <!-- Ron goes here -->
  </person>
  <person>
    <!-- Lisa goes here -->
  </person>
</marriage>
But that's missing the point: in doing this, you've essentially introduced a new data element into the mix that doesn't appear anywhere in the object model it was produced from. An automatic object-to-XML serialization tool isn't going to be able to make this kind of decision, and certainly not without some kind of developer assistance.
So what? It's not like we're relying on XML for data storage, for the most part; that's what we have the relational database for, and object-relational mapping layers will take care of all those details for us. Why bother going down this path of object-hierarchical mapping?
Because if you're going to do Web Services, you're going to be doing object-hierarchical mapping. Remember, SOAP Section 5 encoding was created to solve this problem because we want to silently and opaquely transform objects into XML and back again without any work on our part. And the sad truth is, just as object-relational layers will never be able to silently and completely take care of mapping objects to relations, object-hierarchical layers like JAXB or Exolab's Castor will never be able to completely take care of mapping objects to hierarchies.
Don't think that the limitations all go just one way, either. Objects have just as hard a time with XML documents, even schema-valid ones, as XML has with object graphs. Consider the following schema:
<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'
            xmlns:tns='http://example.org/product'
            targetNamespace='http://example.org/product' >
  <xsd:complexType name='Product' >
    <xsd:sequence>
      <xsd:choice>
        <xsd:element name='produce'
                     type='xsd:string'/>
        <xsd:element name='meat' type='xsd:string' />
      </xsd:choice>
      <xsd:sequence minOccurs='1'
                    maxOccurs='unbounded'>
        <xsd:element name='state'
                     type='xsd:string' />
        <xsd:element name='taxable'
                     type='xsd:boolean'/>
      </xsd:sequence>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:element name='Product' type='tns:Product' />
</xsd:schema>
Here is a corresponding schema-valid document:
<groceryStore xmlns:p='http://example.org/product'>
  <p:Product>
    <produce>Lettuce</produce>
    <state>CA</state>
    <taxable>true</taxable>
    <state>MA</state>
    <taxable>true</taxable>
    <state>CO</state>
    <taxable>false</taxable>
  </p:Product>
  <p:Product>
    <meat>Prime rib</meat>
    <state>CA</state>
    <taxable>false</taxable>
    <state>MA</state>
    <taxable>true</taxable>
    <state>CO</state>
    <taxable>false</taxable>
  </p:Product>
</groceryStore>
Ask yourself this question: How on earth can Java (or, for that matter, any other traditional object-oriented language, like C++ or C#) represent this repeating sequence of element state/taxable pairs, or the discriminated union of two different element types, produce or meat? The closest approximation would be to create two subtypes, one each for the produce and meat element particles, then create another new type, this time for the state/taxable pairs, and store an array of those in the Product type itself. The schema defined just one type, and we have to define at least four in Java to compensate. Needless to say, working with this schema-turned-Java type system is going to be difficult at best. And things get even more interesting if we start talking about doing derivation by restriction, occurrence constraints (minOccurs and maxOccurs attributes on schema compositors), and so on. JAXB and other Java-to-XML tools can take their best shot, but they're never going to match schema declarations one-for-one, just as schema and XML can't match objects one-for-one. In short, we have an impedance mismatch.
Where does this leave us? For starters, recognize that XML models hierarchical data well but can't effectively handle arbitrary object graphs. In certain situations, where objects model into a neat hierarchy, the transition will be smooth and seamless, but it takes just one reference to something other than an immediate child object to seriously throw off object-to-XML serializers. Fortunately, strings, dates, and the wrapper classes are usually handled in a pretty transparent manner, despite their formal object status, so that's not an issue, but for anything else, be prepared for some weird and semi-obfuscated results from the schema-to-Java code generator.
Second, take a more realistic view of what XML can do for you. Its ubiquity makes it a tempting format in which to store all your data, but the fact is that relational databases still rule the roost, and we're mostly going to use XML as an interoperability technology for the foreseeable future. Particularly with more and more RDBMS vendors coming to XML as a format with which to describe data, the chances of storing data as XML in an "XML database" are slight. Instead, see XML as a form of "data glue" between Java and other type systems, such as .NET and C++.
A few basic principles come to mind, which I offer here with the huge caveat that some of these, like any good principles, may be sacrificed if the situation calls for it.
Most importantly, make sure that you understand the hierarchical data model and how it differs from relational and object models. Trying to use XML as an objects-first data repository is simply a recipe for disaster; don't go down that road.
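To make the earlier Product discussion concrete, here is a minimal hand-written sketch of the four-type approximation described for that schema. All of the names (Product, Produce, Meat, StateTax, stateTaxes, satisfiesOccurrence) are hypothetical and are not what JAXB or any other code generator would necessarily emit; a List stands in for the array mentioned in the text.

```java
import java.util.ArrayList;
import java.util.List;

// One repetition of the inner <state>/<taxable> sequence.
class StateTax {
    final String state;
    final boolean taxable;

    StateTax(String state, boolean taxable) {
        this.state = state;
        this.taxable = taxable;
    }
}

// The single schema type 'Product' becomes an abstract base class.
abstract class Product {
    // Stands in for minOccurs='1' maxOccurs='unbounded'. The lower bound
    // cannot be expressed in the type, so it has to be checked by hand.
    final List<StateTax> stateTaxes = new ArrayList<>();

    Product tax(String state, boolean taxable) {
        stateTaxes.add(new StateTax(state, taxable));
        return this;
    }

    boolean satisfiesOccurrence() {
        return !stateTaxes.isEmpty();
    }
}

// The xsd:choice turns into two subtypes, one per alternative.
class Produce extends Product {
    final String produce;
    Produce(String produce) { this.produce = produce; }
}

class Meat extends Product {
    final String meat;
    Meat(String meat) { this.meat = meat; }
}

class ProductDemo {
    public static void main(String[] args) {
        Product lettuce = new Produce("Lettuce")
            .tax("CA", true).tax("MA", true).tax("CO", false);
        Product primeRib = new Meat("Prime rib")
            .tax("CA", false).tax("MA", true).tax("CO", false);

        System.out.println(lettuce.stateTaxes.size());              // 3
        System.out.println(primeRib instanceof Meat);               // true
        System.out.println(new Meat("Veal").satisfiesOccurrence()); // false
    }
}
```

Even in this small case, one schema type has become four Java types, and the occurrence constraint survives only as a method that somebody has to remember to call.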