Item 43: Recognize the object-hierarchical impedance mismatch

XML is everywhere, including in your persistence plans.

Once we'd finally gotten around to realizing that XML was all about data and not a language for doing markup itself as HTML was, industry pundits and writers started talking about XML as the logical way to represent objects in data form. Shortly thereafter, the thought of using XML to marshal data across the network was introduced, and SOAP and its accompanying follow-up Web Service specifications were born.

The problem is that XML is intrinsically a hierarchical way to represent data—look at the XML Infoset Specification, which requires that data be well formed, meaning the elements in an XML document must form a nice tree of elements (each element can have child elements nested within it, each element has a single parent in which it's nested, with the sole exception of the single "root" node that brackets the entire document, and so on). This means that XML is great for representing hierarchical data (hence the title of this item), and assuming your objects form a neat hierarchy, XML is a natural way to represent that data (hence the natural assumption that XML and objects go hand in hand).

But what happens when objects don't form nice, natural trees?

Hierarchical data models are not new; in fact, they're quite old. The relational data model was an attempt to find something easier to work with than the database systems of the day, which were similar in concept, if not form, to the hierarchical model we see in XML today. The problem with the hierarchical model at the time was that attempting to find data within it was difficult. Users had to navigate the elements of the tree manually, leaving them to figure out "how" instead of focusing on "what": that is, how to get to the data, rather than what data they were interested in.

With the emergence of XML (and the growing interest in "XML databases," despite the inherent ambiguity in that term), it would seem that hierarchical data models are becoming popular once again. While a full discussion of the implications of a hierarchical data model is beyond the scope of this book, it's important to discuss two things here: when we're likely to use a hierarchical data model in J2EE, and what implications that will have for Java programmers.

While the industry currently doesn't recognize it, mapping objects to XML (the most common hierarchical storage model today) is not a simple thing, leading us to wonder whether an object-hierarchical impedance mismatch (in other words, a mismatch between the free-form object model we're all used to and the strictly hierarchical model the XML Infoset imposes) is just around the corner.[3] In fact, given that we now have vendors offering libraries to map objects to XML for us, as well as the more recent Java Architecture for XML Binding (JAXB) standard to help unify the various implementations that do so, it may be fair to infer that mapping objects to XML and back again isn't as simple as it seems. Granted, simple object models map to XML pretty easily, but then again, simple object models map pretty easily to relational tables, too, and we all know how "easy" it is to do object-relational mapping.

[3] Let's not even consider the implications of objects stored in relational databases being transformed into XML: the idea of an object-relational-hierarchical impedance mismatch is enough to move the strongest programmer to tears.

Much of the problem with mapping objects to a hierarchical model is the same problem that occurs when mapping objects to a relational model: preserving object identity. To understand what I mean, let's go back for a moment to the same Person object we've used in previous items:






public class Person
{
  // Fields public just for simplicity
  public String firstName;
  public String lastName;
  public int age;

  public Person(String fn, String ln, int a)
  { firstName = fn; lastName = ln; age = a; }
}


Again, simple and straightforward, and it's not overly difficult to imagine what an XML representation of this object would look like:






<person>
  <firstName>Ron</firstName>
  <lastName>Reynolds</lastName>
  <age>30</age>
</person>


So far, so good. But now, let's add something that's completely reasonable to expect within an object-oriented model but completely shatters a hierarchical one—cyclic references:






public class Person
{
  public String firstName;
  public String lastName;
  public int age;
  public Person spouse;

  public Person(String fn, String ln, int a)
  { firstName = fn; lastName = ln; age = a; }
}


How do you represent the following set of objects?






Person ron = new Person("Ron", "Reynolds", 31);
Person lisa = new Person("Lisa", "Reynolds", 25);
ron.spouse = lisa;
lisa.spouse = ron;


A not-unreasonable approach to serializing ron out to XML would be to simply traverse the fields, recursively following each referenced object and traversing its fields in turn, and so on; this quickly runs into problems, however, as shown here:






<person>
  <firstName>Ron</firstName>
  <lastName>Reynolds</lastName>
  <age>31</age>
  <spouse>
    <person>
      <firstName>Lisa</firstName>
      <lastName>Reynolds</lastName>
      <age>25</age>
      <spouse>
        <person>
          <firstName>Ron</firstName>
          <lastName>Reynolds</lastName>
          <age>31</age>
          <spouse>
            <!-- Uh, oh . . . -->


As you can see, an infinite recursion develops here because the two objects are circularly referencing one another. We could fix this problem the same way that Java Object Serialization does (see Item 71), by keeping track of which items have been serialized and which haven't, but then we're into a bigger problem: Even if we keep track of identity within a given XML hierarchy, how do we do so across hierarchies? That is, if we serialize both the ron and lisa objects into two separate streams (perhaps as part of a JAX-RPC method call), how do we make the deserialization logic aware of the fact that the data referred to in the spouse field of ron is the same data referred to in the spouse field of lisa?






String param1 = ron.toXML(); // Serialize to XML
String param2 = lisa.toXML(); // Serialize to XML
sendXMLMessage("<parameters>" + param1 + param2 +
               "</parameters>");

/* Produces:
param1 =
<person id="id1">
  <firstName>Ron</firstName>
  <lastName>Reynolds</lastName>
  <age>31</age>
  <spouse>
    <person>
      <firstName>Lisa</firstName>
      <lastName>Reynolds</lastName>
      <age>25</age>
      <spouse><person href="id1" /></spouse>
    </person>
  </spouse>
</person>
param2 =
<person id="id1">
  <firstName>Lisa</firstName>
  <lastName>Reynolds</lastName>
  <age>25</age>
  <spouse>
    <person>
      <firstName>Ron</firstName>
      <lastName>Reynolds</lastName>
      <age>31</age>
      <spouse><person href="id1" /></spouse>
    </person>
  </spouse>
</person>
 */

// . . . On recipient's side, how will we get
// the spouses correct again?


(By the way, this trick of using id and href to track object identity is not new. It's formally described in Section 5 of the SOAP 1.1 Specification, and as a result, it's commonly called SOAP Section 5 encoding or, more simply, SOAP encoding.) We're managing to keep the object references straight within each individual stream, but when we collapse the streams into a larger document, the two streams have no awareness of one another, and the whole object-identity scheme fails. So how do we fix this?

The short but brutal answer is, we can't, not without relying on mechanisms outside of the XML Infoset Specification, which means that schema and DTD validation won't pick up any malformed data. In fact, the whole idea of object identity preserved by SOAP Section 5 encoding is entirely outside the Schema and/or DTD validator's capabilities, and the encoding has been relegated to an optional adjunct in the latest SOAP Specification (1.2). Cyclic references, which are actually much more common in object systems than you might think, will break a hierarchical data format every time.
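The within-stream half of that bookkeeping can be sketched in a few lines of Java. What follows is only an illustration, not the SOAP encoding rules themselves: the IdentityTrackingXml class and its toXml method are hypothetical names, Person is redeclared so the block stands alone, every person receives an id (not just the multiply referenced one), and text is not XML-escaped. The thing to notice is that each call starts with a fresh identity map, so two separately serialized streams both hand out id1, to different people.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Redeclared here so the sketch compiles on its own.
class Person {
    String firstName;
    String lastName;
    int age;
    Person spouse;

    Person(String fn, String ln, int a) { firstName = fn; lastName = ln; age = a; }
}

// Hypothetical helper: tracks object identity within ONE stream only.
class IdentityTrackingXml {

    static String toXml(Person root) {
        // Keyed on reference identity (==), not equals().
        Map<Person, String> seen = new IdentityHashMap<Person, String>();
        StringBuilder out = new StringBuilder();
        write(root, seen, out);
        return out.toString();
    }

    private static void write(Person p, Map<Person, String> seen, StringBuilder out) {
        String id = seen.get(p);
        if (id != null) {
            // Already written in this stream: emit a reference and stop recursing.
            out.append("<person href=\"").append(id).append("\"/>");
            return;
        }
        id = "id" + (seen.size() + 1);
        seen.put(p, id);
        out.append("<person id=\"").append(id).append("\">");
        // Assumption: values contain no characters that need XML escaping.
        out.append("<firstName>").append(p.firstName).append("</firstName>");
        out.append("<lastName>").append(p.lastName).append("</lastName>");
        out.append("<age>").append(p.age).append("</age>");
        if (p.spouse != null) {
            out.append("<spouse>");
            write(p.spouse, seen, out);
            out.append("</spouse>");
        }
        out.append("</person>");
    }

    public static void main(String[] args) {
        Person ron = new Person("Ron", "Reynolds", 31);
        Person lisa = new Person("Lisa", "Reynolds", 25);
        ron.spouse = lisa;
        lisa.spouse = ron;
        // Each call terminates, but each also restarts its numbering at id1.
        System.out.println(toXml(ron));
        System.out.println(toXml(lisa));
    }
}
```

Both documents printed by main open with a person element whose id is id1, once for Ron and once for Lisa, which is why the references stop meaning anything as soon as the two streams are pasted into one larger document.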

Some will point out that we can solve the problem by introducing a new construct into the stream that "captures" the two independent objects, as in the following code:






<marriage>
  <person>
    <!-- Ron goes here -->
  </person>
  <person>
    <!-- Lisa goes here -->
  </person>
</marriage>


But that's missing the point—in doing this, you've essentially introduced a new data element into the mix that doesn't appear anywhere in the object model it was produced from. An automatic object-to-XML serialization tool isn't going to be able to make this kind of decision, and certainly not without some kind of developer assistance.

So what? It's not like we're relying on XML for data storage, for the most part—that's what we have the relational database for, and object-relational mapping layers will take care of all those details for us. Why bother going down this path of object-hierarchical mapping?

If you're going to do Web Services, you're going to be doing object-hierarchical mapping: remember, SOAP Section 5 encoding was created to solve this problem because we want to silently and opaquely transform objects into XML and back again without any work on our part. And the sad truth is, just as object-relational layers will never be able to silently and completely take care of mapping objects to relations, object-hierarchical layers like JAXB or Exolab's Castor will never be able to completely take care of mapping objects to hierarchies.

Don't think that the limitations all go just one way, either. Objects have just as hard a time with XML documents, even schema-valid ones, as XML has with object graphs. Consider the following schema:






<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'
    xmlns:tns='http://example.org/product'
    targetNamespace='http://example.org/product' >
  <xsd:complexType name='Product' >
    <xsd:sequence>
      <xsd:choice>
        <xsd:element name='produce'
                     type='xsd:string'/>
        <xsd:element name='meat' type='xsd:string' />
      </xsd:choice>
      <xsd:sequence minOccurs='1'
                    maxOccurs='unbounded'>
        <xsd:element name='state'
                     type='xsd:string' />
        <xsd:element name='taxable'
                     type='xsd:boolean'/>
      </xsd:sequence>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:element name='Product' type='tns:Product' />
</xsd:schema>


Here is the schema-valid corresponding document:






<groceryStore xmlns:p='http://example.org/product'>
  <p:Product>
    <produce>Lettuce</produce>
    <state>CA</state>
    <taxable>true</taxable>
    <state>MA</state>
    <taxable>true</taxable>
    <state>CO</state>
    <taxable>false</taxable>
  </p:Product>
  <p:Product>
    <meat>Prime rib</meat>
    <state>CA</state>
    <taxable>false</taxable>
    <state>MA</state>
    <taxable>true</taxable>
    <state>CO</state>
    <taxable>false</taxable>
  </p:Product>
</groceryStore>


Ask yourself this question: How on earth can Java (or, for that matter, any other traditional object-oriented language, like C++ or C#) represent this repeating sequence of element state/taxable pairs, or the discriminated union of two different element types, produce or meat? The closest approximation would be to create two subtypes, one each for the produce and meat element particles, then create another new type, this time for the state/taxable pairs, and store an array of those in the Product type itself. The schema defined just one type, and we have to define at least four in Java to compensate.
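To make that count concrete, here is one way the approximation just described might look. It is a hand-written sketch, not the output of JAXB or any other binding tool, and every name in it (StateTax, Produce, Meat, ProductModelSketch) is invented for illustration. It also uses a generic List where the text speaks of an array, and it has to enforce the minOccurs='1' constraint at run time because the Java type system has no way to state it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// One state/taxable pair from the repeating inner sequence.
final class StateTax {
    final String state;
    final boolean taxable;

    StateTax(String state, boolean taxable) {
        this.state = state;
        this.taxable = taxable;
    }
}

// Stands in for the single schema type; the xsd:choice becomes two subclasses.
abstract class Product {
    final List<StateTax> stateTaxes;

    Product(List<StateTax> stateTaxes) {
        // minOccurs='1' cannot be expressed statically, so check it here.
        if (stateTaxes == null || stateTaxes.isEmpty()) {
            throw new IllegalArgumentException(
                "at least one state/taxable pair is required");
        }
        this.stateTaxes =
            Collections.unmodifiableList(new ArrayList<StateTax>(stateTaxes));
    }
}

// The 'produce' branch of the choice.
final class Produce extends Product {
    final String produce;

    Produce(String produce, List<StateTax> stateTaxes) {
        super(stateTaxes);
        this.produce = produce;
    }
}

// The 'meat' branch of the choice.
final class Meat extends Product {
    final String meat;

    Meat(String meat, List<StateTax> stateTaxes) {
        super(stateTaxes);
        this.meat = meat;
    }
}

class ProductModelSketch {
    public static void main(String[] args) {
        // Mirrors the first p:Product element of the sample document.
        Product lettuce = new Produce("Lettuce", Arrays.asList(
            new StateTax("CA", true),
            new StateTax("MA", true),
            new StateTax("CO", false)));
        System.out.println(lettuce.stateTaxes.size() + " state/taxable pairs");
    }
}
```

Even in this small form, nothing in the Java types records that produce and meat were alternatives inside one schema type, or that the pairs must appear in state-then-taxable order; that knowledge now lives only in the mapping code.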

Needless to say, working with this schema-turned-Java type system is going to be difficult at best. And things get even more interesting if we start talking about doing derivation by restriction, occurrence constraints (minOccurs and maxOccurs facets on schema compositors), and so on. JAXB and other Java-to-XML tools can take their best shot, but they're never going to match schema declarations one-for-one, just as schema and XML can't match objects one-for-one. In short, we have an impedance mismatch.

Where does this leave us?

For starters, recognize that XML models hierarchical data well but can't effectively handle arbitrary object graphs. In certain situations, where objects model into a neat hierarchy, the transition will be smooth and seamless, but it takes just one reference to something other than an immediate child object to seriously throw off object-to-XML serializers. Fortunately, strings, dates, and the wrapper classes are usually handled in a pretty transparent manner, despite their formal object status, so that's not an issue, but for anything else, be prepared for some weird and semi-obfuscated results from the schema-to-Java code generator.

Second, take a more realistic view of what XML can do for you. Its ubiquity makes it a tempting format in which to store all your data, but the fact is that relational databases still rule the roost, and we're mostly going to use XML as an interoperability technology for the foreseeable future. Particularly with more and more RDBMS vendors coming to XML as a format with which to describe data, the chances of storing data as XML in an "XML database" are slight. Instead, see XML as a form of "data glue" between Java and other type systems, such as .NET and C++.

A few basic principles come to mind, which I offer here with the huge caveat that some of these, like any good principles, may be sacrificed if the situation calls for it.

  • Use XML Schema to define your data types. Just as you wouldn't realistically consider doing an enterprise project storing data in a relational database without defining relational constraints on your data model, don't realistically consider doing enterprise projects storing data in XML without defining XML data types. Having said that, however, at times, a more flexible model of XML will be useful, such as when allowing for user extensions to the XML data instances. Be prepared for this by writing your types to have extension points within the type declarations, via judicious use of any-type elements. And, although this should be rare within the enterprise space, in some cases some XML data needs to be entirely free-form in order to be useful, such as Ant scripts. In those situations, be strong enough to recognize that you won't be able to (or want to) have schema types defined and that they will require hand-verification and/or parsing.

  • Prefer a subset of the schema simple types. XML Schema provides a rich set of simple types (those that most closely model primitive types in Java), such as the gYearMonth and gMonthDay types for dates, but Java has no corresponding equivalent; a schema-to-Java tool will most likely model both of those, as well as many others, as a simple integer field. Unfortunately, that means you can store anything you want into that field, thus losing the type definition intended by the schema document in the first place. To avoid this situation, prefer to stick to the types in XML Schema that most closely model what Java (and .NET and any other language you may end up talking to) can handle easily.

  • Use XML Schema validating parsers to verify instances of your schema types when parsing those instances. The parser will flag schema-invalid objects, essentially acting as a kind of data-scrubbing and input-validating layer for you, without any work on your part. This will in turn help enforce that you're using document-oriented approaches to your XML types, since straying from that model will be flagged by the validator. Be aware, though, that schema-validating parsers are woefully slow compared to their non-schema-validating counterparts. More importantly, schema-validating parsers will only be able to flag schema-invalid objects, and if you're taking an object-based approach to your XML types that uses out-of-band techniques (like SOAP encoding does), schema validators won't pick it up, so you'll know there's a problem only when the parser buys off on the object but your code doesn't. That's a hard scenario to test for.

  • Understand that type definitions are relative. Your notion of what makes a Person is different from my notion of what makes a Person, and our corresponding definitions of Person (whether in object type definitions, XML Schema, or relational schema) differ accordingly. While some may try to search for the Universal Truth regarding the definition of Person, the fact is that there really isn't one—what's important to your organization about Person is different from what's important to my organization, and no amount of persuasion on your part is going to change that for me, and vice versa. Instead of going down this dead-end road, simply agree to disagree on this, and model schema accordingly if the documents described by this schema are to be used by both of us. In other words, use schema to verify data being sent from one system to another, rather than trying to use it to define types that everybody agrees on.

  • Avoid HOLDS-A relationships. As with the Person example, instances that hold "references" to other instances create problems. Avoid them when you can. Unfortunately, that's a lot easier said than done. There is no real way to model Person in a document-oriented fashion if Person needs to refer to the spouse and still preserve object identity. Instead, you're going to have to recognize that the Person's spouse is an "owned" data item—so instead of trying to model it as a standalone Person, just capture enough data to uniquely identify that Person from some other document (much as a relational table uses a foreign key to refer to a primary key in another table). Unfortunately, again, XML Schema can't reflect this[4] and it will have to be captured in some kind of out-of-band mechanism; no schema constraints can cross documents, at least not as of this writing.

    [4] If this XML document is a collection of Person instances, and the spouse is guaranteed to be within this collection someplace, that's a different story, but that also changes what we're modeling here and becomes a different problem entirely.
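The advice above about schema-validating parsers can be tried with the javax.xml.validation package that ships with the JDK (it arrived with JAXP 1.3 in Java 5, so it may postdate the tools this item has in mind). This is a minimal sketch, assuming a tiny hand-written schema for the earlier person document; the PersonValidationSketch class, its isValid helper, and the schema text are all illustrative and do not come from the original.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class PersonValidationSketch {

    // Assumed schema for the simple <person> document shown earlier.
    static final String PERSON_XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + "<xsd:element name='person'>"
        + "<xsd:complexType><xsd:sequence>"
        + "<xsd:element name='firstName' type='xsd:string'/>"
        + "<xsd:element name='lastName' type='xsd:string'/>"
        + "<xsd:element name='age' type='xsd:int'/>"
        + "</xsd:sequence></xsd:complexType>"
        + "</xsd:element>"
        + "</xsd:schema>";

    // Returns true when the document is schema-valid, false otherwise.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema =
                factory.newSchema(new StreamSource(new StringReader(PERSON_XSD)));
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, validation errors surface as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String good = "<person><firstName>Ron</firstName>"
                    + "<lastName>Reynolds</lastName><age>31</age></person>";
        String bad  = "<person><firstName>Ron</firstName>"
                    + "<lastName>Reynolds</lastName><age>thirty-one</age></person>";
        System.out.println("good document valid? " + isValid(good));
        System.out.println("bad document valid?  " + isValid(bad));
    }
}
```

A check like this covers only what the schema itself can express. An out-of-band convention, such as an href that points at an id in some other stream, would pass the validator untouched and fail later in application code.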

Most importantly, make sure that you understand the hierarchical data model and how it differs from relational and object models. Trying to use XML as an objects-first data repository is simply a recipe for disaster—don't go down that road.

