Item 33. Choose DOM for Standards Support

While SAX programs are almost always faster and more memory efficient than the DOM equivalents, performance is not a god to be worshiped above all others. There are many times when using DOM makes sense. In particular, for many classes of applications, programmers will find DOM much easier to work with. If shaving off 10% of execution time or 90% of space matters less to you than saving 10% of development time, you need to consider whether DOM might better fit the problem at hand than SAX.

In particular, the following characteristics indicate that a problem might be profitably addressed by DOM.

  • The documents are small relative to the available memory: roughly 10% of the heap size or less.

  • Processing requires random access to widely separated parts of the tree. For instance, you need access to the last elements in the document before you can figure out what to do with the first elements.

  • The program needs to navigate backward and up in the tree, as well as forward and down.

  • The program needs to access all the data in the document, not just a subset of it. The program's own data structures essentially reproduce the complexity of the XML document.

  • The developers find working with a tree structure to be easier and more natural than working with an event sequence.

  • The program has to be portable across languages. (I can't honestly say I've ever encountered this requirement in practice, but this need explains a lot of the weirdness in DOM.)
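The random-access and backward-navigation cases above can be sketched in a few lines of Java using the JAXP and org.w3c.dom classes bundled with the JDK. This is a minimal illustration, not a recommended design: the class name TreeNavigation, its helper methods, and the sample order document are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TreeNavigation {

    // Hypothetical sample document. It contains no whitespace between
    // elements, so every child of <order> is an element node.
    static final String ORDER =
        "<order><item>pen</item><item>ink</item><total>7</total></order>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xml)));
    }

    // Random access: jump straight to the last child of the root element.
    static Element lastChildElement(Document doc) {
        return (Element) doc.getDocumentElement().getLastChild();
    }

    // Backward navigation: the text of the sibling just before a node.
    static String previousSiblingText(Node node) {
        Node previous = node.getPreviousSibling();
        return previous == null ? null : previous.getTextContent();
    }

    // Upward navigation: the name of a node's parent.
    static String parentName(Node node) {
        return node.getParentNode().getNodeName();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(ORDER);
        Element total = lastChildElement(doc);   // read the end first
        System.out.println(total.getTextContent());
        System.out.println(previousSiblingText(total));
        System.out.println(parentName(total));
    }
}
```

A SAX handler would have to buffer every item until the total arrived. With the tree in memory, the program simply reads the last element and walks back.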

All of these requirements are fuzzy. If speed matters more to you than product development time or memory usage, you may choose to use SAX even for a system that uses data structures as complex as the XML document itself and requires random access to the tree. The only criterion that's really carved in stone is memory. If the program needs to process documents that are large compared to the available memory, you really have to use a streaming API such as SAX. Otherwise, a lot depends on your comfort level and the need for each characteristic.
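To make the memory criterion concrete, here is a rough sketch of the streaming alternative. It counts elements with a SAX handler and never holds more than the current event in memory, so its footprint does not grow with the document. The class name StreamingCount and the method countElements are invented for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class StreamingCount {

    // Counts the elements with the given local name without building a tree.
    static int countElements(InputSource in, String localName)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        final int[] count = {0};
        factory.newSAXParser().parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                if (local.equals(localName)) {
                    count[0]++;
                }
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input; a real program would stream a large file.
        String xml = "<log><entry/><entry/><note/><entry/></log>";
        InputSource in = new InputSource(new StringReader(xml));
        System.out.println(countElements(in, "entry"));
    }
}
```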

If my recommendation for DOM sounds a lot more hesitant than that for SAX, there's a good reason. DOM can be just plain weird. It is very much like the proverbial horse designed by committee, and, to be perfectly honest, camels don't smell as bad as DOM. DOM is packed with gotchas. Here's a representative sampling of just a few.

  • DOM requires code to specify namespace URIs when creating new Element and Attr objects and to add xmlns and xmlns:prefix attributes to elements where necessary. This is not an either/or. You must do both, and DOM makes no effort to ensure that the two don't accidentally conflict with each other.

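  • Nodes cannot be moved directly from one document to another. A node belongs to the document that created it and must first be copied with importNode() (or, in DOM Level 3, transferred with adoptNode()) before it can be inserted elsewhere.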

  • There are separate methods to add an Attr object to an Element object depending on whether or not the attribute is in a namespace.

  • Every node class has a getNodeValue() method, but more often than not this just returns null.

  • Every node has namespaceURI and localName properties, even nodes like comments and text nodes that have neither names nor namespace URIs.

  • Node types are represented by short integer constants rather than by classes.

  • The document type declaration can be set only when the document is created; from then on it is read-only.
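A few of these gotchas can be reproduced with the DOM implementation bundled in the JDK. The sketch below is illustrative only: the class DomGotchas and its method names are made up for this example, and the behavior shown is what the org.w3c.dom interfaces specify rather than a quirk of any one parser.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class DomGotchas {

    static Document newDocument() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().newDocument();
    }

    // Gotcha: an element's node value is null even when it contains text.
    static String elementNodeValue() throws Exception {
        Document doc = newDocument();
        Element title = doc.createElement("title");
        title.appendChild(doc.createTextNode("Effective XML"));
        return title.getNodeValue();
    }

    // Gotcha: a text node still exposes a localName, which is simply null.
    static String textNodeLocalName() throws Exception {
        return newDocument().createTextNode("hello").getLocalName();
    }

    // Gotcha: createElementNS() records the namespace URI on the node but
    // does not add the matching xmlns attribute to the tree.
    static boolean namespacedElementHasXmlnsAttribute() throws Exception {
        Element html = newDocument()
            .createElementNS("http://www.w3.org/1999/xhtml", "html");
        return html.hasAttribute("xmlns");
    }

    // Gotcha: a node created by one document cannot be appended to another.
    static boolean appendAcrossDocumentsFails() throws Exception {
        Document source = newDocument();
        Document target = newDocument();
        Element root = target.createElement("root");
        target.appendChild(root);
        try {
            root.appendChild(source.createElement("item"));
            return false;
        } catch (DOMException ex) {
            return ex.code == DOMException.WRONG_DOCUMENT_ERR;
        }
    }

    // Workaround: importNode() makes a copy owned by the target document.
    static boolean importedCopyCanBeAppended() throws Exception {
        Document source = newDocument();
        Document target = newDocument();
        Element root = target.createElement("root");
        target.appendChild(root);
        Node copy = target.importNode(source.createElement("item"), true);
        root.appendChild(copy);
        return copy.getOwnerDocument() == target;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNodeValue());
        System.out.println(textNodeLocalName());
        System.out.println(namespacedElementHasXmlnsAttribute());
        System.out.println(appendAcrossDocumentsFails());
        System.out.println(importedCopyCanBeAppended());
    }
}
```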

I could go on—I haven't even begun to consider issues like naming conventions and the use of short constants that may make sense to programmers in some languages but not others—but I'll restrain myself. DOM is such an incredibly baroque API that most experienced XML developers turn to it only as a last resort.

Most of the reasons to use DOM are really reasons to use a tree-based API that holds the document in memory. There's no particular reason this has to be DOM instead of JDOM, XOM, dom4j, or any of the numerous other tree-based APIs. Microsoft implements DOM in MSXML but has added so many nonstandard methods that the resulting API really isn't DOM at all. (See Item 31.) Indeed, the proliferation of alternate tree-based object models for XML is a symptom of the widespread dissatisfaction with DOM in the developer community. The much cleaner SAX API, on the other hand, has the field of push parsing XML almost completely to itself. There are a few rough spots in SAX, but none of them has itched developers enough to make them replace it. DOM, by contrast, makes developers itch worse than the fleas of a thousand camels.

Considered relative to other tree-based APIs, where does DOM stand out? I've talked about its unique weaknesses. What, if any, are its unique strengths? Believe it or not, there are a few that occasionally suggest or even mandate its use.

  • You want the option to change implementations to find the one that performs best on your workload. DOM is the only tree-based API that is independently implemented by many vendors.

  • You need to interface with APIs and libraries that expect to receive DOM objects. DOM predates all the efforts to replace it, so it is much more broadly supported than alternative APIs.

  • Company rules, government regulations, or contractual obligations require you to use standard APIs where available. DOM is a W3C recommendation. None of the competing APIs are any form of ISO, OASIS, IETF, IEEE, or W3C standard.
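The first two strengths can be illustrated with the JAXP classes in the JDK. In the sketch below, the code depends only on the standard org.w3c.dom interfaces, so whichever parser the DocumentBuilderFactory lookup selects can be swapped without source changes, and the resulting Document is handed straight to another standard API (an identity Transformer) through DOMSource. The class DomInterop and its method names are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomInterop {

    // Only standard interfaces appear here; the concrete DOM implementation
    // is whatever the JAXP factory lookup finds at run time.
    static Document buildDocument() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder().newDocument();
        Element root = doc.createElement("greeting");
        root.appendChild(doc.createTextNode("hello"));
        doc.appendChild(root);
        return doc;
    }

    // A library that expects DOM input: the transformation API accepts
    // any org.w3c.dom.Node wrapped in a DOMSource.
    static String serialize(Document doc) throws Exception {
        Transformer identity =
            TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(buildDocument()));
    }
}
```

A tree built with one of the alternative APIs would generally have to be converted to DOM before it could be passed to such a library.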

In brief, DOM is more standard and more broadly supported than other APIs, and this may be important in situations where you need to exchange code with diverse programmers. However, it is not the cleanest, most efficient, fastest, or most productive API you can use to process XML.

