Choose SAX for Computer Efficiency





Item 32. Choose SAX for Computer Efficiency

For most applications, performing the XML processing with SAX will result in by far the fastest program. It will certainly allow the program to use much less memory than the DOM or JDOM equivalent. If maximum speed or minimum size is your primary concern, choose SAX.

What makes SAX so fast and so memory efficient is that it's effectively stateless. As the parser parses, it has an extremely local view of the XML document. It really doesn't need to concern itself with anything more than the current tag, text string, comment, or processing instruction. For purposes of well-formedness checking, a SAX parser has to hold on to a stack of ancestor elements so it can verify that each end-tag matches its corresponding start-tag. Beyond that, it really doesn't need any state at all. Since SAX stores such a small part of a typical XML document at any one time, it can easily process multigigabyte files. SAX happily parses even those supersized documents that make the most memory-efficient DOM parsers on the beefiest machines fall down in tears.

Note

There are a few internal limits on SAX. Although it has no fundamental limits on document size or the content of a text node, certain other constructs can be too large for it to handle. In particular, SAX represents the following XML constructs as single strings:

  • Element and attribute names

  • Attribute values

  • Processing instruction targets

  • Processing instruction data

  • Entity names

  • Comments

If any of these exceed the maximum length of a string (about 2.1GB in Java) or, more likely, the available memory on the local machine, the parser will likely crash or at best provide only part of the content. In practice, however, I've never encountered a document that caused this. I have seen documents where the text content of elements was too large for a single string, and wisely SAX does not require these to be reported with a single string. A SAX parser can split these into pieces and report one manageable piece at a time.


SAX parsers are fast because they do very little work. A SAX parser only has to do four things.

  1. Read the raw bytes of data.

  2. Convert the bytes to Unicode text.

  3. Verify its well-formedness.

  4. Pass the text to the client application.

A DOM parser, by contrast, has to do all of this plus one extra step: It has to build a potentially very complex and very large object representing the XML document. (See Item 33.)

Even more importantly for practical applications, SAX parsers and SAX applications are streaming. That is, a SAX parser informs the client application of some of the data in the document before the parser has finished reading the entire document. The client application can then act on that content before the rest of the data is available.

For example, consider a program that receives a list of stock quotes over the Internet formatted roughly as follows.

<Ticker>
  <Quote>
    <Symbol>AAPL</Symbol>
    <Price>14.35</Price>
  </Quote>
  <Quote>
    <Symbol>AMZN</Symbol>
    <Price>21.34</Price>
  </Quote>
  <!-- several thousand more ... -->
</Ticker>

The program can respond to each quote immediately, before the next quote is received and parsed. Provided the relevant data comes in nice localized chunks, this approach works very well. However, streaming is not suitable for all applications. In the stock quote example, the program might wish to make comparisons between the prices of the different stocks in the document before making its buy and sell decisions. In this case, it needs to wait until all the data is available.

Another potential downside to streaming processing is the need to roll back in the event of a well-formedness error. A program may correctly process 1000 quote elements, but if it then encounters a missing end-tag on the 1001st, you might want to discard all the processing that's been done on the document. In this case, you need to make sure that the preliminary processing does not do anything irreversible such as placing a buy or sell order. However, you might still be able to prepare the orders so they can be sent immediately as soon as the end of the document is detected and full well-formedness has been verified.

Then again, maybe it's OK to send the initial orders. A malformedness error in the 1001st quote element doesn't mean that the first 1000 elements were bad. It's much more likely that their data was correct but that a network glitch corrupted data in the 1001st element. XML does not allow you to process data that appears after the first well-formedness error in a document, but you are allowed to process data that appears before it.

There are times when this is clearly not the case, however. Consider XSLT for example. The template for any node in the document can reference any other node that comes either before or after it. The entire document needs to be parsed in memory before transformation can begin. Even in this case, however, it can still be faster to build the program's own data structures while parsing rather than waiting until the parsing is complete. Similarly, you save a lot of memory by not building two data structures to represent the same document, the one the parser produces and the one the XSLT processor requires. This does depend to some extent on the design of the XSLT processor. A processor that operated directly on a DOM Document object rather than using its own data structures and classes would not be noticeably slower or less efficient than one that built custom data structures directly from SAX events.

Sometimes, though, SAX makes processing faster even when you do want to read the entire document before acting on it. Consider the case where, rather than using an XML data structure such as DOM objects, you use classes and data structures more suited to the problem at hand. For example, the SAX parser reading the stock quote data might well be feeding it into objects that are instantiated from the following class.

public class Quote {
    private String symbol;
    private double price;

    // constructors, methods, and so forth...
}

A SAX program can build one Quote object for each Quote element in the input document as soon as it sees the element. When the document has been fully processed, the internal data structures are ready to go. However, using DOM, after the document has been parsed, the program still needs to walk the tree and construct the objects you actually want. If the local processing and construction of these objects is relatively fast compared to the time needed to receive the next Quote element from the network (and generally it is), the SAX program may be quite a bit faster than the DOM equivalent.

Given that SAX-based programs are almost always faster and more memory-efficient than DOM- or JDOM-based ones, the obvious question is why would anybody ever use a tree API in the first place. The answer is that sometimes it's simply easier to write the program using a tree-oriented API. While SAX maximizes computer efficiency, it's not always a maximum use of programmer efficiency. In my experience, most developers are much more comfortable with the pull style of DOM programs than the push style of SAX programs. Even switching to a pull-based, streaming API such as XMLPULL or StAX still requires you to respond to the data in the document where it appears rather than where you need it. The random access possible in a tree-based API can be very useful.

The second reason you might choose a tree-based API like DOM instead of SAX is that SAX is pretty much a read-only API. It can take the data in from an XML document, but it can't write the document back out again, nor does it give you any hooks for managing the document in memory. In practice, most programs that need to handle XML are primarily concerned with reading it. They convert the XML document into some custom data structure that's convenient for them and then manipulate that unique data structure. However, if instead you want to manipulate a data structure that's very close to the XML document and possibly export the modified XML document to text once again, APIs like DOM and JDOM provide a lot more support than SAX.

The key to most significant SAX programs is this data structure you build in memory as the document is read. It's rare that any one call to a SAX ContentHandler method like startElement() or text() actually gives you all the information you need when you need it. You almost always have to store this information somewhere until more data becomes available. It's only by combining the data passed to several methods that you get something you can work with. However, depending on the nature of the problem, the data structures can be more or less complex. Generally, the flatter the document structure and the more local the processing, the simpler the data structures are and the easier it is to make SAX work.

The stock quote document is a good example of the sort of problems for which SAX processing is straightforward. No elements are more than three deep from the root. No elements contain other elements of the same type. Internally, the data is not stored as XML. Instead it's stored as Quote elements. Each Quote element is pretty much complete unto itself without implicit or explicit linkages to other elements in the document. The overall structure is very flat, very predictable, and very simple. Finally, the program may well not export any XML.

By contrast, writing an XSLT processor is an example of a problem that is not well suited to SAX event-based processing. The data structures you need to store in memory are generally as complex as the document itself is. These data structures can be extremely nonflat and recursive. Processing the second element in the document may require access to the seventeenth, thirty-second, and/or last element in the document. The whole point of an XSL transformation is to create a new XML document that is normally then written onto a stream or into a file. This is about as extreme an example of a case where a tree-based API like DOM is necessary as I can imagine. At most, you would use SAX to build your own tree model on which the XSLT processor operated. You would not attempt to transform SAX events immediately as they're passed into the ContentHandler.

Summing up, you should choose to do your XML processing with SAX under the following conditions.

  • Speed matters.

  • Memory footprint matters.

  • You want to process small, contiguous chunks of the document at one time.

  • The program's internal data structures represent something other than the XML document itself.

  • You need to parse documents but not write them back out again.

The more of these characteristics a problem exhibits, the better SAX fits it.


     Python   SQL   Java   php   Perl 
     game development   web development   internet   *nix   graphics   hardware 
     telecommunications   C++ 
     Flash   Active Directory   Windows