June 1, 2011, 11:55 a.m.
posted by factorial
Setting Up SAX
I'm increasingly of the "learning is best done by doing" philosophy, so I'm not going to hit you with a bunch of concepts and theory before getting to code. SAX is a simple API, so you only need to understand its basic model and how to get the API on your machine; beyond that, code will be your best teacher.
Callbacks and Event-Based Programming
SAX uses a callback model for interacting with your code; you may also have heard this model called event-based programming. Whatever you call it, it's a bit of a departure for object-oriented developers, so give it some time if you're new to this type of programming.
In short, the parsing process is going to hum along, tearing through an XML document. Every time it encounters a tag, or comment, or text, or any other piece of XML, it calls back into your code, signaling that an event has occurred. Your code then has an opportunity to act, based on the details of that event.
For example, if SAX encounters the opening tag of an element, it fires off a startElement event. It provides information about that event, such as the name of the element, its attributes, and so on, and then your code gets to respond. You, as a programmer, have to write code for each event that is important to you, from the start of a document to a comment to the end of an element. This process is summed up in Figure.
The parsing process is controlled by the parser and your code listens for events, responding as they occur
What's different about this model is that your code is not active, in the sense that it doesn't ever instruct the parser, "Hey, go and parse the next element." It's passive, in that it waits to be called, and then leaps into action. This takes a little getting used to, but you'll be an old hand by the end of the chapter.
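To make the callback model concrete, here is a minimal sketch of a handler that simply records each event as the parser reports it. The class name EventLogger and its parse helper are hypothetical, and the sketch assumes the JAXP SAXParserFactory bundled with the JDK as the source of the XMLReader (the text above doesn't prescribe how you obtain one). Notice that the code never asks for the next element; it only reacts when it is called.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: records SAX events in the order they arrive.
class EventLogger extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The parser supplies the element name and attributes; we only react.
        events.add("startElement " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement " + qName);
    }

    @Override
    public void endDocument() {
        events.add("endDocument");
    }

    // Parses the given XML string and returns the recorded events.
    static List<String> parse(String xml) {
        try {
            // Assumption: the JDK's JAXP factory is used to get an XMLReader.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            EventLogger logger = new EventLogger();
            reader.setContentHandler(logger);
            // Control passes to the parser here; it calls back into logger.
            reader.parse(new InputSource(new StringReader(xml)));
            return logger.events;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<book id=\"1\"><title>Java and XML</title></book>";
        for (String event : parse(xml)) {
            System.out.println(event);
        }
    }
}
```

Running main prints one line per event, beginning with startDocument and ending with endDocument, with the start and end of each element nested in document order between them.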
The SAX API
Unsurprisingly, the SAX API is made up largely of interfaces that define these various callback methods. You would implement the ContentHandler interface, and provide an implementation for the characters() method (for example) to handle events triggered by character processing. Figure provides a visual overview of the API; you'll see that it's remarkably simple.
SAX is a powerful API, even though it's largely interfaces and a few helper classes
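As a rough illustration of implementing characters(), the sketch below collects a document's text into a buffer. TextCollector and textOf are hypothetical names, and the XMLReader again comes from the JDK's JAXP factory, which is an assumption on my part. One detail worth knowing: a parser is allowed to deliver a single run of text in several characters() calls, so the handler appends to a StringBuilder rather than assuming one call per text node.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: gathers all character data in a document.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only the slice [start, start + length) belongs to this event;
        // the rest of the array is the parser's internal buffer.
        text.append(ch, start, length);
    }

    // Parses the XML string and returns its concatenated character data.
    static String textOf(String xml) {
        try {
            // Assumption: the JDK's JAXP factory is used to get an XMLReader.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            TextCollector collector = new TextCollector();
            reader.setContentHandler(collector);
            reader.parse(new InputSource(new StringReader(xml)));
            return collector.text.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf("<title>Java &amp; XML</title>"));
    }
}
```

Because the handler extends the DefaultHandler helper class, every callback it doesn't override is a no-op, which keeps the example down to the one method it cares about.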
Keep in mind that a SAX-compliant parser will not implement many of these interfaces (EntityResolver, ContentHandler, ErrorHandler, etc.); that's your job as the programmer. The parser from your vendor will implement the XMLReader interface, and a few helper interfaces like Attributes, and provide parsing behavior; everything else is left up to you.
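Here's one way that division of labor might look in code. The parser supplies the XMLReader; the nested Report class (a hypothetical name, as is HandlerWiring) supplies the ContentHandler, ErrorHandler, and EntityResolver by extending DefaultHandler, which provides empty implementations of all three. Treat this as a sketch, not the one true setup: it assumes the JDK's JAXP factory, and it assumes the parser stops with an exception after reporting a fatal error, which is common default behavior but something to confirm for your own parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: wires programmer-written handlers to a parser.
class HandlerWiring {

    // Our side of the contract: content and error callbacks.
    static class Report extends DefaultHandler {
        int elementCount;
        final List<String> problems = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elementCount++;
        }

        @Override
        public void warning(SAXParseException e) {
            problems.add("warning: " + e.getMessage());
        }

        @Override
        public void error(SAXParseException e) {
            problems.add("error: " + e.getMessage());
        }

        @Override
        public void fatalError(SAXParseException e) {
            problems.add("fatal: " + e.getMessage());
        }
    }

    static Report parse(String xml) {
        Report report = new Report();
        try {
            // The parser's side of the contract: an XMLReader implementation.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(report);
            reader.setErrorHandler(report);
            reader.setEntityResolver(report); // inherited default: no custom resolution
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Usually already recorded by fatalError(); record it if not.
            if (report.problems.isEmpty()) {
                report.problems.add("exception: " + e.getMessage());
            }
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Could not set up or read the document", e);
        }
        return report;
    }

    public static void main(String[] args) {
        Report bad = parse("<a><b></a>"); // mismatched tags
        System.out.println(bad.problems);
    }
}
```

A well-formed document leaves the problems list empty, while the mismatched tags in main produce at least one entry; the exact message text depends on the parser.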
SAX Parsing Setup
Like most APIs, getting set up to work with SAX just involves a download or two. You'll need the SAX classes and interfaces, obviously, as well as a concrete implementation of those interfaces. This is all usually bundled into one download; for example, the Apache Xerces project allows you to download one large file that contains several JAR files, covering everything from the SAX API to several parser implementations to examples and help files.
Choose the parser you want to use (or, if you're at a big company, ask your boss or co-workers what parser they're using), and download that parser's implementation. For Xerces, visit http://xml.apache.org/xerces2-j, and click the Download link on the left. Navigate to the correct download (the binary release is what most users want), and grab the file from Apache's server, or a mirror (see Figure).
Windows users, download the ZIP file; Unix and Mac OS X geeks, try the GZIPped TAR file
You'll need to consult your parser documentation as to what your classpath should look like. For Xerces, you'll want to include the xml-apis.jar and xercesImpl.jar files on your classpath; both are in the Xerces distribution's bin/ directory. For example, here's a fragment of my .profile on Mac OS X:
    export JAVA_BASE=/usr/local/java
    export XERCES_HOME=$JAVA_BASE/xerces-2_6_2
    export XALAN_HOME=$JAVA_BASE/xalan-j_2_6_0
    export CVS_RSH=ssh
    export PS1="[`whoami`:\w] "
    export CLASSPATH=$XERCES_HOME/xml-apis.jar:$XERCES_HOME/xercesImpl.jar
In the Xerces case, xml-apis.jar contains XML standard APIs like SAX and DOM, and (you never would have guessed this) xercesImpl.jar is the Xerces implementation of these APIs.
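If you want a quick sanity check that your classpath is doing what you think, a tiny program like the following can report which XMLReader implementation actually gets loaded. ParserCheck is a hypothetical name, and the sketch assumes you look the parser up through the JAXP SAXParserFactory. With the Xerces JARs on the classpath, I'd expect the printed class name to sit in an org.apache.xerces package; without them, you'll typically see the parser bundled with your JDK. Don't take my word for the exact name, though; run it and see.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical example class: reports which SAX parser the classpath provides.
class ParserCheck {

    // Returns the class name of the XMLReader implementation that was found.
    static String readerClassName() {
        try {
            // Assumption: the parser is discovered through the JAXP factory.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return reader.getClass().getName();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("No usable SAX parser found", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("XMLReader implementation: " + readerClassName());
    }
}
```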