Parsing XML Documents





Introduction

In this chapter, we discuss the parsing of XML documents using an XML processor. Parsing is the process of reading a document and dissecting it into its elements and attributes, which can then be analyzed. In XML, parsing is done by an XML processor, the most fundamental building block of a Web application. In this book, Apache Xerces is used as the implementation of the XML processor, and we show how to design and develop Web applications with it. We begin this chapter by setting up a programming environment with Java and Xerces. Next, we discuss how to read and parse a simple XML document. We use various examples, including well-formed documents, valid documents with Document Type Definitions (DTDs) or XML Schema, and a document that contains namespaces. We finish by explaining how to do basic programming using two common APIs: DOM and SAX.

1 XML Processors

As explained in Chapter 1, an XML processor is a software module that reads XML documents and provides application programs with access to their content and structure. The XML 1.0 specification from W3C precisely defines the functions of an XML processor. The behavior of a conforming XML processor is therefore highly predictable, so switching to another conforming XML processor should not be difficult.

Figure 1 shows the role of an XML processor. An XML processor is a bridge between an XML document and an application. The XML processor parses and generates an XML document. The application uses an API to access objects that represent part of the XML document. DOM and SAX are well-known APIs for accessing the structure of an XML document. Throughout this book, you will learn the details of these APIs.

Figure 1. Using an XML processor
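The bridge role described above can be illustrated with a minimal sketch. It assumes the standard JAXP API (javax.xml.parsers) that ships with current JDKs rather than the Xerces-specific classes used by the samples in this book, and the class name ProcessorBridgeSketch and the tiny department document are invented for this illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Sketch only: uses JAXP from the JDK, not the Xerces-specific API.
class ProcessorBridgeSketch {

    // Hands the XML text to the processor and returns the in-memory DOM tree.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document, invented for this illustration.
        String xml = "<department><employee id=\"J.D\">"
                   + "<name>John Doe</name></employee></department>";

        Document doc = parse(xml);

        // The application never touches the raw text; it works with the
        // objects that the processor exposes through the DOM API.
        Element root = doc.getDocumentElement();
        Element employee = (Element) root.getElementsByTagName("employee").item(0);

        System.out.println(root.getTagName());           // department
        System.out.println(employee.getAttribute("id")); // J.D
        System.out.println(employee.getTextContent());   // John Doe
    }
}
```

The application code in main sees only Document and Element objects, which is the sense in which the processor sits between the document and the application.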

XML processors are categorized as validating and non-validating processors (see Section 1.4.2 for an explanation of validity and well-formedness). When reading an XML document, a non-validating processor checks the well-formedness constraints as defined in the XML 1.0 specification and reports any violations. A validating processor must check the validity constraints and the well-formedness constraints.
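The difference between the two kinds of processor can be seen in a short sketch. It assumes the JAXP API bundled with the JDK, where a single factory switch (setValidating) selects validating or non-validating behavior; the class name ValidationSketch, the note document, and the message prefixes are hypothetical and exist only for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: contrasts well-formedness checking with validity checking.
class ValidationSketch {

    // Parses the text and returns a description of every problem reported.
    static List<String> check(String xml, boolean validating) throws Exception {
        final List<String> problems = new ArrayList<>();
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validating); // true = also check validity constraints
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new DefaultHandler() {
            // Validity violations arrive here; parsing continues afterward.
            @Override
            public void error(SAXParseException e) {
                problems.add("invalid: " + e.getMessage());
            }
            // fatalError is inherited: it rethrows, which aborts the parse.
        });
        try {
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Well-formedness violations are fatal for both kinds of processor.
            problems.add("not well-formed: " + e.getMessage());
        }
        return problems;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical DTD: a note must contain exactly one "to" element.
        String dtd = "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>";
        String invalid = dtd + "<note><from>x</from></note>";

        System.out.println(check(invalid, false));             // non-validating
        System.out.println(check(invalid, true));              // validating
        System.out.println(check("<note><to>x</note>", false)); // broken nesting
    }
}
```

With the non-validating setting, the document that violates its DTD is accepted because it is still well-formed. The validating setting reports the violation, and mismatched tags are rejected under either setting.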

In this book, we use the Java version of Apache Xerces, a validating (and non-validating) XML processor. Xerces was developed by the Apache Xerces team (one of the authors is a main member of the development team) and is one of the most robust and faithful implementations of an XML processor. In the first edition of this book, the XML for Java Parser (aka XML4J), developed by another one of the authors, was used. XML for Java was donated to Apache, an open source community, in 1999, and it is now called Xerces. If you want to use Xerces commercially, please read the license document on the Apache Xerces Web site (http://xml.apache.org).

The complete current release of Xerces is included on the accompanying CD-ROM. You can also download the latest version of Xerces from the Apache Xerces Web site.

2 Working with Xerces

Before installing Xerces, you need to set up your Java programming environment. All the programs used in this book have been tested against the Java 2 SDK (versions 1.2 and 1.3). The setup steps are as follows:

  1. Install the Java 2 SDK (version 1.2 or 1.3).

  2. Install Xerces version 1.4.3.

  3. Add Xerces's jar files to the CLASSPATH environment variable.

Xerces is written in Java, so you first need to have Java 2 installed on your system. If needed, you can download the latest release from the Sun Microsystems Web site at http://java.sun.com. In this book, we assume you have installed the Java 2 SDK in C:\jdk.

The second step in setting up your programming environment is to install Xerces. In developing our sample programs, we used Xerces version 1.4.3. The CD-ROM that accompanies this book contains that version. To install Xerces:

  1. Install Xerces on your system.

  2. On the CD-ROM, move to the directory containing Xerces.

  3. Unzip Xerces-J-bin.1.4.3.zip.

We assume you have installed Xerces in C:\xerces-1_4_3.

Note that because Xerces is written in Java, it can in principle run on any operating system and hardware that supports Java. However, platforms differ in some details, for example, in how environment variables are set. We use Windows (95/98/Me/NT/2000) in our command-line input/output examples in this book. If you use a different platform, replace the command prompts and certain shell commands with those appropriate for your platform.

The third step in setting up your programming environment is to set the CLASSPATH environment variable to tell the Java interpreter where to find the Java libraries. To execute the sample programs in this book, you must have in your CLASSPATH the jar files c:\xerces-1_4_3\xerces.jar and c:\xerces-1_4_3\xercesSamples.jar. You might also want to include the current directory (.) and the sample directory of the CD-ROM (R:\samples) in your CLASSPATH. You can set both of these in Windows 95/98/Me by using the following command:

c:\xerces-1_4_3>set CLASSPATH=".;c:\xerces-1_4_3\xerces.jar;c:\xerces-1_4_3\xercesSamples.jar"

You might also want to add this command line to your profile to avoid having to type it every time you bring up a new command prompt. In Windows 95/98/Me, you add it to the autoexec.bat file. In Windows NT, you add it by right-clicking My Computer and then left-clicking System Properties and the Environment tab; then add the new variable CLASSPATH (similar operations are needed in Windows 2000).
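One way to confirm that the Java interpreter actually sees the entries you set is to print its class path. The following is a minimal sketch; the class name ClasspathSketch is hypothetical, and java.class.path is the standard system property that holds the effective class path.

```java
import java.io.File;
import java.util.regex.Pattern;

// Sketch only: lists the class path entries the interpreter is using.
class ClasspathSketch {

    // Splits a class path string on the platform separator (";" on Windows).
    static String[] entries(String classPath) {
        return classPath.split(Pattern.quote(File.pathSeparator));
    }

    public static void main(String[] args) {
        for (String entry : entries(System.getProperty("java.class.path"))) {
            // Flag entries that do not exist, such as a mistyped jar path.
            String note = new File(entry).exists() ? "" : "  (not found)";
            System.out.println(entry + note);
        }
    }
}
```

If xerces.jar or xercesSamples.jar is marked as not found, the path in CLASSPATH is probably mistyped.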

NOTE

When you are working with Xerces, you might want to know what the version is. The easiest way to find out is to type the following commands:

R:\samples>java org.apache.xerces.framework.Version
Xerces 1.4.3

If you are using the Java 2 SDK provided by IBM, be careful about which version of Xerces the Java interpreter actually loads. IBM's Java 2 SDK 1.3 places a copy of Xerces in the directory jdk\jre\lib\ext, and the Java interpreter recognizes all the jar files in this directory before it reads CLASSPATH. If that copy of Xerces is old, you may encounter errors. To avoid this, you can simply delete that xerces.jar or replace it with the latest version. Another way to use an appropriate version of Xerces is to specify -Djava.ext.dirs=nulldir when you execute the java command. This option tells the interpreter not to load the jar files in the ext directory.
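When you suspect that the wrong copy of the parser is being picked up, it can help to ask the runtime where a class was loaded from. The sketch below is one possible way to do that, assuming the JAXP factory API from the JDK; the class name ParserOriginSketch is hypothetical, and the output depends entirely on your installation.

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParserFactory;

// Sketch only: reports which jar (if any) supplied the parser class.
class ParserOriginSketch {

    // Describes where the given class was loaded from.
    static String describe(Class<?> c) {
        CodeSource source = c.getProtectionDomain().getCodeSource();
        // Classes that belong to the Java runtime itself may have no code source.
        String location = (source == null || source.getLocation() == null)
                ? "the Java runtime itself"
                : source.getLocation().toString();
        return c.getName() + " loaded from " + location;
    }

    public static void main(String[] args) throws Exception {
        // Ask JAXP for whatever SAX parser implementation it finds first.
        Object parser = SAXParserFactory.newInstance().newSAXParser();
        System.out.println(describe(parser.getClass()));
    }
}
```

If the reported location is a jar in the ext directory rather than the one named in your CLASSPATH, the situation described in this note applies.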

To see whether the installation was successful, move to the installation directory (c:\xerces-1_4_3) and enter the following commands:

c:\xerces-1_4_3>java sax.SAXCount data/personal.xml
data/personal.xml: 2.160 ms (37 elems, 18 attrs, 140 spaces, 128 chars)

This program parses an XML document and reports the number of elements, attributes, and so on.
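To give a feel for how such a counting program can work, here is a minimal sketch built on SAX callbacks. It is not the source of the sample shipped with Xerces: it assumes the JAXP SAX API from the JDK, the class name CountSketch and the small personnel document are invented for illustration, and it counts only elements, attributes, and characters.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: counts elements, attributes, and characters with SAX events.
class CountSketch extends DefaultHandler {
    int elements;
    int attributes;
    int characters;

    // Called by the parser once for every start tag.
    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        elements++;
        attributes += atts.getLength();
    }

    // Called for character data; one text run may arrive in several pieces.
    @Override
    public void characters(char[] ch, int start, int length) {
        characters += length;
    }

    // Parses the given XML text and returns the handler holding the totals.
    static CountSketch count(String xml) throws Exception {
        CountSketch handler = new CountSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document, invented for this illustration.
        String xml = "<personnel><person id=\"a1\"><name>Ann</name></person>"
                   + "<person id=\"b2\"/></personnel>";
        CountSketch c = count(xml);
        // Prints: 4 elems, 2 attrs, 3 chars
        System.out.println(c.elements + " elems, " + c.attributes
                + " attrs, " + c.characters + " chars");
    }
}
```

Because SAX pushes events to the handler as the document is read, the totals can be accumulated without building a tree in memory.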

An alternative way to tell the Java interpreter where to find the jar files is to enter the following command:

c:\xerces-1_4_3>java -classpath "c:\xerces-1_4_3\xerces.jar;c:\xerces-1_4_3\xercesSamples.jar" sax.SAXCount data/personal.xml
data/personal.xml: 260 ms (37 elems, 18 attrs, 140 spaces, 128 chars)

Now you are ready to try the sample programs on the CD-ROM. Go to the samples directory, which contains all the samples in this chapter. Note that in our samples we use "R" for the CD-ROM drive; you should substitute the correct letter for your own CD-ROM drive.

The samples directory contains sample programs for each chapter, and package names are assigned to the classes. For example, the SimpleParse class used in this chapter has the package name chap02.

Enter the following command to launch the program SimpleParse to read the document department.xml:

R:\samples>java chap02.SimpleParse chap02/department.xml

You will see no output. This is expected, because this sample program prints nothing when it parses the document successfully.

All the sample programs in this book are included on the CD-ROM. Installation instructions for the tools used in the chapters are described in the readme.html file stored in directories for each chapter. Take a few moments to explore the CD-ROM before moving on to the next section.

