Basics of Parsing Documents





This section describes how to parse well-formed and valid XML documents, and shows the differences between them.

1 Parsing Well-Formed Documents

In this section, we show how to read a simple XML document, called department.xml, using Xerces. This document represents a set of employee records in a department (see Listing 2.1). The meanings of the tags should be self-explanatory.

Listing 2.1 Simple XML document, employee records for a department, chap02/department.xml
<?xml version="1.0" encoding="utf-8"?>
<department>
   <employee id="J.D">
      <name>John Doe</name>
      <email>[email protected]</email>
   </employee>

   <employee id="B.S">
      <name>Bob Smith</name>
      <email>[email protected]</email>
   </employee>

   <employee id="A.M">
      <name>Alice Miller</name>
      <url href="http://www.foo.com/~amiller/"/>
   </employee>
</department>

This is a well-formed XML document, so it can be parsed by a non-validating XML processor. The first task of this book is to read and parse the document by using Xerces. We run the sample program, SimpleParse, located in samples\chap02 on the CD-ROM, using the following commands:

R:\samples>java chap02.SimpleParse chap02/department.xml
R:\samples>

This program, as in the previous section, produces no output. However, we know that Xerces did its job, because SimpleParse did the following:

  • Opened the XML document department.xml

  • Parsed it with an XML processor, which is described later

  • Created a corresponding data structure in memory (here, a Java object) that application programs can later refer to or manipulate

The fact that you see no output means that there were no violations of well-formedness (missing end tags, improper nesting, and so on). Listing 2.2 gives the source code of SimpleParse. Although a very short program, it shows the basics of how you can use Xerces.

Listing 2.2 Parsing an XML document (non-validating), chap02/SimpleParse.java
       package chap02;
       /**
        *       SimpleParse.java
        **/
[5]    import org.w3c.dom.Document;
[6]    import org.apache.xerces.parsers.DOMParser;
       import org.xml.sax.SAXException;
       import java.io.IOException;

       public class SimpleParse {
          public static void main(String[] argv) {
             if (argv.length != 1) {
                System.err.println(
                   "Usage: java chap02.SimpleParse <filename>");
                System.exit(1);
             }
             try {
[18]            // Creates a parser object
[19]            DOMParser parser = new DOMParser();
[20]            // Parses an XML Document
[21]            parser.parse(argv[0]);
[22]            // Gets a Document object
[23]            Document doc = parser.getDocument();
[24]            // Does something
[25]         } catch (SAXException se) {
[26]             System.out.println("Parser error found: "
[27]                                    +se.getMessage());
[28]             System.exit(1);
             } catch (IOException ioe) {
                 System.out.println("IO error found: "
                           + ioe.getMessage());
                 System.exit(1);
             }
          }
       }

Now we'll look at the program SimpleParse line by line, referring to the numbers in square brackets on the left side of the program listing. First, this class imports some classes to use with Xerces:

  • In line 5, the Document class, from the org.w3c.dom package, which is the interface that represents the whole XML document

  • In line 6, the DOMParser class, from org.apache.xerces.parsers, which is a DOM-based XML processor

Also, two exception classes (SAXException and IOException) are imported.

The heart of this program is in lines 19–23.

[19]   DOMParser parser = new DOMParser();

Line 19 creates a DOM-based processor to parse an XML document.

[21]   parser.parse(argv[0]);

Next, line 21 parses an XML document specified by a command-line argument (argv[0]).

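In this case, the parse() method takes the filename of the XML document. The Xerces DOMParser class accepts either a system identifier (a String) or an org.xml.sax.InputSource, and returns nothing; you retrieve the result with getDocument(). The JAXP class javax.xml.parsers.DocumentBuilder offers the following argument patterns (signatures), each of which returns the Document directly, and you can choose the appropriate one.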

  • Document parse (String uri)

  • Document parse (java.io.File f)

  • Document parse (org.xml.sax.InputSource is)

  • Document parse (java.io.InputStream is)

  • Document parse (java.io.InputStream is, String systemId)

The third one requires an object of the org.xml.sax.InputSource class, which is useful to wrap various input formats for an XML document to be parsed.

Though it is originally from the SAX 1.0 API, it is widely used for a DOM parser as well as a SAX parser.

The class has four constructors:

  • InputSource ()

  • InputSource (java.io.InputStream byteStream)

  • InputSource (java.io.Reader characterStream)

  • InputSource (java.lang.String systemId)

If you want to write a method (say, processWithParse()) that takes an XML input as an argument, processWithParse(InputSource is) is more reusable than processWithParse(File f) or processWithParse(String url), because callers can wrap a file, a stream, a reader, or a URL in an InputSource.
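The reuse argument can be illustrated with a minimal sketch. It assumes the JAXP API bundled with the JDK (javax.xml.parsers.DocumentBuilder) rather than the Xerces DOMParser class, so that it runs without extra libraries. The class name InputSourceDemo and the body of processWithParse() are illustrative only.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class InputSourceDemo {
    // Hypothetical helper: it accepts any InputSource, so the caller decides
    // whether the XML comes from a string, a stream, a file, or a URL.
    static String processWithParse(InputSource source)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(source);
        // Return the name of the document element as a stand-in for real work.
        return doc.getDocumentElement().getTagName();
    }

    public static void main(String[] args) throws Exception {
        // Input held in memory, wrapped through a Reader.
        String xml = "<department><employee id=\"J.D\"/></department>";
        System.out.println(
                processWithParse(new InputSource(new StringReader(xml))));

        // Input named by a file name or URL given on the command line.
        if (args.length == 1) {
            System.out.println(processWithParse(new InputSource(args[0])));
        }
    }
}
```

The same method serves both callers without overloads; only the InputSource constructor changes.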

[23]   Document doc = parser.getDocument();

Line 23 receives the Document instance. The org.w3c.dom.Document interface is specified by the DOM specification from W3C. The variable doc actually refers to an instance of an implementation class (org.apache.xerces.dom.DocumentImpl) provided by Xerces. The instance represents the whole XML document and can contain (1) at most one DocumentType instance that represents a DTD, (2) one Element instance that represents a root element (which is called a document element), and (3) zero or more Comment and ProcessingInstruction instances. The interface provides methods to visit and modify child nodes of the root element. For example, an application can get the root (document) element of an XML document by using the getDocumentElement() method of the Document interface. This sample program is simple, but you can see many other programs in this book.
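As a minimal sketch of what an application might do with the Document, the following program gets the document element and visits its child nodes to collect the id attribute of each employee. It assumes the JDK's JAXP parser in place of the Xerces DOMParser class; the class and method names (DepartmentWalk, employeeIds) are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DepartmentWalk {
    // Parses an XML string into a DOM Document (JAXP is an assumption here).
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Collects the id attribute of every employee child of the root element.
    static List<String> employeeIds(Document doc) {
        List<String> ids = new ArrayList<>();
        Element root = doc.getDocumentElement();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Skip whitespace text nodes and anything that is not <employee>.
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && "employee".equals(child.getNodeName())) {
                ids.add(((Element) child).getAttribute("id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<department>\n"
                + "  <employee id=\"J.D\"><name>John Doe</name></employee>\n"
                + "  <employee id=\"B.S\"><name>Bob Smith</name></employee>\n"
                + "  <employee id=\"A.M\"><name>Alice Miller</name></employee>\n"
                + "</department>";
        System.out.println(employeeIds(parse(xml))); // prints [J.D, B.S, A.M]
    }
}
```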

When something goes wrong, the program throws an exception. The program shown in Listing 2.2 catches the following two exceptions:

  • java.io.IOException: Occurs when the XML processor fails to load the XML document (because the file was not found, for example)

  • org.xml.sax.SAXException: Occurs when the input document violates the well-formedness constraints

You might think that this program has no practical value because it does not produce any output. However, it is useful as a syntax checker. It can tell you whether the input XML document is well-formed or not. To show you how this works, we give an XML document that is not well-formed, department2.xml, to SimpleParse in Listing 2.3.

Listing 2.3 Not well-formed XML document, chap02/department2.xml
<?xml version="1.0" encoding="utf-8"?>
<department>
   <employee id="J.D">
      <name>John Doe</name>
      <email>[email protected]</email1>
   </employee>

   <employee id="B.S">
      <name>Bob Smith</name>
      <email>[email protected]</email>
   </employee>

   <employee id="A.M">
      <name>Alice Miller</name>
      <url href="http://www.foo.com/~amiller/"/>
   </employee>
</department>

This document is not well-formed, because the end tag of the first email element is </email1>, not </email>. The result of parsing the document is as follows:

R:\samples>java chap02.SimpleParse chap02/department2.xml
Parser error found: The element type "email" must be terminated by
the matching end-tag "</email>".

The XML processor recognizes the mismatch of the start and end tags, and reports it to applications by an exception (SAXException). In Listing 2.2, the exception is caught in lines 25–28.
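The syntax-checker idea can be reduced to a small reusable method. The sketch below is not the book's SimpleParse; it assumes the JDK's JAXP parser and parses from a string so that it is self-contained. The names WellFormedCheck and check() are hypothetical. It catches SAXParseException, the subclass of SAXException that carries the line number of the violation.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class WellFormedCheck {
    // Returns null when the input is well-formed; otherwise returns a short
    // description of the first problem that stopped the parser.
    static String check(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // A well-formedness violation, with its position in the input.
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        String good = "<department><name>John Doe</name></department>";
        // The end tag </name1> does not match the start tag <name>.
        String bad = "<department>\n  <name>John Doe</name1>\n</department>";
        System.out.println("good: " + check(good));
        System.out.println("bad:  " + check(bad));
    }
}
```

The JDK parser may also print its own diagnostic to standard error when no error handler is registered; the return value of check() is what the application should rely on.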

2 Parsing Valid Documents

In this section, we parse an XML document and validate it against a DTD. An example called department-dtd.xml is shown in Listing 2.4. The DOCTYPE declaration (the second line) tells an XML processor the location of the DTD.

Listing 2.4 XML document with DTD, chap02/department-dtd.xml
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE department SYSTEM "department.dtd">
<department>
    <employee id="J.D">
       <name>John Doe</name>
       <email>[email protected]</email>
    </employee>

    <employee id="B.S">
       <name>Bob Smith</name>
       <email>[email protected]</email>
    </employee>

    <employee id="A.M">
       <name>Alice Miller</name>
       <url href="http://www.foo.com/~amiller/"/>
    </employee>
</department>

The DTD for the document is shown in Listing 2.5.

Listing 2.5 DTD for XML document, chap02/department.dtd
<!ELEMENT department (employee)*>
<!ELEMENT employee (name, (email | url))>
<!ATTLIST employee id CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT url EMPTY>
<!ATTLIST url href CDATA #REQUIRED>

As shown in Section 1.4.2, a DTD specifies the structure of an XML document. For example, the first element declaration in Listing 2.5 says that a department element contains zero or more employee elements. The second declaration says that an employee element must have a name element as its first child and either an email or a url element as its second child. The third declaration says that an employee element must have an id attribute. The keyword #PCDATA means parsed character data (text), and EMPTY means that the url element cannot have any content. Refer to the XML 1.0 specification for the details.

Xerces is a validating processor, but it does not validate by default. So we must tell Xerces to validate an input XML document against the DTD. Listing 2.6 shows a sample program for the validation.

Listing 2.6 Parsing an XML document (validating), chap02/SimpleParseWithValidation.java
       package chap02;
       /**
        *       SimpleParseWithValidation.java
        **/
       import org.w3c.dom.Document;
       import org.xml.sax.InputSource;
       import org.xml.sax.SAXException;
       import org.xml.sax.SAXParseException;
       import org.xml.sax.ErrorHandler;
       import org.apache.xerces.parsers.DOMParser;
       import share.util.MyErrorHandler;
       import java.io.IOException;

       public class SimpleParseWithValidation {

          public static void main(String[] argv) {
             if (argv.length != 1) {
                System.err.println("Usage: java "+
                   "chap02.SimpleParseWithValidation <filename>");
                System.exit(1);
             }
             try {
                // Creates parser object
                DOMParser parser = new DOMParser();
[25]            // Sets an ErrorHandler
[26]            parser.setErrorHandler(new MyErrorHandler());
[27]            // Tells the parser to validate documents
[28]            parser.setFeature(
                   "http://xml.org/sax/features/validation",
                   true);
[31]            // Parses an XML Document
[32]            parser.parse(argv[0]);
[33]            // Gets a Document object
[34]            Document doc = parser.getDocument();
                // Does something
            } catch (Exception e) {
                e.printStackTrace();
            }
          }
       }

Again, let's look at the program in detail. First, a DOMParser object is created.

In SimpleParse, shown in Listing 2.2, we caught a SAXException when an input XML document was not well-formed. An XML processor also lets you register an error handler to handle errors more flexibly. The error handler is notified of three kinds of problems: fatal errors, which prevent the processor from continuing normal parsing; errors, as defined in the XML 1.0 Recommendation; and warnings for other problems.

Error handlers must implement the org.xml.sax.ErrorHandler interface. There are two common approaches to creating an error handler.

  • The application class itself implements org.xml.sax.ErrorHandler.

  • A separate class implements the interface, and the application uses that class.

If you can work with a general error handler that can be shared with other applications, the latter approach is good in terms of software reuse. If you want to use an application-specific handler, or you don't want to create a new class for the handler for some reason, the former approach may be better.
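For comparison, the former approach might look like the following sketch, in which the application class implements ErrorHandler and registers itself with the parser. It assumes the JDK's JAXP DocumentBuilder instead of the Xerces DOMParser class, and the class name SelfHandlingApp is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// The application is its own error handler, so handler logic can use
// application state (here, a simple counter) directly.
class SelfHandlingApp implements ErrorHandler {
    private int problems = 0;

    @Override
    public void warning(SAXParseException e) {
        report("Warning", e);
    }

    @Override
    public void error(SAXParseException e) {
        report("Error", e);
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        report("Fatal Error", e);
        throw e; // parsing cannot continue after a fatal error
    }

    private void report(String level, SAXParseException e) {
        problems++;
        System.err.println("[" + level + "] " + e.getLineNumber() + ":"
                + e.getColumnNumber() + ": " + e.getMessage());
    }

    // Parses the given XML text and returns the number of problems
    // reported to this object so far.
    int parse(String xml) throws ParserConfigurationException, IOException {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setErrorHandler(this);
        try {
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Already reported through fatalError(); nothing more to do.
        }
        return problems;
    }

    public static void main(String[] args) throws Exception {
        SelfHandlingApp app = new SelfHandlingApp();
        int count = app.parse("<department><employee></department>");
        System.out.println("problems reported: " + count);
    }
}
```

The handler here is tied to one application, which is the trade-off described above: convenient access to application state, but nothing that other programs can share.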

This book employs the latter approach. MyErrorHandler, shown in Listing 2.7, is a typical implementation of an error handler.

Listing 2.7 Handling errors, share/util/MyErrorHandler.java
package share.util;

import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class MyErrorHandler implements ErrorHandler {
   /** Constructor. */
   public MyErrorHandler(){
   }
   /** Warning. */
   public void warning(SAXParseException ex) {
      System.err.println("[Warning] "+
                         getLocationString(ex)+": "+
                         ex.getMessage());
   }
   /** Error. */
   public void error(SAXParseException ex) {
      System.err.println("[Error] "+
                         getLocationString(ex)+": "+
                         ex.getMessage());
   }
   /** Fatal error. */
   public void fatalError(SAXParseException ex) {
      System.err.println("[Fatal Error] "+
                         getLocationString(ex)+": "+
                         ex.getMessage());
   }
   /** Returns a string of the location. */
   private String getLocationString(SAXParseException ex) {
      StringBuffer str = new StringBuffer();

      String systemId = ex.getSystemId();
      if (systemId != null) {
         int index = systemId.lastIndexOf('/');
         if (index != -1)
            systemId = systemId.substring(index + 1);
         str.append(systemId);
      }
      str.append(':');
      str.append(ex.getLineNumber());
      str.append(':');
      str.append(ex.getColumnNumber());

      return str.toString();
   }

}

The org.xml.sax.ErrorHandler interface defines fatalError(), error(), and warning(). The MyErrorHandler class implements these methods to show a filename, line and column numbers, and the content of an error.

In SimpleParseWithValidation (see Listing 2.6), a MyErrorHandler instance is created in line 26 and registered with the parser object.

[25]   // Sets an ErrorHandler
[26]   parser.setErrorHandler(new MyErrorHandler());

Next, we tell the XML processor to turn on validation by using the setFeature() method. This is a method of the org.xml.sax.XMLReader interface that is implemented by the DOMParser class. The method is used to set various features of an XML processor. In this book, we use some of the features (see Section 6.3.1 for more on these features). Refer to http://xml.apache.org/xerces-j/features.html for the complete list of features. Note that the default value of the validation feature ("http://xml.org/sax/features/validation") is false, so SimpleParse in the previous section did not check the validity of the XML document.

[27]   // Tells the parser to validate documents
[28]   parser.setFeature("http://xml.org/sax/features/validation", true);
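The default value and the effect of setFeature() can be observed directly on any org.xml.sax.XMLReader. The following sketch obtains a reader through the JDK's SAXParserFactory, which is an assumption made so that the example runs without the Xerces jar; the class name FeatureToggle is hypothetical. It reads the validation feature before and after turning it on.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

class FeatureToggle {
    static final String VALIDATION =
            "http://xml.org/sax/features/validation";

    // Returns the value of the validation feature before and after
    // it is switched on with setFeature().
    static boolean[] beforeAndAfter() throws Exception {
        XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        boolean before = reader.getFeature(VALIDATION);
        reader.setFeature(VALIDATION, true);
        boolean after = reader.getFeature(VALIDATION);
        return new boolean[] { before, after };
    }

    public static void main(String[] args) throws Exception {
        boolean[] values = beforeAndAfter();
        System.out.println("validation by default: " + values[0]);
        System.out.println("validation after setFeature: " + values[1]);
    }
}
```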

Finally, we start parsing. This is the same process as in SimpleParse.

[31]   // Parses an XML Document
[32]   parser.parse(argv[0]);
[33]   // Gets a Document object
[34]   Document doc = parser.getDocument();

Now we run this program to parse a valid XML document, department-dtd.xml.

R:\samples>java chap02.SimpleParseWithValidation chap02/department-dtd.xml
R:\samples>

Because department-dtd.xml shown in Listing 2.4 conforms to department.dtd (see Listing 2.5), it should be parsed without error. The next example is an invalid document, department-dtd2.xml, shown in Listing 2.8.

Listing 2.8 Invalid XML document, chap02/department-dtd2.xml
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE department SYSTEM "department.dtd">
<department>
   <employee>
      <name>John Doe</name>
      <email>[email protected]</email>
   </employee>

   <employee id="B.S">
      <name>Bob Smith</name>
      <email>[email protected]</email>
   </employee>

   <employee id="A.M">
      <name>Alice Miller</name>
      <url href="http://www.foo.com/~amiller/"/>
   </employee>
</department>

When we parse the document with SimpleParseWithValidation, we can see an error because the document does not conform to the DTD.

R:\samples>java chap02.SimpleParseWithValidation chap02/department-dtd2.xml
[Error] 4:13 Attribute "id" is required and must be specified for
element type "employee".

As shown in the previous output, the fourth line of department-dtd2.xml has an error. The employee element does not have an id attribute, although it is required. Errors and warnings with line numbers make it possible to recognize where and why they occurred.
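The same check can be reproduced in a self-contained form. The sketch below makes several assumptions: it uses the JDK's JAXP API, where DocumentBuilderFactory.setValidating(true) requests DTD validation; it places the declarations of Listing 2.5 in an internal DTD subset so that no external file is needed; and it collects error messages in a list instead of printing them. The class name ValidationDemo is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidationDemo {
    // The declarations of Listing 2.5, embedded as an internal DTD subset.
    static final String DTD = "<!DOCTYPE department [\n"
            + "<!ELEMENT department (employee)*>\n"
            + "<!ELEMENT employee (name, (email | url))>\n"
            + "<!ATTLIST employee id CDATA #REQUIRED>\n"
            + "<!ELEMENT name (#PCDATA)>\n"
            + "<!ELEMENT email (#PCDATA)>\n"
            + "<!ELEMENT url EMPTY>\n"
            + "<!ATTLIST url href CDATA #REQUIRED>\n"
            + "]>\n";

    // Parses with validation on and returns the validity errors reported.
    static List<String> validityErrors(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            @Override
            public void warning(SAXParseException e) {
                // Warnings are ignored in this sketch.
            }

            @Override
            public void error(SAXParseException e) {
                errors.add(e.getLineNumber() + ": " + e.getMessage());
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                throw e;
            }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        String body = "<name>Alice Miller</name>"
                + "<url href=\"http://www.foo.com/~amiller/\"/>";
        String valid = DTD + "<department><employee id=\"A.M\">" + body
                + "</employee></department>";
        // Same content, but the required id attribute is missing.
        String invalid = DTD + "<department><employee>" + body
                + "</employee></department>";
        System.out.println("valid:   " + validityErrors(valid));
        System.out.println("invalid: " + validityErrors(invalid));
    }
}
```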

NOTE

The difference between an error and a fatal error is defined in the XML 1.0 specification. An error is a violation of the rules of the specification. A conforming XML processor may detect and report an error and may recover from it. That means an application may get the internal structure of parsed XML documents. Violations of validity constraints are errors. On the other hand, the XML processor must detect and report fatal errors to the application. Once a fatal error is detected, the processor must not continue normal processing. Violations of well-formedness constraints are fatal errors.
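The distinction can be observed with a small experiment, sketched here under the assumption that the JDK's JAXP parser behaves as the note describes. The class name ErrorVersusFatal, the outcome labels, and the one-element DTD are all made up for illustration. A validity error is passed to error() and parsing still completes; a well-formedness violation ends parsing with an exception.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ErrorVersusFatal {
    // Returns "CLEAN" if nothing was reported, "RECOVERED" if errors were
    // reported but a Document was still produced, "ABORTED" on a fatal error.
    static String outcome(String xml)
            throws ParserConfigurationException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        final int[] errorCount = { 0 };
        builder.setErrorHandler(new ErrorHandler() {
            @Override
            public void warning(SAXParseException e) {
            }

            @Override
            public void error(SAXParseException e) {
                errorCount[0]++; // recoverable: record it and let parsing go on
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                throw e; // not recoverable
            }
        });
        try {
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // A Document is available here even if validity errors occurred.
            return (doc != null && errorCount[0] == 0) ? "CLEAN" : "RECOVERED";
        } catch (SAXException e) {
            return "ABORTED";
        }
    }

    public static void main(String[] args) throws Exception {
        String dtd = "<!DOCTYPE a [<!ELEMENT a EMPTY>]>";
        System.out.println(outcome(dtd + "<a/>"));        // valid
        System.out.println(outcome(dtd + "<a>text</a>")); // invalid, well-formed
        System.out.println(outcome(dtd + "<a>"));         // not well-formed
    }
}
```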


3 Design Point: Well-Formed versus Valid

In the previous sections, you learned how to parse well-formed and valid documents. In this section, we discuss which types of documents should be used when you design and develop real Web applications. In other words, what are the pros and cons of validation? This section discusses the design point from several viewpoints.

  • If a document structure is strictly defined by a DTD, an application can skip its own checks of that structure. For example, suppose a DTD specifies a name element as a required element. Applications don't have to check for the existence of the element, because if an XML document has been validated against the DTD by a validating XML processor, the element is guaranteed to appear in the document. If you use XML Schema (discussed in Chapter 9), data type checking is also done by a validating processor. This can prevent applications from failing when they receive data of an unexpected type. That is, validation makes your applications simpler and safer.

  • Validation is an expensive task when very large documents are to be parsed. Even if the documents are not so large, it is time-consuming when many documents must be parsed at the same time in a high-volume Web application. In such cases, we should think carefully about whether we need validation.

  • On non-PC platforms such as PDAs, limited resources might make validation impractical.

  • Whether validation is really needed is itself an important design question. For example, suppose two companies want to exchange XML documents. If both companies know the structure of the documents and the sending company always sends valid ones, the receiving company can expect every document to be valid, and validation is not needed. As discussed before, parsing with validation makes applications safer. However, if the volume of transactions between the companies is very large, parsing without validation may be a good choice.

  • If the structure of the document is not complex and applications do not require all the information in the document, validation might not be needed.
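The first point in the list can be made concrete. When validation is turned off, the application has to perform the structural checks that the DTD would otherwise guarantee. The following sketch assumes the JDK's JAXP parser, and the class and method names (DefensiveAccess, nameOf) are hypothetical. It shows the kind of guard that becomes necessary.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DefensiveAccess {
    // Parses without validation and returns the document element.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // Returns the text of the first name element below the given employee,
    // or null if there is none. With a validated document this check would
    // be redundant, because the DTD requires a name element.
    static String nameOf(Element employee) {
        NodeList names = employee.getElementsByTagName("name");
        if (names.getLength() == 0) {
            return null;
        }
        return names.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        Element complete =
                parseRoot("<employee id=\"J.D\"><name>John Doe</name></employee>");
        Element incomplete = parseRoot("<employee id=\"B.S\"/>");
        System.out.println(nameOf(complete));   // prints John Doe
        System.out.println(nameOf(incomplete)); // prints null
    }
}
```

Every such guard is code that a validating parse would make unnecessary, which is the simplicity benefit weighed against the cost of validation.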

