The ProcessingInstruction Interface

The ProcessingInstruction interface represents a processing instruction such as <?xml-stylesheet type="text/css" href="order.css"?> or <?php echo "Hello World";?>.

Figure summarizes the ProcessingInstruction interface. This interface adds methods to get the target and the data of the processing instruction as strings, and to set the data. Even if the data has a pseudo-attribute format, as in <?xml-stylesheet type="text/css" href="order.css"?>, DOM doesn't recognize that; the data is returned as a single unparsed string. For this processing instruction, the target is xml-stylesheet and the data is type="text/css" href="order.css".

17 The ProcessingInstruction Interface
package org.w3c.dom;

public interface ProcessingInstruction extends Node {

  public String getTarget();
  public String getData();
  public void   setData(String data) throws DOMException;

}

As usual, ProcessingInstruction objects also have all the methods of the Node superinterface, such as getNodeName() and getNodeValue(). The value of a processing instruction is its data. Processing instructions do not have children, however, so Node methods like getFirstChild() return null, and methods such as appendChild() throw a DOMException with the code HIERARCHY_REQUEST_ERR.
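
The behavior described above can be checked with a short program. What follows is a minimal sketch rather than one of the original examples: the class name PIDemo and its two helper methods are hypothetical, and it assumes the JAXP parser bundled with a standard JDK. It parses a small in-memory document, reads the target and data of the processing instruction, and confirms that the Node methods behave as described.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original examples
public class PIDemo {

  // Parses an in-memory document and returns the first processing
  // instruction that is a direct child of the Document node,
  // or null if there is none.
  public static ProcessingInstruction firstPI(String xml) {
    try {
      DocumentBuilder parser
       = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document doc
       = parser.parse(new InputSource(new StringReader(xml)));
      for (Node n = doc.getFirstChild(); n != null;
       n = n.getNextSibling()) {
        if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
          return (ProcessingInstruction) n;
        }
      }
      return null;
    }
    catch (Exception e) {
      // Wrap checked parser exceptions to keep the sketch short
      throw new IllegalArgumentException(e);
    }
  }

  // Tries to append a child to the processing instruction and
  // returns the code of the resulting DOMException, or -1 if
  // no exception was thrown.
  public static short appendChildCode(ProcessingInstruction pi) {
    try {
      Document doc = pi.getOwnerDocument();
      pi.appendChild(doc.createTextNode("not allowed"));
      return -1;
    }
    catch (DOMException e) {
      return e.code;
    }
  }

  public static void main(String[] args) {
    ProcessingInstruction pi = firstPI(
     "<?xml-stylesheet type=\"text/css\" href=\"order.css\"?>"
     + "<order/>");
    System.out.println(pi.getTarget());     // xml-stylesheet
    System.out.println(pi.getData());       // the pseudo-attributes
    System.out.println(pi.getNodeName());   // same as the target
    System.out.println(pi.getNodeValue());  // same as the data
    System.out.println(pi.getFirstChild()); // null
    System.out.println(appendChildCode(pi)
     == DOMException.HIERARCHY_REQUEST_ERR);
  }

}
```

Note that getNodeName() returns the target, so the name and the value together carry exactly the same information as getTarget() and getData().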

As an example, let's extend the earlier XLinkSpider program in Figure so that it respects robots processing instructions. Such an instruction looks like this, and appears in the prolog of an XML document:

<?robots index="yes" follow="no"?> 

The semantics of this instruction are deliberately similar to those of the robots META tag in HTML. That is, follow="yes" means robots should follow links they find in this page; follow="no" means they shouldn't. Similarly, index="yes" means search engines should index this page; index="no" means they shouldn't.

Like many processing instructions, the syntax is based on pseudo-attributes. DOM doesn't provide any means to parse these, even though it's a very common format for processing instructions. However, you can fake DOM out. I'm going to extract the target and data of the processing instruction and use them to form a string that has this format:

<target data/> 

In other words, a processing instruction such as <?robots index="yes" follow="no"?> is going to turn into a String like <robots index="yes" follow="no" />. This string is in turn a well-formed XML document that can be parsed and its attributes extracted. Admittedly, this approach is very circuitous and probably not optimally efficient. On the other hand, it's a lot easier to code and explain than writing your own mini-parser just to handle pseudo-attributes. Figure is a simple utility class that implements this hack. The parsing is completely hidden inside the constructor, so if this is too offensive to your sensibilities, you can replace it with more appropriate code without changing the public interface. Because this class is quite useful in practice, not merely an example for this book, I've placed it in the com.macfaq.xml package. Don't forget to configure your class and source paths appropriately when compiling it.

18 Reading PseudoAttributes from a ProcessingInstruction
package com.macfaq.xml;

import org.w3c.dom.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import java.io.*;

public class PseudoAttributes {

  private NamedNodeMap pseudo;

  public PseudoAttributes(ProcessingInstruction pi)
   throws SAXException {

    StringBuffer sb = new StringBuffer("<");
    sb.append(pi.getTarget());
    sb.append(" ");
    sb.append(pi.getData());
    sb.append("/>");
    StringReader reader = new StringReader(sb.toString());
    InputSource source = new InputSource(reader);
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      DocumentBuilder parser = factory.newDocumentBuilder();

      // This line will throw a SAXException if the processing
      // instruction does not use pseudo-attributes.
      Document doc = parser.parse(source);
      Element root = doc.getDocumentElement();
      pseudo = root.getAttributes();

    }
    catch (FactoryConfigurationError e) {
      // I don't absolutely need to catch this, but I hate to
      // throw an Error for no good reason.
      throw new SAXException(e.getMessage());
    }
    catch (SAXException e) {
      throw e;
    }
    catch (Exception e) {
      throw new SAXException(e);
    }

  }

  // delegator methods
  public Attr item(int index) {
    return (Attr) pseudo.item(index);
  }

  public int getLength() {
    return pseudo.getLength();
  }

  public String getValue(String name) {
    Attr att = (Attr) pseudo.getNamedItem(name);
    if (att == null) return "";
    return att.getValue();
  }

}
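
To see the trick in isolation, here is a condensed, self-contained sketch. It is not the PseudoAttributes class itself: the class name PseudoAttributeSketch and the parse() method are hypothetical, the result is copied into a Map instead of being kept as a NamedNodeMap, and returning null for data that is not in pseudo-attribute format is a choice made only for this illustration. It does rely on the same idea of wrapping the target and data in an empty-element tag and handing the resulting string to a JAXP parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical condensed variant of the technique shown above
public class PseudoAttributeSketch {

  // Returns the pseudo-attributes as name-value pairs, or null
  // if the data is not in pseudo-attribute format.
  public static Map<String, String> parse(String target, String data) {
    // Turn the target and data into a one-element document
    String xml = "<" + target + " " + data + "/>";
    try {
      DocumentBuilder parser
       = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      NamedNodeMap atts
       = parser.parse(new InputSource(new StringReader(xml)))
          .getDocumentElement().getAttributes();
      Map<String, String> result = new HashMap<String, String>();
      for (int i = 0; i < atts.getLength(); i++) {
        Attr att = (Attr) atts.item(i);
        result.put(att.getName(), att.getValue());
      }
      return result;
    }
    catch (SAXException e) {
      // The wrapped string was not well-formed, so the data
      // did not consist of pseudo-attributes
      return null;
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
    catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Map<String, String> robots
     = parse("robots", "index=\"yes\" follow=\"no\"");
    System.out.println(robots.get("index"));
    System.out.println(robots.get("follow"));
    // Data that is not pseudo-attributes cannot be parsed
    System.out.println(parse("php", "echo \"Hello World\";"));
  }

}
```

The parser may also print a fatal error message to System.err when it rejects the wrapped string; that is the default JAXP error reporting, not a failure of the sketch.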

This class makes it easy for the earlier DOMSpider program in Figure to recognize the robots processing instruction. I won't repeat the entire program, most of which hasn't changed. The relevant change is in the spider() method, which now has to look for a robots processing instruction in each document and use that to decide whether or not to call process() (index="yes|no") and/or findLinks() (follow="yes|no").

  public void spider(String systemID) {
    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        Document document = parser.parse(systemID);

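        // Look for a robots PI among the children of the document
        // node and read its index and follow pseudo-attributes;
        // both default to true when no such PI is present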
        boolean index = true;
        boolean follow = true;
        NodeList children = document.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          Node child = children.item(i);
          int type = child.getNodeType();
          if (type == Node.PROCESSING_INSTRUCTION_NODE) {
            ProcessingInstruction pi
             = (ProcessingInstruction) child;
            if (pi.getTarget().equals("robots")) {
               PseudoAttributes pseudo = new PseudoAttributes(pi);
               if (pseudo.getValue("index").equals("no")) {
                 index = false;
               }
               if (pseudo.getValue("follow").equals("no")) {
                 follow = false;
               }
            }
          }
        } // end for

        if (index) process(document, systemID);

        if (follow) {
          Vector toBeVisited = new Vector();
          // search the document for uris,
          // store them in vector, and print them
          findLinks(
           document.getDocumentElement(), toBeVisited, systemID);

          Enumeration e = toBeVisited.elements();
          while (e.hasMoreElements()) {
            String uri = (String) e.nextElement();
            visited.add(uri);
            spider(uri);
          } // end while
        } // end if

      }

    }
    catch (SAXException e) {
      // Couldn't load the document,
      // probably not well-formed XML, skip it
    }
    catch (IOException e) {
      // Couldn't load the document,
      // likely network failure, skip it
    }
    finally {
      currentDepth--;
      System.out.flush();
    }

  }
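
The robots check can also be exercised on its own, without the rest of the spider. The sketch below is self-contained and therefore does not use the PseudoAttributes class. Instead it reads pseudo-attribute values with a small regular expression, which is a deliberate simplification (for instance, it does not resolve character references and can be fooled by a name that appears inside another quoted value). The class name RobotsCheck and both of its methods are hypothetical.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical test harness for the robots logic in spider()
public class RobotsCheck {

  // Simplified stand-in for PseudoAttributes.getValue(): returns
  // the value of the named pseudo-attribute, or "" if it is absent.
  public static String pseudoValue(String data, String name) {
    Pattern p = Pattern.compile("(?:^|\\s)" + Pattern.quote(name)
     + "\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");
    Matcher m = p.matcher(data);
    if (!m.find()) return "";
    return m.group(1) != null ? m.group(1) : m.group(2);
  }

  // Returns a two-element array {index, follow}. Both flags are
  // true unless a robots PI that is a child of the Document node
  // sets them to "no".
  public static boolean[] robotsFlags(String xml) {
    boolean index = true;
    boolean follow = true;
    try {
      DocumentBuilder parser
       = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document document
       = parser.parse(new InputSource(new StringReader(xml)));
      NodeList children = document.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        Node child = children.item(i);
        if (child.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
          ProcessingInstruction pi = (ProcessingInstruction) child;
          if (pi.getTarget().equals("robots")) {
            String data = pi.getData();
            if (pseudoValue(data, "index").equals("no")) index = false;
            if (pseudoValue(data, "follow").equals("no")) follow = false;
          }
        }
      }
    }
    catch (Exception e) {
      // Wrap checked parser exceptions to keep the sketch short
      throw new IllegalArgumentException(e);
    }
    return new boolean[] {index, follow};
  }

  public static void main(String[] args) {
    boolean[] flags = robotsFlags(
     "<?robots index=\"yes\" follow=\"no\"?><doc/>");
    System.out.println("index=" + flags[0] + " follow=" + flags[1]);
  }

}
```

Because the loop only visits the children of the Document node, a robots processing instruction nested inside the root element is ignored, just as it is in the spider() method above.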
