The CharacterData Interface






The CharacterData interface is a generic superinterface for nodes that are composed mostly of text, including Text, CDATASection, and Comment. You will almost never meet an object whose only type is CharacterData. Rather, every such object is an instance of one of these three subinterfaces. Even so, you almost always work with text, CDATA section, and comment nodes through the methods of the CharacterData interface.

The following listing summarizes the CharacterData interface. This interface has methods that manipulate the text content of the node. As usual, it also inherits all the methods of its superinterface Node, such as getParentNode() and getNodeValue().

7 The CharacterData Interface
package org.w3c.dom;

public interface CharacterData extends Node {

  public String getData() throws DOMException;
  public void   setData(String data) throws DOMException;
  public int    getLength();
  public String substringData(int offset, int length)
   throws DOMException;
  public void   appendData(String data) throws DOMException;
  public void   insertData(int offset, String data)
   throws DOMException;
  public void   deleteData(int offset, int length)
   throws DOMException;
  public void   replaceData(int offset, int length, String data)
   throws DOMException;

}

The getData() method returns a String containing the complete content of the node. Any escaped characters such as &amp;amp; or &amp;#xA0; are replaced by the actual characters they represent. The setData() method replaces the entire text content of the node. There's no need to escape the string you pass to this method. If the document is written out to a file or a stream, then the serialization code is responsible for escaping these characters. In memory, the type of the object is enough to determine whether a less-than sign is the start of a tag or just a less-than sign.
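As a rough illustration of this point, the sketch below stores a string containing raw < and & characters with setData(), reads it back unchanged with getData(), and only sees escapes once the document is serialized. The class name SetDataDemo, the element name note, and the helper methods are made up for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;
import java.io.StringWriter;

// Hypothetical demo class; not part of the original example set
public class SetDataDemo {

  // Builds <note>raw</note> in memory; raw is stored unescaped
  private static Document build(String raw) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
     .newDocumentBuilder().newDocument();
    Element note = doc.createElement("note");
    doc.appendChild(note);
    Text text = doc.createTextNode("");
    note.appendChild(text);
    text.setData(raw); // no escaping needed here
    return doc;
  }

  // Returns what getData() reports after setData(raw)
  public static String roundTrip(String raw) throws Exception {
    Document doc = build(raw);
    Text text = (Text) doc.getDocumentElement().getFirstChild();
    return text.getData();
  }

  // Returns the serialized document; escaping happens at this stage
  public static String serialized(String raw) throws Exception {
    Transformer idTransform
     = TransformerFactory.newInstance().newTransformer();
    idTransform.setOutputProperty(
     OutputKeys.OMIT_XML_DECLARATION, "yes");
    StringWriter out = new StringWriter();
    idTransform.transform(
     new DOMSource(build(raw)), new StreamResult(out));
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    String raw = "1 < 2 & 3";
    System.out.println(roundTrip(raw));  // the raw characters
    System.out.println(serialized(raw)); // the escaped form
  }

}
```

The in-memory value keeps the literal characters. The identity transform is what introduces &amp;lt; and &amp;amp; on output.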

There are also methods to read and write only parts of the text content. The offsets are all zero based, as in Java's String class. For example, the following code fragment deletes the first six characters from the CharacterData object text:

text.deleteData(0, 6);
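To see the partial-content methods side by side, here is a small sketch that applies each of them to a detached text node. The class name PartialEditDemo, the helper newText(), and the sample string are assumptions made for the illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;

// Hypothetical demo class; not part of the original example set
public class PartialEditDemo {

  // Creates a detached text node holding the given string
  public static CharacterData newText(String s) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
     .newDocumentBuilder().newDocument();
    return doc.createTextNode(s);
  }

  // Applies each editing method once; comments show the data
  // after each step when the node starts as "Hello World"
  public static String edit(CharacterData text) {
    text.deleteData(0, 6);          // "World"
    text.insertData(0, "Brave ");   // "Brave World"
    text.replaceData(0, 5, "New");  // "New World"
    text.appendData("!");           // "New World!"
    return text.getData();
  }

  public static void main(String[] args) throws Exception {
    CharacterData text = newText("Hello World");
    System.out.println(edit(text));               // New World!
    System.out.println(text.substringData(4, 5)); // World
    System.out.println(text.getLength());         // 10
  }

}
```

All offsets count from zero, so substringData(4, 5) starts at the fifth char of the node's data.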

Java's String type is a very good match for DOM strings. Each char in a Java String is a single UTF-16 code unit. Most Unicode characters are represented by exactly one Java char. However, characters with code points greater than 65,535, such as many musical symbols, are represented by two chars each, one for each half of the surrogate pair representing the character in UTF-16. The getLength() method in this interface returns the number of UTF-16 code units, not the number of Unicode characters. This is also how the length() method in Java's String class behaves.
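The difference is easy to observe with a single character outside the Basic Multilingual Plane. The following sketch uses U+1D11E, the musical G clef, and compares getLength() with a count of code points. The class LengthDemo and its helper methods are invented for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;

// Hypothetical demo class; not part of the original example set
public class LengthDemo {

  // A text node containing only U+1D11E MUSICAL SYMBOL G CLEF,
  // written as its UTF-16 surrogate pair
  public static CharacterData clefNode() throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
     .newDocumentBuilder().newDocument();
    return doc.createTextNode("\uD834\uDD1E");
  }

  // Counts Unicode characters (code points) rather than chars
  public static int countCharacters(CharacterData node) {
    String data = node.getData();
    return data.codePointCount(0, data.length());
  }

  public static void main(String[] args) throws Exception {
    CharacterData clef = clefNode();
    System.out.println(clef.getLength());      // 2
    System.out.println(countCharacters(clef)); // 1
  }

}
```

If you need offsets that never split a surrogate pair, one option is to compute them with String methods such as offsetByCodePoints() before calling the CharacterData methods.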

On Usenet, jokes that some people are likely to find offensive are often obscured by rotating each letter of the alphabet 13 places. That is, the first letter of the alphabet, A, is transformed into the fourteenth letter of the alphabet, N. The second letter of the alphabet, B, is transformed into the fifteenth letter of the alphabet, O, and so forth through M, which becomes Z. Then N is transformed into A, O into B, and so on through Z, which becomes M. Lowercase letters are rotated the same way, and all other characters are left alone. It's not a particularly strong cipher, but it's enough to prevent people from accidentally reading something they don't want to read. It has the extra advantage of reversing itself. That is, running the cipher text through the rot-13 algorithm one more time restores the original text.

The following listing is a simple program that obscures text nodes, comments, and CDATA sections by applying the rot-13 algorithm to them. The encoded documents are as well-formed and valid as the original documents. Only the character data gets changed, not the markup.[4] This program can also decode documents that are already encoded.

[4] ROT13XML could also encode attribute values and processing instructions without affecting well-formedness or validity, but because DOM does not represent these nodes as instances of CharacterData, I leave this as an exercise for the reader.

8 Rot-13 Encoder for XML Documents
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.io.IOException;


public class ROT13XML {

  // note use of recursion
  public static void encode(Node node) {

    if (node instanceof CharacterData) {
      CharacterData text = (CharacterData) node;
      String data = text.getData();
      text.setData(rot13(data));
    }

    // recurse the children
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        encode(children.item(i));
      }
    }

  }

  public static String rot13(String s) {

    StringBuilder out = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      int c = s.charAt(i);
      if (c >= 'A' && c <= 'M') out.append((char) (c+13));
      else if (c >= 'N' && c <= 'Z') out.append((char) (c-13));
      else if (c >= 'a' && c <= 'm') out.append((char) (c+13));
      else if (c >= 'n' && c <= 'z') out.append((char) (c-13));
      else out.append((char) c);
    }
    return out.toString();

  }

  public static void main(String[] args) {

    if (args.length <= 0) {
      System.out.println("Usage: java ROT13XML URL");
      return;
    }

    String url = args[0];

    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      DocumentBuilder parser = factory.newDocumentBuilder();

      // Read the document
      Document document = parser.parse(url);

      // Modify the document
      ROT13XML.encode(document);

      // Write it out again
      TransformerFactory xformFactory
       = TransformerFactory.newInstance();
      Transformer idTransform = xformFactory.newTransformer();
      Source input = new DOMSource(document);
      Result output = new StreamResult(System.out);
      idTransform.transform(input, output);

    }
    catch (SAXException e) {
      System.out.println(url + " is not well-formed.");
    }
    catch (IOException e) {
      System.out.println(
      "Due to an IOException, the parser could not encode " + url
      );
    }
    catch (FactoryConfigurationError e) {
      System.out.println("Could not locate a factory class");
    }
    catch (ParserConfigurationException e) {
      System.out.println("Could not locate a JAXP parser");
    }
    catch (TransformerConfigurationException e) {
      System.out.println("Could not locate a TrAX transformer");
    }
    catch (TransformerException e) {
      System.out.println("Could not transform");
    }

  } // end main

}

The encode() method recursively descends the tree, applying the rot-13 algorithm to every CharacterData object it finds, whether a Comment, Text, or CDATASection. The algorithm itself is encapsulated in the rot13() method. Because both methods merely operate on their arguments but otherwise have no interaction with any state maintained in the class, I made them static. The main() method encodes a document at a URL typed on the command line, and then copies the result to System.out.
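For a quick check that does not depend on a file or URL, the sketch below parses a small document from a string, encodes it, and prints the value of each child of the root element. The class name ROT13Demo, the parse() and childValues() helpers, and the sample document are assumptions for illustration. The encode() and rot13() methods are repeated from the listing, in slightly condensed form, only so that the example compiles on its own.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import java.io.StringReader;

// Hypothetical demo class; not part of the original example set
public class ROT13Demo {

  // Same logic as ROT13XML.encode()
  public static void encode(Node node) {
    if (node instanceof CharacterData) {
      CharacterData text = (CharacterData) node;
      text.setData(rot13(text.getData()));
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      encode(children.item(i));
    }
  }

  // Same mapping as ROT13XML.rot13()
  public static String rot13(String s) {
    StringBuilder out = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c >= 'A' && c <= 'M' || c >= 'a' && c <= 'm') c += 13;
      else if (c >= 'N' && c <= 'Z' || c >= 'n' && c <= 'z') c -= 13;
      out.append(c);
    }
    return out.toString();
  }

  // Parses a document held in a string instead of at a URL
  public static Document parse(String xml) throws Exception {
    DocumentBuilder parser
     = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    return parser.parse(new InputSource(new StringReader(xml)));
  }

  // Joins the node values of the root element's children with "|"
  public static String childValues(Document doc) {
    NodeList children = doc.getDocumentElement().getChildNodes();
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < children.getLength(); i++) {
      if (i > 0) out.append('|');
      out.append(children.item(i).getNodeValue());
    }
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    Document doc = parse(
     "<joke><!--Hello--><![CDATA[World]]>Text</joke>");
    encode(doc);
    System.out.println(childValues(doc)); // Uryyb|Jbeyq|Grkg
    encode(doc);                          // a second pass decodes
    System.out.println(childValues(doc)); // Hello|World|Text
  }

}
```

The comment, the CDATA section, and the text node are all changed, while the element name joke is not. This assumes the default parser configuration, which keeps comments and does not merge CDATA sections into adjacent text nodes.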

Here's a joke encoded by this program. You'll have to run the program if you want to find out what it says. :-)

D:\books\XMLJAVA>java ROT13XML joke.xml 
<?xml version="1.0" encoding="utf-8"?><joke>
  Gubhfnaqf bs crbcyr nggraq gur Oheavat Zna srfgviny rirel lrne
  va Arinqn'f Oynpx Ebpx Qrfreg. Guvf vf gur ovt uvccvr srfgviny,
  jurer crbcyr eha nebhaq anxrq, qevax, naq trg fgbarq,
  be nf Trbetr J. Ohfu yvxrf gb pnyy vg,
  trg ernql gb eha sbe cerfvqrag
</joke>
