Item 9. Distinguish Text from Markup

All legal text characters that can appear anywhere in an XML document can appear in #PCDATA. This includes characters like < and & that may have to be escaped with character or entity references. When an API presents the content of a node containing such a character to your code, it will give you the actual character, not the escaping text. Similarly, when you create such a node, the string you use should contain the actual character, not the entity or character reference.
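
As a minimal sketch of this behavior, assume Python's standard xml.etree.ElementTree module stands in for whatever API you use. The formula element and its content are invented for illustration.

```python
import xml.etree.ElementTree as ET

# In the serialized document, < and & must be escaped because they are
# reserved for markup.
source = "<formula>a &lt; b &amp;&amp; b &lt; c</formula>"
formula = ET.fromstring(source)

# The parser hands your code the actual characters, not the references.
content = formula.text
print(content)  # a < b && b < c
```

The string your code sees contains no ampersand-delimited references at all; they exist only in the serialized form.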

Consider the following DocBook programlisting element. A CDATA section is used to embed a literal sequence of XML text.

<programlisting><![CDATA[<value>
  <double>28657</double>
</value>]]></programlisting>

Everything inside the CDATA section is content, not markup. The content of this programlisting element is the text shown below.

<value>
  <double>28657</double>
</value>

A CDATA section is not required for this trick to work. For instance, consider the following variation of the above element.

<programlisting>&lt;value&gt;
  &lt;double&gt;28657&lt;/double&gt;
&lt;/value&gt;</programlisting>

The content of this element is exactly the same.

<value>
  <double>28657</double>
</value>

In this case the entity references &lt; and &gt; are markup that the parser resolves to produce the text < and >. However, that's just syntactic sugar. It does not affect the content in any way.
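
One way to check that the two forms are equivalent is to parse both and compare what the parser reports. This is only a sketch, again assuming Python's xml.etree.ElementTree. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The same listing escaped two ways: once with a CDATA section,
# once with entity references.
with_cdata = ET.fromstring(
    "<programlisting><![CDATA[<value>\n"
    "  <double>28657</double>\n"
    "</value>]]></programlisting>"
)
with_references = ET.fromstring(
    "<programlisting>&lt;value&gt;\n"
    "  &lt;double&gt;28657&lt;/double&gt;\n"
    "&lt;/value&gt;</programlisting>"
)

# Both elements carry identical text content...
same_text = with_cdata.text == with_references.text
# ...and neither has any child elements: the "tags" are just characters.
child_count = len(with_cdata) + len(with_references)
print(same_text, child_count)  # True 0
```

Once parsed, nothing in this API records which escaping mechanism the document used.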

Now consider the reverse problem. Suppose you're creating an XML document in something at least a little more XML-aware than a text editor. Possibilities include:

  • A tree-based editor like <Oxygen/> or XMLSPY

  • A WYSIWYG application like OpenOffice Writer or Apple's Keynote that saves its data into XML

  • A programming API such as DOM, JDOM, or XOM

In all cases, the creating tool will provide separate means to insert markup and text. The tool is responsible for escaping any reserved characters like <, >, or & when it saves the document. You do not need to do this. Indeed, if you try to pass something like &lt;double&gt;28657&lt;/double&gt; into a method that expects to get plain text, it will actually save something like &amp;lt;double&amp;gt;28657&amp;lt;/double&amp;gt;.
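
The double-escaping problem can be reproduced in a few lines. The sketch below assumes Python's xml.etree.ElementTree as the creating tool. Other APIs differ in detail but behave the same way in this respect.

```python
import xml.etree.ElementTree as ET

listing = ET.Element("programlisting")

# Mistake: the string passed as text has already been escaped by hand.
listing.text = "&lt;double&gt;28657&lt;/double&gt;"
double_escaped = ET.tostring(listing, encoding="unicode")
print(double_escaped)
# <programlisting>&amp;lt;double&amp;gt;28657&amp;lt;/double&amp;gt;</programlisting>

# Correct: pass the actual characters and let the serializer escape them.
listing.text = "<double>28657</double>"
escaped_once = ET.tostring(listing, encoding="unicode")
print(escaped_once)
# <programlisting>&lt;double&gt;28657&lt;/double&gt;</programlisting>
```

A reader of the first document would see the literal characters &lt;double&gt; rather than <double>, which is almost never what was intended.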

Similarly, you cannot type <double>28657</double> into a user interface widget that creates text and expect it to create an element. If you try it, in the serialized document you will get something like &lt;double&gt;28657&lt;/double&gt;. Instead, you should use the user interface widget or method call designed for creating a new element.
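
The difference between the two calls might look like this in code. This is a sketch assuming Python's xml.etree.ElementTree, where assigning to the text attribute creates text and SubElement creates an element.

```python
import xml.etree.ElementTree as ET

# Assigning a tag-like string as text creates text, not an element.
as_text = ET.Element("value")
as_text.text = "<double>28657</double>"
text_xml = ET.tostring(as_text, encoding="unicode")
print(text_xml)    # <value>&lt;double&gt;28657&lt;/double&gt;</value>

# To create a real child element, use the call designed for that purpose.
as_markup = ET.Element("value")
double = ET.SubElement(as_markup, "double")
double.text = "28657"
markup_xml = ET.tostring(as_markup, encoding="unicode")
print(markup_xml)  # <value><double>28657</double></value>
```

The first value element has no children and one long text node. The second has a single double child whose text is 28657.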

The key thing to remember is this: Just because something looks like an XML tag does not always mean it is an XML tag. Context matters. XML documents are made of markup that sometimes surrounds PCDATA, but that's the limit of the nesting. You can put PCDATA inside markup, and you can put markup inside markup, but you can't put markup inside PCDATA. CDATA sections are just an alternative means of escaping text. They are not a way to embed markup inside PCDATA.

