June 23, 2011, 1:37 a.m.
posted by reo
Substituting and Inserting Text
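The next thing we want to do with the parser is to customize it a bit, so you can see how to get information it usually ignores. But before we can do that, you're going to need to learn a few more important XML concepts. In this section, you'll learn about handling special characters, using an entity reference in an XML document, handling text with XML-style syntax, and handling CDATA and other characters.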
1 Handling Special Characters
In XML, an entity is an XML structure (or plain text) that has a name. Referencing the entity by name causes it to be inserted into the document in place of the entity reference. To create an entity reference, the entity name is surrounded by an ampersand and a semicolon, like this: &entityName;
Later, when you learn how to write a DTD, you'll see that you can define your own entities, so that &yourEntityName; expands to all the text you defined for that entity. For now, though, we'll focus on the predefined entities and character references that don't require any special definitions.
An entity reference like &amp; contains a name (in this case, "amp") between the start and end delimiters. The text it refers to (&) is substituted for the name, like a macro in a C or C++ program. The predefined entities for special characters are &amp; for the ampersand (&), &lt; for the left angle bracket (<), &gt; for the right angle bracket (>), &quot; for the double quote ("), and &apos; for the apostrophe (').
A character reference like &#8220; contains a hash mark (#) followed by a number. The number is the Unicode value for a single character, such as 65 for the letter "A", 8220 for the left curly quote, or 8221 for the right curly quote. In this case, the "name" of the entity is the hash mark followed by the digits that identify the character.
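To make the substitution concrete, here is a minimal sketch that parses a one-line document containing two predefined entities and one character reference, then prints the expanded text. The class name EntityExpansionDemo and the sample string are made up for illustration; the parser calls are the standard JAXP DOM API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityExpansionDemo {
    // Parses an XML string and returns the text content of its root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // &amp; and &lt; are predefined entities; &#65; is a character reference.
        String xml = "<item>R&amp;D &lt; plan &#65;</item>";
        System.out.println(rootText(xml)); // prints: R&D < plan A
    }
}
```

The application never sees the references themselves, only the characters they stand for.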
2 Using an Entity Reference in an XML Document
Suppose you wanted to insert a line like this in your XML document:
Market Size < predicted
The problem with putting that line into an XML file directly is that when the parser sees the left-angle bracket (<), it starts looking for a tag name, which throws off the parse. To get around that problem, you put &lt; in the file, instead of "<".
The results of the modifications below are contained in slideSample03.xml. (The browsable version is slideSample03-xml.html.) The results of processing it are shown in Echo07-03.
If you are following the programming tutorial, add the text highlighted below to your slideSample.xml file:
    <!-- OVERVIEW -->
    <slide type="all">
      <title>Overview</title>
      ...
    </slide>

    <slide type="exec">
      <title>Financial Forecast</title>
      <item>Market Size &lt; predicted</item>
      <item>Anticipated Penetration</item>
      <item>Expected Revenues</item>
      <item>Profit Margin</item>
    </slide>

    </slideshow>
When you run the Echo program on your XML file, you see the following output:
    ELEMENT: <item>
    CHARS:   Market Size
    CHARS:   <
    CHARS:   predicted
    END_ELM: </item>
The parser converted the reference into the entity it represents, and passed the entity to the application.
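The output above shows the text arriving in three separate CHARS events. A SAX parser is allowed to split character data across several characters() calls, so a handler should not assume it gets the whole string at once. The sketch below is not the tutorial's Echo program; it is a hypothetical handler named CharsCollector that simply appends every chunk it receives.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class CharsCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one run of text in several calls,
        // so append each chunk instead of expecting a single callback.
        text.append(ch, start, length);
    }

    // Parses the XML string and returns all character data it contains.
    static String collect(String xml) {
        try {
            CharsCollector handler = new CharsCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect("<item>Market Size &lt; predicted</item>"));
        // prints: Market Size < predicted
    }
}
```

How many chunks arrive, and where they are split, depends on the parser implementation; only the concatenated result is reliable.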
3 Handling Text with XML-Style Syntax
When you are handling large blocks of XML or HTML that include many of the special characters, it would be inconvenient to replace each of them with the appropriate entity reference. For those situations, you can use a CDATA section.
The results of the modifications below are contained in slideSample04.xml. (The browsable version is slideSample04-xml.html.) The results of processing it are shown in Echo07-04.
A CDATA section works like <pre>...</pre> in HTML, only more so: all whitespace in a CDATA section is significant, and characters in it are not interpreted as XML. A CDATA section starts with <![CDATA[ and ends with ]]>. Add the text highlighted below to your slideSample.xml file to define a CDATA section for a fictitious technical slide:
    ...
    <slide type="tech">
      <title>How it Works</title>
      <item>First we fozzle the frobmorten</item>
      <item>Then we framboze the staten</item>
      <item>Finally, we frenzle the fuznaten</item>
      <item><![CDATA[Diagram:

    frobmorten <-------------------------- fuznaten
        |                <3>                  ^
        | <1>                                 |   <1> = fozzle
        V                                     |   <2> = framboze
      Staten+                                     <3> = frenzle
                         <2>
      ]]></item>
    </slide>
    </slideshow>
When you run the Echo program on the new file, you see the following output:
    ELEMENT: <item>
    CHARS:   Diagram:

    frobmorten <-------------------------- fuznaten
        |                <3>                  ^
        | <1>                                 |   <1> = fozzle
        V                                     |   <2> = framboze
      Staten+                                     <3> = frenzle
                         <2>

    END_ELM: </item>
You can see here that the text in the CDATA section arrived as one entirely uninterpreted character string.
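The same behavior can be checked with a much smaller input. The following sketch assumes a hypothetical class named CdataDemo and uses the standard JAXP DOM API to read back the content of a CDATA section that contains angle brackets, an ampersand, and a line break.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataDemo {
    // Parses an XML string and returns the text content of its root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // Inside CDATA, "<", ">" and "&" are ordinary characters,
        // and the line break and indentation are kept as written.
        String xml = "<item><![CDATA[<1> = fozzle\n  <2> & <3>]]></item>";
        System.out.println(rootText(xml));
    }
}
```

Without the CDATA wrapper, the same content would be rejected as malformed, because the parser would try to read <1> as a tag.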
4 Handling CDATA and Other Characters
The existence of CDATA makes the proper echoing of XML a bit tricky. If the text to be output is not in a CDATA section, then any angle brackets, ampersands, and other special characters in the text should be replaced with the appropriate entity reference. (Replacing left angle brackets and ampersands is most important; other characters will be interpreted properly without misleading the parser.)
But if the output text is in a CDATA section, then the substitutions should not occur, so that the output looks like the text in the example above. In a simple program like our Echo application, it's not a big deal. But many XML-filtering applications will want to keep track of whether the text appears in a CDATA section, in order to treat special characters properly.
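One way to keep track of this in SAX is the LexicalHandler interface, which this tutorial covers later. As a rough sketch only, the hypothetical CdataAwareEcho handler below sets a flag between startCDATA() and endCDATA(), escapes left angle brackets and ampersands outside CDATA sections, and copies characters unchanged inside them. It echoes text only, not element tags, and it assumes the parser supports the standard lexical-handler property.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

class CdataAwareEcho extends DefaultHandler implements LexicalHandler {
    private final StringBuilder out = new StringBuilder();
    private boolean inCdata = false;

    @Override public void startCDATA() { inCdata = true; out.append("<![CDATA["); }
    @Override public void endCDATA() { inCdata = false; out.append("]]>"); }

    @Override
    public void characters(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            if (inCdata) {
                out.append(c);            // CDATA content is copied as-is
            } else if (c == '<') {
                out.append("&lt;");       // escape outside CDATA
            } else if (c == '&') {
                out.append("&amp;");
            } else {
                out.append(c);
            }
        }
    }

    // The remaining LexicalHandler callbacks are not needed for this sketch.
    @Override public void startDTD(String name, String publicId, String systemId) { }
    @Override public void endDTD() { }
    @Override public void startEntity(String name) { }
    @Override public void endEntity(String name) { }
    @Override public void comment(char[] ch, int start, int length) { }

    // Parses the XML string and returns its text, re-escaped for output.
    static String echoText(String xml) {
        try {
            CdataAwareEcho handler = new CdataAwareEcho();
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(echoText("<item>a &lt; b <![CDATA[<1> & <2>]]></item>"));
        // prints: a &lt; b <![CDATA[<1> & <2>]]>
    }
}
```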
One other area to watch for is attributes. The text of an attribute value could also contain angle brackets and ampersands that need to be replaced by entity references. (Attribute text can never be in a CDATA section, though, so there is never any question about doing that substitution.)
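For attribute values, the substitution might look like the sketch below. The class and method names are hypothetical, and the code assumes the attribute will be written inside double quotes, which is why it also replaces the double-quote character.

```java
class AttributeEscaper {
    // Escapes a value for use inside a double-quoted XML attribute.
    static String escapeAttribute(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String title = "R&D <\"draft\">";
        System.out.println("<slide title=\"" + escapeAttribute(title) + "\"/>");
        // prints: <slide title="R&amp;D &lt;&quot;draft&quot;&gt;"/>
    }
}
```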
Later in this tutorial, you will see how to use a LexicalHandler to find out whether or not you are processing a CDATA section. Next, though, you will see how to define a DTD.