Substituting XML Entities

Problem

You've parsed a document that contains internal XML entities. You want to replace the entity references in the document with their values.

Solution

To perform entity substitution on a specific text node, call its value method. If it's the first text node of its parent, you can call text on the parent instead.

Here's a simple document that defines two entities and uses both in a single text node. We can replace those entity references with their values without changing the document itself:

	require 'rexml/document'

	str = %{<?xml version="1.0"?>
	<!DOCTYPE doc [
	  <!ENTITY product 'Stargaze'>
	  <!ENTITY version '2.3'>
	]>
	<doc>
	 &product; v&version; is the most advanced astronomy product on the market.
	</doc>}
	doc = REXML::Document.new str

	doc.root.children[0].value
	# => "\n Stargaze v2.3 is the most advanced astronomy product on the market.\n"
	doc.root.text
	# => "\n Stargaze v2.3 is the most advanced astronomy product on the market.\n"

	doc.root.children[0].to_s
	# => "\n &product; v&version; is the most advanced astronomy product on the market.\n"
	doc.root.write
	# <doc>
	#  &product; v&version; is the most advanced astronomy product on the market.
	# </doc>

Discussion

Internal XML entities are often used to factor out data that changes a lot, like dates or version numbers. But REXML only provides a convenient way to perform substitution on a single text node. What if you want to perform substitutions throughout the entire document?

When you call Document#write to send a document to some IO object, it ends up calling Text#to_s on each text node. As seen in the Solution, this method presents a "normalized" view of the data, one where entities are displayed instead of having their values substituted in.
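The difference between the two views is easy to check by capturing the output of Document#write in memory. The following is a minimal sketch, assuming a small hypothetical document modeled on the one in the Solution; the names sample, buffer, and written are illustrative only.

```ruby
require 'rexml/document'
require 'stringio'

# Hypothetical sample document, modeled on the one in the Solution.
xml = <<~XML
  <!DOCTYPE doc [
    <!ENTITY product 'Stargaze'>
    <!ENTITY version '2.3'>
  ]>
  <doc>&product; v&version; is out.</doc>
XML
sample = REXML::Document.new(xml)

# Capture the output of Document#write in memory instead of printing it.
buffer = StringIO.new
sample.write(buffer)
written = buffer.string

text_node = sample.root.children[0]

# The serialized document still contains the entity references...
written.include?('&product; v&version; is out.')  # => true
text_node.to_s                                    # => "&product; v&version; is out."

# ...while Text#value substitutes the entity values.
text_node.value                                   # => "Stargaze v2.3 is out."
```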

We could write our own version of Document#write that presents an "unnormalized" view of the document, one with entity values substituted in, but that would be a lot of work. We could hack Text#to_s to work more like Text#value, or hack Text#write to call the value method instead of to_s. But it's less intrusive to do the entity replacement outside of the write method altogether. Here's a class that wraps any IO object and performs entity replacement on all the text that comes through it:

	require 'delegate'
	require 'rexml/text'
	class EntitySubstituter < DelegateClass(IO)
	  def initialize(io, document, filter=nil)
	    @document = document
	    @filter = filter
	    super(io)
	  end

	  def <<(s)
	    super(REXML::Text::unnormalize(s, @document.doctype, @filter))
	  end
	end

	output = EntitySubstituter.new($stdout, doc)
	doc.write(output)
	# <?xml version='1.0'?><!DOCTYPE doc [
	# <!ENTITY product "Stargaze">
	# <!ENTITY version "2.3">
	# ]>
	# <doc>
	#  Stargaze v2.3 is the most advanced astronomy product on the market.
	# </doc>

Because it processes the entire output of Document#write, this code will replace all entity references in the document. This includes any references found in attribute values, which may or may not be what you want.
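The filter argument that EntitySubstituter passes through to REXML::Text.unnormalize offers some control over which references get replaced. The sketch below calls unnormalize directly on a hypothetical line of serialized output. It assumes that filter is an array of entity names to leave unexpanded, which is how REXML's implementation appears to treat it; the variable names are illustrative.

```ruby
require 'rexml/document'

# Hypothetical doctype that defines the same two entities as the Solution.
xml = <<~XML
  <!DOCTYPE doc [
    <!ENTITY product 'Stargaze'>
    <!ENTITY version '2.3'>
  ]>
  <doc/>
XML
doctype = REXML::Document.new(xml).doctype

# A line of serialized XML with a reference inside an attribute value.
line = "<doc title='&product;'>&product; v&version;</doc>"

# No filter: every reference is expanded, including the one in the attribute.
expanded = REXML::Text.unnormalize(line, doctype)
# => "<doc title='Stargaze'>Stargaze v2.3</doc>"

# Assumed filter semantics: the names listed here are left as references.
partial = REXML::Text.unnormalize(line, doctype, ['version'])
# => "<doc title='Stargaze'>Stargaze v&version;</doc>"
```

Because unnormalize works on plain strings, it cannot tell attribute values from text content. Leaving attributes alone would mean substituting per node rather than on the serialized output.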

If you create a Text object manually, or set the value of an existing object, REXML assumes that you're giving it unnormalized text, and normalizes it. This can be problematic if your text contains strings that happen to be the values of entities:

	text_node = doc.root.children[0]
	text_node.value = "&product; v&version; has a catalogue of 2.3 " +
	                  "million celestial objects."

	doc.write
	# <?xml version='1.0'?><!DOCTYPE doc [
	# <!ENTITY product "Stargaze">
	# <!ENTITY version "2.3">
	# ]>
	# <doc>&product; v&version; has a catalogue of &version; million celestial objects.</doc>

To avoid this, you can mark the text node as "raw":

	text_node.raw = true
	doc.write
	# <?xml version='1.0'?><!DOCTYPE doc [
	# <!ENTITY product "Stargaze">
	# <!ENTITY version "2.3">
	# ]>
	# <doc>&product; v&version; has a catalogue of 2.3 million celestial objects.</doc>

	text_node.value
	# => "Stargaze v2.3 has a catalogue of 2.3 million celestial objects."
	text_node.to_s
	# => "&product; v&version; has a catalogue of 2.3 million celestial objects."

In addition to entities you define, REXML automatically processes five named character entities: the ones for left and right angle brackets, single and double quotes, and the ampersand. Each is replaced with the corresponding ASCII character.

	str = %{
	  <!DOCTYPE doc [ <!ENTITY year '2006'> ]>
	  <doc>&copy; &year; Komodo Dragon &amp; Bob Productions</doc>
	}

	doc = REXML::Document.new str
	text_node = doc.root.children[0]

	text_node.value
	# => "&copy; 2006 Komodo Dragon & Bob Productions"
	text_node.to_s
	# => "&copy; &year; Komodo Dragon &amp; Bob Productions"

"&copy;" is an HTML character entity representing the copyright symbol, but REXML doesn't know that. It only knows about the five XML character entities. Also, REXML only knows about internal entities: ones whose values are defined within the same document that uses them. It won't resolve external entities.
