XML Basics for Reading Your Documents

XML is a language used to contain and describe data. In the case of Office Open XML, the data is your document content, and the description includes the settings required for that document to function in the applicable program as well as the settings you apply to the document.

Before you begin to explore a document’s XML, read the subsections that follow for a bit of background and basics to help you prepare for the task.

Reading a Markup Language

XML is a markup language. Just as you mark up a document while reviewing it (with comments, corrections, or margin notes), a markup language marks up data with descriptive information about that data.

If you’ve ever looked at the HTML source code of a Web page, you already have some experience with the type of language you’ll see throughout this primer. However, instead of paired formatting codes wrapped around text that you see in HTML (such as <b>text</b> to turn bold formatting on and then off), the Office Open XML Formats use paired codes nested in a hierarchy that compartmentalizes, organizes, and defines everything you need to know about your document.

The following example shows the word “text” along with its formatting definition. This word is part of a paragraph but is separated out in the source code (the markup) because it contains unique formatting. The bullets that follow the code sample explain in detail how to read this sample.

<w:r>
   <w:rPr>
      <w:b />
   </w:rPr>
   <w:t>text</w:t>
</w:r>

  • The w: that begins each line indicates that this information is describing an Office Word 2007 document. You will see different prefixes in your Microsoft Office Excel 2007 and Microsoft Office PowerPoint 2007 documents. Also notice that each code is surrounded by angle brackets (<>).

  • As with HTML source code, XML code used to describe content is usually paired, and the second of the pair (the end code) begins with a slash character.

  • The section of code shown in the preceding sample is known as a run, noted by the w:r that introduces the first line of code. A run is a region of document content that contains similar properties.

    To complete the structure, the entire content of the paragraph to which the word “text” belongs is stored between the two ends of a higher-level paired code, not shown here, that indicates the start and end of the paragraph (<w:p> and </w:p>). The collection of paragraphs (and any other content) in the body of the document is in turn positioned within another paired code (<w:body> and </w:body>).

  • The second and fourth lines in the sample make up a paired code containing the formatting for the specified text. Notice that between those lines, the third line (<w:b />) simply indicates that the specified text is bold.

    Because formatting information in Office Open XML is stored in a structure that defines where the formatting is to be applied, the specific formatting itself doesn’t need a paired code. If the text for this sample were also italicized, for example, the code <w:i /> would appear on its own line, also between the lines of the same paired code that contains the bold statement. Also notice that, because the bold (or italicized) statements stand on their own, they include a slash at the end of the single code to indicate that there is no end code for this statement. You’ll see the slash at the end of other codes throughout this primer, wherever the item is not paired.

  • The specified paragraph text appears on the fifth line, between a pair of codes (<w:t> and </w:t>) that indicate it’s the text being described.

  • The last line in the preceding example is the end code that indicates the end of the description for this specified text.
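The reading steps in the preceding bullets can also be sketched in code. The following Python snippet is not part of the original sample. It parses the same run with the standard library's xml.etree.ElementTree module and reports the text and whether the bold or italic statement is present. It assumes the run carries the namespace declaration for the w: prefix (in a real document part, that declaration sits on the root element), and the helper name describe_run is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace that the w: prefix stands for in WordprocessingML parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# The sample run, with the namespace declared so that it parses on its own.
RUN_XML = (
    f'<w:r xmlns:w="{W_NS}">'
    "<w:rPr><w:b /></w:rPr>"
    "<w:t>text</w:t>"
    "</w:r>"
)


def describe_run(run_xml):
    """Return the run's text and whether <w:b /> or <w:i /> appears in <w:rPr>."""
    run = ET.fromstring(run_xml)
    props = run.find("w:rPr", NS)  # the paired code that holds the formatting
    text = run.findtext("w:t", default="", namespaces=NS)
    return {
        "text": text,
        # Simplification: only the presence of the element is checked.
        "bold": props is not None and props.find("w:b", NS) is not None,
        "italic": props is not None and props.find("w:i", NS) is not None,
    }


if __name__ == "__main__":
    print(describe_run(RUN_XML))
```

This is a minimal sketch rather than a complete reader. It treats the presence of <w:b /> as bold and ignores any attribute on that code that might switch the formatting off.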

If the preceding example seems to be quite a lot of work for one word, don’t lose heart. It’s just an example of how you see Word formatting applied to text in the XML markup, used here to demonstrate how code in the Office Open XML Formats is spelled out. Though it also serves to demonstrate why working in the XML wouldn’t be considered an equal alternative to the built-in program features for many document editing needs, that’s not the reason for this example. Understanding how to read XML structure will help you work more easily when you begin to use a document’s XML in ways that can simplify your work and expand the possibilities.

Don’t worry about trying to memorize any specific codes used in the preceding example. The important thing to take away from this is the general concept of how the XML code is structured. Everything in XML is organized and spelled out, like driving directions that take no turn for granted. So, though the example given might seem like a lot of code for very little content, the fact that it’s organized explicitly is the very thing that will make the tasks throughout this primer easy to understand even to those who are new to XML.

Note 

If you look at the markup for one of your own documents, you may see code similar to the preceding example along with additional codes labeled w:rsidR and w:rsidRPr, each followed by a set of numbers. Those codes and their corresponding numbers are a result of the feature Store Random Number To Improve Combine Accuracy, which you can find on the Privacy Options tab of the Trust Center.

Unless you intend to use the Combine feature (available from the Compare options on the Review tab) with a particular document, there’s no benefit to enabling this option (but it is on by default). For the sake of simplicity, since these codes are not essential to your documents, they’re not included in any XML samples throughout this chapter.

Understanding Key Terms

I’ll introduce terms as they arise for each task, but there are a few terms that can be useful to note up front.

  • The Office Open XML Formats are actually compressed folders containing a set of files that work together. ZIP technology (the .zip file extension) is the method used to compress the files into a single unit, and the set of files that comprise an Office Open XML Format document is referred to as the ZIP package.

  • Each file within the package is referred to as a document part.

  • When you read about XML, you often come across the word schema. An XML schema is a set of standards and rules that define a given XML structure. For example, multiple schemas are available for defining different components of Office Open XML, and you’ll see reference to some of these in the document parts used for the tasks throughout this chapter. Anyone can freely use the schemas for the Office Open XML formats. Developers can also create their own custom schemas for custom document solutions. (Note, however, that creating schemas is an advanced XML skill that is beyond the scope of this chapter.)

    On the Resources tab of this book’s CD, find the schema for customizing the user interface in the 2007 release programs that use the Office Open XML Formats. You can open this file, named customUI.xsd, in the Microsoft Windows utility program Notepad to view its content and give yourself an idea of the type of information contained in an XML schema.
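To make the ideas of the ZIP package and its document parts more concrete, here is a minimal Python sketch that uses the standard library's zipfile module to list the parts in a package and read one of them. The function names are hypothetical. The demo package is built in memory purely for illustration and is not a valid Word document, and the part names shown are typical of a Word package rather than taken from this chapter.

```python
import io
import zipfile


def list_parts(package):
    """Return the names of the document parts stored in a ZIP package."""
    with zipfile.ZipFile(package) as zf:
        return zf.namelist()


def read_part(package, part_name):
    """Return one document part as text (assumes the part is UTF-8 XML)."""
    with zipfile.ZipFile(package) as zf:
        return zf.read(part_name).decode("utf-8")


def build_demo_package():
    """Build a tiny in-memory ZIP that mimics a package layout.

    Illustration only: this is NOT a valid Word document.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types />")
        zf.writestr("word/document.xml", "<document />")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo = build_demo_package()
    print(list_parts(demo))
    demo.seek(0)
    print(read_part(demo, "word/document.xml"))
```

To try this on one of your own documents, pass the path of a copy of the file to list_parts instead of the demo package. Working on a copy keeps the original safe while you explore.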

XML Editing Options

Most professional developers use Microsoft Visual Studio for editing XML, but you certainly don’t need to do that. You can use Notepad for the same purpose, or any of a wide range of programs from Microsoft Office SharePoint Designer 2007 to a number of freeware, shareware, and retail XML editors.

Many people who don’t need a professional development platform for their work will use a freeware or shareware XML editor to see the XML hierarchy in a tree structure that’s easy to read. When you edit XML in Notepad, it typically looks like running text with no manual line breaks.

If you don’t want to install another program for this purpose, you can use Microsoft Internet Explorer to view the XML in a hierarchical tree structure and easily find what you need, and then use Notepad to edit the XML. This is the approach I use for the examples throughout this primer.

Note 

Find a link to the download page for XML Notepad 2007, a free XML editing tool from Microsoft, on the Resources tab of this book’s CD. XML Notepad provides both an editor and a viewer, along with features such as drag-and-drop editing and error checking. However, using the editor in XML Notepad requires some knowledge of XML language structure. So, for those who are seeing XML for the first time in this chapter, start with the Windows Notepad utility and consider moving up to XML Notepad once you get your bearings, if you find yourself yearning for a more structured editing environment.

That said, even if you’re not using XML Notepad 2007 regularly to edit your code, it can be a handy tool for understanding the structure of your code and troubleshooting syntax errors, as discussed later in this chapter. So, you might want to download it sooner rather than later.

When you open an XML file in Internet Explorer, you’re likely to see a bar across the top of the screen indicating that active content was disabled. Right-click that bar and activate content to be able to expand and collapse sections of your code by using the minus signs you see beside each level of code that contains sublevels. For example, here’s what the code shown earlier looks like when viewed in Internet Explorer.

-<w:r>
 -<w:rPr>
   <w:b />
  </w:rPr>
  <w:t>text</w:t>
 </w:r>

The same text in Notepad looks like this:

<w:r><w:rPr><w:b/></w:rPr><w:t>text</w:t></w:r>
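
If you'd like an indented view of that single line without opening a browser, the following Python sketch produces one with the standard library's xml.dom.minidom module. It isn't part of the approach used in this primer, the function name indented_view is hypothetical, and the sample assumes the namespace declaration for the w: prefix is present so that the fragment parses on its own.

```python
import xml.dom.minidom

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The single-line form, with the namespace declared so that it parses on its own.
FLAT_XML = (
    f'<w:r xmlns:w="{W_NS}"><w:rPr><w:b/></w:rPr><w:t>text</w:t></w:r>'
)


def indented_view(xml_text, indent="  "):
    """Return an indented copy of the XML for reading, not for saving back."""
    dom = xml.dom.minidom.parseString(xml_text)
    return dom.toprettyxml(indent=indent)


if __name__ == "__main__":
    print(indented_view(FLAT_XML))
```

Treat the indented copy as a reading aid only and make your edits in the original file, because the added line breaks and indentation aren't part of the document's markup.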

Troubleshooting

The document won’t open after I edit an XML file, but I know my code is correct.

Remember that a small syntax error (such as leaving off one of the angle brackets around a code) in one XML file within a document can cause that document to be unreadable. However, if you know that the code you typed is correct, there may be another reason that’s just as easy to resolve.

Some XML editors that display the XML code in an easily readable tree structure may add formatting marks (such as tabs or line breaks) when you add code to that XML structure. When this happens, these formatting marks can be interpreted as a syntax error (just like a missing bracket) and cause the document to which that XML file belongs to become unreadable in its native program.

If you don’t know how to recognize unwanted formatting marks in your XML editor or if the file won’t open in your XML editor, see “Using XML Notepad and Word to Help Find Syntax Errors” on page 1163 for steps to help you locate the error.
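
As a quick first check before you turn to those steps, a parser can tell you whether a piece of XML is well formed and roughly where a problem sits. The following Python sketch uses the standard library's xml.etree.ElementTree module. The function name check_well_formed is hypothetical, and the samples use codes without the w: prefix so that they parse without a namespace declaration. A check like this catches syntax-level problems such as a missing angle bracket, but it can't tell you whether the program will accept markup that is well formed.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if the XML parses, else a (line, column, message) tuple."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


if __name__ == "__main__":
    good = "<r><rPr><b /></rPr><t>text</t></r>"
    # Same markup with the closing bracket of </rPr> left off.
    bad = "<r><rPr><b /></rPr<t>text</t></r>"
    print(check_well_formed(good))
    print(check_well_formed(bad))
```

The line and column in the result point to where the parser gave up, which is usually at or just after the actual mistake.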
