XML Overview

While you may already be familiar with XML, it is important to understand XML concepts from the point of view of applications handling XML documents. With this knowledge, you are in a better position to judge the impact of your design decisions on the implementation and performance of your XML-based applications.

Essentially, XML is a markup language that enables hierarchical data content, such as the content held in programming language data structures, to be represented as a marked-up text document. As a markup language, XML uses tags to mark pieces of data. Each tag attempts to assign meaning to the data associated with it; that is, to transform the data into information. If you know SGML (Standard Generalized Markup Language) and HTML (HyperText Markup Language), then XML will look familiar to you. XML is derived from SGML (it is a simplified subset of SGML) and also bears some resemblance to HTML, which is an application of SGML. But unlike HTML, XML focuses on representing data rather than end-user presentation. While XML aims to separate data from presentation, the end-user presentation of XML data is nevertheless addressed in rich and various ways by additional XML-based technologies.
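To make the idea of representing a data structure as marked-up text concrete, here is a minimal Python sketch that uses only the standard library. The `order` structure, its field names, and the rule that every list entry becomes an `<item>` element are assumptions made for illustration; they are not a mapping prescribed by XML itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory data structure; the field names are illustrative only.
order = {
    "customer": "Alice",
    "items": [
        {"sku": "A-100", "quantity": 2},
        {"sku": "B-200", "quantity": 1},
    ],
}

def to_element(tag, value):
    """Recursively turn dicts, lists, and scalars into an element tree."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        # Each dictionary key becomes a child element with that name.
        for key, child in value.items():
            element.append(to_element(key, child))
    elif isinstance(value, list):
        # Assumption: each list entry becomes an <item> child element.
        for child in value:
            element.append(to_element("item", child))
    else:
        # Scalars become character data.
        element.text = str(value)
    return element

root = to_element("order", order)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Note that the tags carry the names of the fields but not their types: once serialized, the quantity `2` is just the text "2".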

Although XML documents are not primarily intended to be read by users, the XML specification clearly states as one of its goals that "XML documents should be human-legible and reasonably clear." This legibility characteristic contributed to XML's adoption. XML supports both computer and human communications, and it ensures openness, transparency, and platform-independence compared to a binary format.

A grammar along with its vocabulary (also called a schema in the generic sense of the word) defines the set of tags and their nesting (the tag structure) that are allowed or expected in an XML document. In addition, a schema can be specific to a particular domain, and domain-specific schemas are sometimes referred to as markup vocabularies. The Document Type Definition (DTD) syntax, which is part of the core XML specification, allows for the definition of domain-specific schemas and gives XML its "eXtensible" capability. Over time, the number of these XML vocabularies or XML-based languages has grown steadily, and this extensibility is a key factor in XML's success. In particular, XML and its vocabularies are becoming the lingua franca of business-to-business (B2B) communication.

In sum, XML is a metalanguage used to define other markup languages. While tags help to describe XML documents, they are not sufficient, even when carefully chosen, to make a document completely self-describing. Schemas written as DTDs, or in some other schema language such as the W3C XML Schema Definition (XSD), improve the descriptiveness of XML documents since they may define a document's syntax or exact structure. But even with the type systems introduced by modern schema languages, it is usually necessary to accompany an XML schema with specification documents that describe the domain-specific semantics of the various XML tags. These specifications are intended for application developers and others who create and process the XML documents. Schemas are necessary for specifying and validating the structure and, to some extent, the content of XML documents. Even so, developers must ultimately build the XML schema's tag semantics into the applications that produce and consume the documents. However, thanks to the well-defined XML markup scheme, intermediary applications such as document routers can still handle documents partially or in a generic way without knowing the complete domain-specific semantics of the documents.
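The point about intermediaries can be sketched in a few lines of Python. The router below looks only at the name of the root element and never interprets the rest of the document. The routing table, the queue names, and the element names are hypothetical, and the sketch assumes documents that do not use namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical routing table: root element name -> destination queue.
ROUTES = {
    "purchaseOrder": "orders-queue",
    "invoice": "billing-queue",
}

def route(document_text):
    """Pick a destination from the root element name alone.

    The body of the document is parsed (so it must be well formed) but its
    domain-specific meaning is never examined.
    """
    root = ET.fromstring(document_text)  # raises ParseError if not well formed
    return ROUTES.get(root.tag, "dead-letter-queue")

print(route("<invoice><total currency='USD'>42.00</total></invoice>"))
print(route("<shippingNotice/>"))
```

Only the applications at either end of the exchange need to know what an invoice total means; the router needs nothing beyond the markup itself.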

The handling of the following XML document concepts may have a significant impact on the design and performance of an XML-based application:

  • Well-formedness— An XML document needs to be well formed to be parsed. A well-formed XML document conforms to XML syntax rules and constraints, such as:

    • The document must contain exactly one root element, and all other elements are descendants of this root element.

    • All markup tags must be balanced; that is, each element must have a start and an end tag (or be written as a single empty-element tag, such as <item/>).

    • Elements may be nested but they must not overlap.

    • All attribute values must be in quotes.

  • Validity— According to the XML specification, an XML document is considered valid if it has an associated DTD declaration and it complies with the constraints expressed in the DTD. To be valid, an XML document must meet the following criteria:

    • Be well-formed

    • Refer to an accessible DTD-based schema using a Document Type Declaration: <!DOCTYPE>

    • Conform to the referenced DTD

    With the emergence of new schema languages, the notion of validity has been extended beyond the original specification to non-DTD-based schema languages, such as XSD. With these schema languages, the XML document need not refer explicitly to its schema; it may contain only a hint identifying the schema to which it conforms. The application is then responsible for enabling validation of the document. Regardless of any hints, an application may still force validation of the document against a particular schema. (See "Validating XML Documents" on page 139.)

  • Logical and physical forms— An XML document has one logical form that may be laid out in any number of physical forms. A physical form represents the document's storage layout. It consists of storage units called entities, which contain either parsed or unparsed data. Parsed entities are invoked by name using entity references. When the document is parsed, each reference is replaced by the contents of the entity, and this replacement text becomes an integral part of the document. The logical form is the entire document regardless of its physical or storage layout.

    An XML processor, in the course of processing a document, may need to find the content of an external entity—this process is called entity resolution. The XML processor may know some identifying information about the external entity, such as its name, system, or public identifier (in the form of a URI: URL or URN), and so forth, which it can use to determine the actual location of the entity. When performing entity resolution, the XML processor maps the known identifying information to the actual location of the entity. This mapping information may be accessible through an entity resolution catalog.
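The well-formedness rules and the replacement of parsed entities listed above can be observed with a short Python sketch based on the standard library parser. The sample documents and the `company` entity are made up for illustration. The sketch covers only an internal parsed entity; validation against a DTD and resolution of external entities are not shown, and this parser does not validate.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One sample per rule; only the first one is well formed.
samples = {
    "balanced, single root": "<a><b/></a>",
    "two root elements": "<a/><b/>",
    "missing end tag": "<a><b></a>",
    "overlapping elements": "<a><b></a></b>",
    "unquoted attribute value": "<a id=1/>",
}
for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")

# A parsed entity declared in the internal subset of the document type
# declaration. The reference &company; is replaced by the entity's contents
# during parsing, so the application only sees the replacement text.
doc = """<!DOCTYPE note [
  <!ENTITY company "Example Corp">
]>
<note>Sent by &company;.</note>"""
note_text = ET.fromstring(doc).text
print(note_text)
```

Physically, the note consists of the document entity plus the `company` entity; logically, it is a single `note` element whose content is one sentence.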

1 Document Type and W3C XML Schema Definitions

Originally, the Document Type Definition (DTD) syntax, which is part of the core XML 1.0 specification and became a recommendation in 1998, allowed for the definition of domain-specific schemas. However, with the growth in the adoption of XML (particularly in the B2B area), it became clear that the DTD syntax had some limitations. DTD's limitations are:

  • It uses a syntax of its own that is not itself XML, so a DTD cannot be written or processed like an ordinary XML document.

  • It does not support namespaces. However, with some cleverness, it is possible to create namespace-aware DTD schemas.

  • It cannot express data types. With DTD, attribute values and character data in elements are considered to be text (or character strings).

To address these shortcomings, the W3C defined the XML Schema Definition language (XSD), which became an official W3C recommendation in 2001. XSD addresses some of the shortcomings of DTD, as do other schema languages, such as RELAX NG. In particular, XSD:

  • Is itself an application of XML based on the XML specification— An XML schema can be written and manipulated just like an XML document.

  • Supports namespaces— By supporting namespaces, XSD allows for modular schema design and permits the composition of XSD schema definitions. It particularly solves the problem of conflicting tag names, which can often occur with modularization.

  • Supports data types— XSD provides a type system that supports type derivation and restriction, in addition to supporting various built-in simple types, such as integer, float, date, and time.
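Because an XSD schema is itself an XML document, it can be read with the same tools as any other XML document. The Python sketch below parses a small, hypothetical schema fragment with the standard library and lists the declared elements together with their built-in types. It only inspects the schema; it does not validate any document against it, which would require a schema-aware validator that the Python standard library does not provide.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema fragment; the element names are illustrative only.
schema_text = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="lineItem">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="quantity" type="xs:integer"/>
        <xs:element name="unitPrice" type="xs:float"/>
        <xs:element name="shipDate" type="xs:date"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = ET.fromstring(schema_text)

# Collect every element declaration that names a type directly.
declared = {
    el.get("name"): el.get("type")
    for el in schema.iter(f"{{{XSD_NS}}}element")
    if el.get("type") is not None
}
print(declared)
```

A DTD could declare the same three child elements, but it could only describe their content as character data; it could not state that `quantity` is an integer or that `shipDate` is a date.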

The following convention applies to the rest of the chapter: The noun "schema" or "XML schema" designates the grammar or schema to which an XML document must conform and is used regardless of the actual schema language (DTD, XSD, and so forth). Note: While XSD plays a major role in Web services, Web services may still have to deal with DTD-based schemas for legacy reasons.

As an additional convention, we use the word "serialization" to refer to XML serialization and deserialization. We explicitly refer to Java serialization when referring to serialization supported by the Java programming language. Also note that we may use the terms "marshalling" and "unmarshalling" as synonyms for XML serialization and deserialization. This is the same terminology used by XML data-binding technologies such as JAXB.

2 XML Horizontal and Vertical Schemas

XML schemas, which are applications of the XML language, may apply XML to horizontal or vertical domains. Horizontal domains are cross-industry domains, while vertical domains are specific to types of industries. Specific XML schemas have been developed for these different types of domains, and these horizontal and vertical applications of XML usually define publicly available schemas.

Many schemas have been established for horizontal domains; that is, they address issues that are common across many industries. For example, W3C specifications define such horizontal domain XML schemas or applications as Extensible HyperText Markup Language (XHTML), Scalable Vector Graphics (SVG), Mathematical Markup Language (MathML), Synchronized Multimedia Integration Language (SMIL), Resource Description Framework (RDF), and so forth.

Likewise, there are numerous vertical domain XML schemas. These schemas or applications of XML define standards that extend or apply XML to a vertical domain, such as e-commerce. Typically, groups of companies in an industry develop these standards. Some examples of e-commerce XML standards are Electronic Business with XML (ebXML), Commerce XML (cXML), Common Business Language (CBL), and Universal Business Language (UBL).

When designing an enterprise application, developers often define their own custom schemas. These custom schemas may be kept private within the enterprise, shared only with those partners that intend to exchange data with the application, or publicly exposed. Such custom or application-specific schemas are either defined from scratch or, where appropriate, built by reusing existing horizontal or vertical schema components. Note that publishing schemas in order to share them among partners can be implemented in various ways, including publishing the schemas along with Web service descriptions on a registry (see "Publishing a Web Service" on page 101).

3 Other Specifications Related to XML

For those interested in exploring further, here is a partial list of the many specifications that relate to XML.

  • Document Object Model (DOM)— The Document Object Model is an interface, both platform and language neutral, that lets programs and scripts dynamically process XML documents. Using DOM, programs can access and update the content, structure, and style of documents.

  • XPath— The XPath specification defines an expression language for navigating and processing an XML source document, including how to locate elements in an XML document.

  • eXtensible Stylesheet Language Transformations (XSLT)— This specification, which is one part of the eXtensible Stylesheet Language (XSL), describes how to transform XML documents into other XML formats as well as into non-XML formats.

  • Namespaces— This specification describes how to associate a URI with tags, elements, attribute names, and data types in an XML document, to resolve ambiguity when elements and attributes have the same names.

  • XML Information Set— This specification, often referred to as Infoset, provides the definitions for information in XML documents that are considered well formed according to the Namespaces criteria.

  • Canonical XML— This specification addresses how to resolve the syntactic variations permitted by the XML 1.0 and Namespaces specifications in order to create the physical, canonical representation of an XML document.
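Several of the specifications listed above can be tried out with the Python standard library, as in the following sketch. The namespace URIs and element names are hypothetical. Two limitations are worth stating: `xml.etree.ElementTree` supports only a subset of XPath, and `ET.canonicalize` (available from Python 3.8) implements the Canonical XML 2.0 variant.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical namespace URIs and element names. Both vocabularies define a
# "title" element; the namespaces keep the two apart.
doc = """<order xmlns="urn:example:orders" xmlns:inv="urn:example:inventory">
  <title>Spring restock</title>
  <inv:part><inv:title>Widget</inv:title></inv:part>
</order>"""

# Namespaces and XPath: path expressions locate elements, and the prefix
# mapping states which "title" is meant.
root = ET.fromstring(doc)
ns = {"o": "urn:example:orders", "inv": "urn:example:inventory"}
order_title = root.find("o:title", ns).text
part_title = root.find("inv:part/inv:title", ns).text
print(order_title, "/", part_title)

# DOM: the document is loaded as a tree of nodes that can be read and updated.
dom = minidom.parseString(doc)
dom.documentElement.setAttribute("status", "open")
status = dom.documentElement.getAttribute("status")
print(status)

# Canonical XML: two documents that differ only in quoting, attribute order,
# and empty-element syntax have the same canonical form.
first = ET.canonicalize("<item b='2' a=\"1\"/>")
second = ET.canonicalize('<item a="1"   b="2"></item>')
print(first == second)
```

The canonical form matters mainly when documents are compared or digitally signed, since those operations work on the exact sequence of characters rather than on the logical content.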

