XML Schema

XML Schema seeks to improve upon DTDs by adding richer data typing and quite a few more constructs, as well as by using XML itself as the constraint representation format. I'm going to spend relatively little time here talking about schemas, because they are a behind-the-scenes detail for Java and XML. In the chapters where you'll be working with schemas, I'll address any specific points you need to be aware of. However, the specification for XML Schema is so enormous that it would take an entire book to explain on its own. As a matter of fact, XML Schema by Eric van der Vlist (O'Reilly) is just that: an entire book on XML Schema.

XML Schema Definitions

Before getting into the actual schema constructs, take a look at a typical XML Schema root element:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
    elementFormDefault="unqualified"
    attributeFormDefault="unqualified" version="4.0">

There's quite a bit going on here, including two different namespace declarations. First, the XML Schema namespace itself is attached to the xsd prefix, allowing separation of XML Schema constructs from the elements and attributes being constrained. Next, the dw namespace is defined; this particular example is from the IBM DeveloperWorks XML article template, and dw is used for DeveloperWorks-specific constructs.

Then, the values of attributeFormDefault and elementFormDefault are set to "unqualified". This allows locally declared elements and attributes to appear in XML instance documents without being namespace-qualified. Qualification is a fairly tricky idea, largely because attributes in XML do not fall into the default namespace; they must explicitly be assigned to a namespace. For a lot more on qualification, check out the relevant portion of the XML Schema specification at http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html#element-schema.
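
To make qualification a little more concrete, here's a minimal sketch (it is not part of the IBM schema). The http://example.com/books namespace, the book and title elements, and the isbn value are all made up for illustration. With elementFormDefault set to "unqualified", only the globally declared book element is namespace-qualified in an instance document; the locally declared title element appears with no namespace at all:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.com/books"
            elementFormDefault="unqualified"
            attributeFormDefault="unqualified">
  <!-- Global element: always belongs to the target namespace -->
  <xsd:element name="book">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Local element: unqualified, so it belongs to no namespace -->
        <xsd:element name="title" type="xsd:string"/>
      </xsd:sequence>
      <!-- Local attribute: unqualified, so it is written without a prefix -->
      <xsd:attribute name="isbn" type="xsd:string"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

A matching instance document would then look something like this:

```xml
<b:book xmlns:b="http://example.com/books" isbn="placeholder-isbn">
  <!-- title is a local element, so it carries no namespace -->
  <title>Java and XML</title>
</b:book>
```

Note that declaring http://example.com/books as the default namespace on book would pull title into that namespace as well, and the document would no longer validate against this schema.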

Finally, the version attribute is given a value of "4.0". This is used to indicate the version of this particular schema, not of the XML Schema specification being used. The namespace assigned to the xsd prefix, http://www.w3.org/2001/XMLSchema, is actually the indicator as to which schema spec is being used, rather than an explicit version attribute.

Elements and attributes

Elements are defined with the element construct, which names the element through its name attribute. You'll generally need to define your own data types by nesting a complexType tag within element. For example, here's an element definition from IBM's schema; this particular fragment constrains the code element:

<xsd:element name="code">
  <xsd:annotation>
    <xsd:documentation xml:lang="en">
      <title>Define a code listing</title>
      <desc>The stylesheet allows code to be inline or section.  The contents of this element are displayed in a monospaced font, with all whitespace preserved from the original XML source.</desc>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:complexType mixed="true">
    <xsd:choice minOccurs="0" maxOccurs="unbounded">
      <xsd:element ref="a"/>
      <xsd:element ref="b"/>
      <xsd:element ref="br"/>
      <xsd:element ref="font"/>
      <xsd:element ref="heading"/>
      <xsd:element ref="i"/>
      <xsd:element ref="sub"/>
      <xsd:element ref="sup"/>
      <xsd:group ref="specialCharacters"/>
    </xsd:choice>
    <xsd:attribute name="type" type="inline" use="required">
      <xsd:annotation>
        <xsd:documentation xml:lang="en">
          <desc>The type of code listing.</desc>
        </xsd:documentation>
      </xsd:annotation>
    </xsd:attribute>
    <xsd:attribute name="width">
      <xsd:annotation>
        <xsd:documentation xml:lang="en">
          <desc>The width in characters of this code listing.</desc>
        </xsd:documentation>
      </xsd:annotation>
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>

In this case, the element's name (code) is supplied, and then annotation is used to provide some basic commenting and documentation.

annotation is notoriously underused. Consider yourself lucky when you get an XML Schema as well documented as this example is. I've removed annotations from the remaining examples, just to save space and add some clarity (albeit while losing documentation).

complexType simply informs the schema parser that the element is not a predefined schema type, like string or integer. Setting the mixed attribute to true lets the schema parser know that the code element can have textual content, as well as nested elements. The default value for mixed is false; you have to explicitly specify when an element has both text and subelements.
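
Here's a small sketch of what mixed content means in practice. The note element and its optional b child are hypothetical names chosen just for this illustration; they don't come from IBM's schema:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="note">
    <!-- mixed="true" permits character data alongside the child elements -->
    <xsd:complexType mixed="true">
      <xsd:sequence>
        <xsd:element name="b" type="xsd:string" minOccurs="0"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

An instance such as the following is valid against that sketch, because text is allowed to surround the b element:

```xml
<note>Remember to <b>validate</b> your documents.</note>
```

With mixed left at its default of false, the same instance would be rejected, since the text around b would not be allowed.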

Next, choice is used to supply a selection of subelements. If you use sequence instead of choice, the order matters (elements must appear in the order that they are declared in the schema). But, by using a repeating choice, order becomes unimportant. Here, the choice itself may occur zero or more times (minOccurs="0" and maxOccurs="unbounded" take care of this), which effectively allows any number of any of these elements to appear, in any order. Each element referenced (using ref) must have a definition somewhere else in the schema, and that definition may have its own complexType, referencing other elements.
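
As a rough sketch of the difference, the following schema contrasts an ordered list of elements (written with XML Schema's sequence construct) with a repeating choice. The ordered and any-order element names are invented for this example, and a and b are declared here as plain strings purely to keep the sketch self-contained:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Global declarations that the ref attributes below point to -->
  <xsd:element name="a" type="xsd:string"/>
  <xsd:element name="b" type="xsd:string"/>

  <!-- sequence: exactly one a, followed by exactly one b -->
  <xsd:element name="ordered">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="a"/>
        <xsd:element ref="b"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <!-- repeating choice: zero or more a and b elements, in any order -->
  <xsd:element name="any-order">
    <xsd:complexType>
      <xsd:choice minOccurs="0" maxOccurs="unbounded">
        <xsd:element ref="a"/>
        <xsd:element ref="b"/>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Under this sketch, an any-order element containing b, then a, then another a is valid, while ordered accepts only a single a followed by a single b.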

Finally, the type and width attributes are defined and annotated, using the attribute keyword. So, there should be two things to take away from this definition:

  • Once you get the basic constructs in your head, it's fairly easy to read an XML Schema.

  • Even the definition of very simple elements is verbose; you'll rarely see an XML Schema that's fewer than several hundred lines.

Simple types

If you have a so-called "simple type," you can avoid the complexType construct altogether:

<xsd:element name="text-data" type="xsd:string" />

Extending base types

You'll often want the simplicity of a simple type but the flexibility of XML Schema's more advanced constraints. For example, if you were defining a colorname element, you would probably want it as a simple string:

<xsd:element name="colorname" type="xsd:string" />

But you can use XML Schema's enumeration feature to ensure only certain colors are allowed. In these cases, you have to derive a new type from the base type; since you're actually restricting the base type of string, rather than expanding on it, you use the restriction keyword:

<xsd:simpleType name="colorname">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="blue" />
    <xsd:enumeration value="green" />
    <xsd:enumeration value="red" />
  </xsd:restriction>
</xsd:simpleType>
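
To see a restricted type in use, here's a minimal, self-contained sketch. Wiring the colorname type to an element of the same name is my own assumption about how you might use it:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="colorname">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="blue"/>
      <xsd:enumeration value="green"/>
      <xsd:enumeration value="red"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Element and type names live in separate symbol spaces,
       so both can be called colorname -->
  <xsd:element name="colorname" type="colorname"/>
</xsd:schema>
```

Against this sketch, an instance containing `<colorname>green</colorname>` is valid, while `<colorname>purple</colorname>` is rejected because purple is not one of the enumerated values.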

On the other hand, extension is used when you're taking a base type and adding to it:

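<xsd:element name="title">
  <xsd:complexType>
    <xsd:simpleContent>
      <xsd:extension base="xsd:string">
        <xsd:attribute name="isbn" type="xsd:string"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>
</xsd:element>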

Here, the title element is based on a simple string, but adds an additional attribute (isbn, also a string).
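
An instance of this element might look like the following sketch; the attribute value is just a placeholder, not a real ISBN:

```xml
<!-- String content comes from the base type; isbn is the attribute added by the extension -->
<title isbn="placeholder-isbn">Java and XML</title>
```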

For Java programmers, the distinction between extension and restriction is not as obvious; we're used to extending (even if the subclass ends up adding restrictions to types it might accept). In XML Schema, you use restriction to further constrain a type, and you use extension to broaden a type.

Although I've barely scratched the surface of XML Schema, this should at least give you a rough idea of the major constructs; it's certainly enough to get you through this book without too much trouble.

Generating XML Schemas from Instance Documents

You already know about Relaxer from the previous section "Generating DTDs from XML Instance Documents." The same tool works with XML Schemas, using the -xsd option:

relaxer -xsd toc.xml

You'll get an XSD file (in this case, toc.xsd). For the Eclipse table of contents, the resulting schema is shown in Figure.

The XML Schema generated by Relaxer automatically assigns a no-URL namespace as the default, if none is specified in the instance document

<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns=""
            xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="toc" type="toc"/>
  <xsd:complexType name="toc">
    <xsd:sequence>
      <xsd:element maxOccurs="unbounded" minOccurs="1" 
                   name="topic" type="topic"/>
    </xsd:sequence>
    <xsd:attribute name="label" type="xsd:token"/>
  </xsd:complexType>
  <xsd:complexType name="topic">
    <xsd:sequence>
      <xsd:element name="link" type="link"/>
    </xsd:sequence>
    <xsd:attribute name="href" type="xsd:token"/>
    <xsd:attribute name="label" type="xsd:token"/>
  </xsd:complexType>
  <xsd:complexType name="link">
    <xsd:attribute name="toc" type="xsd:token"/>
  </xsd:complexType>
</xsd:schema>

Compare this to Figure, and you begin to see how verbose XML Schema really is! As in the case of autogeneration of DTDs, the more instance documents you can supply to Relaxer, the more accurate the resulting XML Schema.

Generating XML Schemas from a DTD

As the XML community moves away from DTDs to either XML Schema or RELAX NG, you'll need to convert many of your DTDs to a new constraint model. The DTD2XS tool at http://www.lumrix.net/xmlfreeware.php is perfect for just this use case. Download the tool, and expand it to somewhere easily added to your classpath (like /usr/local/java/dtd2xs). On Unix/Linux/Mac OS X:

export CLASSPATH=$CLASSPATH:/usr/local/java/dtd2xs

and on Windows:

set CLASSPATH=%CLASSPATH%;c:\java\dtd2xs

Unfortunately, you have to copy the complextype.xsl file, from the DTD2XS distribution, into the directory you're working from (or always convert from the dtd2xs directory, which is equally inconvenient).

Now just give the tool a DTD to convert:

[bmclaugh] java dtd2xsd toc.dtd > toc-schema.xsd

dtd2xs: dtdURI file:////Users/bmclaugh/Documents/O'Reilly/Writing/Java and XML 3rd/subs/code/ch02/toc.dtd
dtd2xs: resolveEntities true
dtd2xs: ignoreComments true
dtd2xs: commentLength 100
dtd2xs: commentLanguage null
dtd2xs: conceptHighlight 2
dtd2xs: conceptOccurrence 1
dtd2xs: conceptRelation element attribute
dtd2xs: load DTD ... done
dtd2xs: remove comments from DTD ... done
dtd2xs: DOM translation ...
... done
dtd2xs: complextype.xsl ... done
dtd2xs: add namespace ... done

The name of the tool is DTD2XS, but the Java class to execute is dtd2xsd. That's another slightly confusing aspect of the tool, but it's workable, and that's what's important.

The resulting XML Schema is shown in Figure.

The output from DTD2XS isn't the prettiest you'll ever see, but it usually gets the job done just fine

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="topic">
<xs:complexType>
<xs:sequence minOccurs="0">
<xs:element ref="link"/>
</xs:sequence>
<xs:attribute name="href" type="xs:string"/>
<xs:attribute name="label" type="xs:string" use="required"/>
</xs:complexType>
</xs:element>
<xs:element name="toc">
<xs:complexType>
<xs:sequence>
<xs:element maxOccurs="unbounded" ref="topic"/>
</xs:sequence>
<xs:attribute name="label" type="xs:string" use="required"/>
</xs:complexType>
</xs:element>
<xs:element name="link">
<xs:complexType>
<xs:attribute name="toc" type="xs:string" use="required"/>
</xs:complexType>
</xs:element>
</xs:schema>

Validating XML Against an XML Schema

Finally, you should be able to validate your documents against an XML Schema (without resorting to programming, which is detailed in later chapters). As in "Validating XML Against a DTD," xmllint does the trick. First, though, you need to reference your schema in your instance document, which works quite differently from using a DOCTYPE definition.

Referencing a schema for nonnamespaced documents

If you're not using namespaces in the instance document, here's what you'd use:

<dw-document xsi:noNamespaceSchemaLocation="dw-document-4.0.xsd"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

You can use URLs, like http://www.ibm.com/xsd/dw-document-4.0.xsd, as well as local references, when pointing to an XML Schema.

dw-document is the root element here, and it defines the xsi namespace. You should always use the same URI for this declaration (http://www.w3.org/2001/XMLSchema-instance), as that's what schema-aware parsers are expecting.

Some parsers ignore the URI, while others check against it. In either case, it's better to just use the same (correct) URI in every document, and not worry about it.

Since there is no namespace being constrained, use the noNamespaceSchemaLocation attribute to indicate where to find the XML Schema (again, used to constrain all portions of the document not in a namespace).

Referencing a schema for namespaced documents

If you are using namespaces, you'll need to pair each namespace with a schema to validate against:

<dw-document xmlns="http://www.ibm.com/developerWorks"

schemaLocation is used, instead of noNamespaceSchemaLocation, and it takes two arguments (separated by a space; that space appears as a line break in the printed book). The first value is the namespace to constrain, and the second is the schema location.

The XML Schema specification allows for multiple pairs of namespace URI/schema URI combinations, although that becomes difficult to accurately represent in a fixed-margin book.
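
With room to spread out, a schemaLocation value holding more than one pair might look like the sketch below. Treat the specifics as assumptions: I'm reusing dw-document-4.0.xsd as the schema location for the DeveloperWorks namespace, and the second namespace (http://example.com/extras), its ex prefix, and extras.xsd are purely hypothetical:

```xml
<dw-document xmlns="http://www.ibm.com/developerWorks"
             xmlns:ex="http://example.com/extras"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.ibm.com/developerWorks dw-document-4.0.xsd
                                 http://example.com/extras extras.xsd">
  <!-- document content omitted -->
</dw-document>
```

Each pair is a namespace URI followed by the location of the schema for that namespace; any whitespace, including a line break, separates the entries.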

Validating against a schema

Now invoke xmllint with the --schema option:

[bmclaugh] xmllint --schema dw-document-4.0.xsd index.xml --noout
index.xml validates

When validation fails, xmllint reports the errors it finds, which makes them easy to track down and fix.
