March 14, 2011, 2:21 a.m.
posted by factorial
XML Schema seeks to improve upon DTDs by adding more typing and quite a few more constructs than DTDs, as well as using XML as the constraint representation format. I'm going to spend relatively little time here talking about schemas, because they are a behind-the-scenes detail for Java and XML. In the chapters where you'll be working with schemas, I'll address any specific points you need to be aware of. However, the specification for XML Schema is so enormous that it would take an entire book to explain on its own. As a matter of fact, XML Schema by Eric van der Vlist (O'Reilly) is just that: an entire book on XML Schema.
XML Schema Definitions
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:dw="http://www.ibm.com/developerWorks/" elementFormDefault="unqualified" attributeFormDefault="unqualified" version="4.0">
There's quite a bit going on here, including two different namespace declarations. First, the XML Schema namespace itself is attached to the xsd prefix, allowing separation of XML Schema constructs from the elements and attributes being constrained. Next, the dw namespace is defined; this particular example is from the IBM DeveloperWorks XML article template, and dw is used for DeveloperWorks-specific constructs.
Then, the values of attributeFormDefault and elementFormDefault are set to "unqualified". This means that locally declared elements and attributes appear in instance documents unqualified (that is, without a namespace). Qualification is a fairly tricky idea, largely because attributes in XML do not fall into the default namespace; they must explicitly be assigned to a namespace. For a lot more on qualification, check out the relevant portion of the XML Schema specification at http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html#element-schema.
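As a rough illustration of what "unqualified" means in practice, here is a small sketch. The namespace URI http://example.com/demo and the book and title elements are made up for this example; they are not part of the IBM schema. It assumes a schema that declares a target namespace, in which case only globally declared elements belong to that namespace, while locally declared ones stay in no namespace:

```xml
<!-- File 1: hypothetical schema (names and namespace URI are illustrative only) -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.com/demo"
            elementFormDefault="unqualified">
  <!-- Global declaration: always in the target namespace -->
  <xsd:element name="book">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Local declaration: in no namespace, because the form is unqualified -->
        <xsd:element name="title" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

<!-- File 2: a matching instance; book is prefixed, title is not -->
<d:book xmlns:d="http://example.com/demo">
  <title>Java and XML</title>
</d:book>
```

Note that the instance uses a prefix rather than a default namespace. If it declared xmlns="http://example.com/demo" on book, the nested title would inherit that namespace and no longer match the unqualified local declaration.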
Finally, the version attribute is given a value of "4.0". This is used to indicate the version of this particular schema, not of the XML Schema specification being used. The namespace assigned to the xsd prefix, http://www.w3.org/2001/XMLSchema, is actually the indicator as to which schema spec is being used, rather than an explicit version attribute.
Elements and attributes
Elements are defined with the element construct, which names the element through its name attribute. You'll generally need to define your own data types by nesting a complexType tag within the element element. For example, here's an element definition from IBM's schema; this particular fragment constrains the code element:
<xsd:element name="code">
  <xsd:annotation>
    <xsd:documentation xml:lang="en">
      <title>Define a code listing</title>
      <desc>The stylesheet allows code to be inline or section. The contents of this element are displayed in a monospaced font, with all whitespace preserved from the original XML source.</desc>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:complexType mixed="true">
    <xsd:choice minOccurs="0" maxOccurs="unbounded">
      <xsd:element ref="a"/>
      <xsd:element ref="b"/>
      <xsd:element ref="br"/>
      <xsd:element ref="font"/>
      <xsd:element ref="heading"/>
      <xsd:element ref="i"/>
      <xsd:element ref="sub"/>
      <xsd:element ref="sup"/>
      <xsd:group ref="specialCharacters"/>
    </xsd:choice>
    <xsd:attribute name="type" type="inline" use="required">
      <xsd:annotation>
        <xsd:documentation xml:lang="en">
          <desc>The type of code listing.</desc>
        </xsd:documentation>
      </xsd:annotation>
    </xsd:attribute>
    <xsd:attribute name="width">
      <xsd:annotation>
        <xsd:documentation xml:lang="en">
          <desc>The width in characters of this code listing.</desc>
        </xsd:documentation>
      </xsd:annotation>
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>
complexType simply informs the schema parser that the element is not a predefined schema type, like string or integer. Setting the mixed attribute to true lets the schema parser know that the code element can have textual content, as well as nested elements. The default value for mixed is false; you have to explicitly specify when an element has both text and subelements.
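To make the effect of mixed concrete, here is a minimal sketch with a made-up note element (the note and em names are hypothetical, chosen only for illustration). The first fragment is the declaration; the second is an instance that is valid only because mixed is set to true:

```xml
<!-- Hypothetical declaration: a note that mixes text with em children -->
<xsd:element name="note">
  <xsd:complexType mixed="true">
    <xsd:sequence>
      <xsd:element name="em" type="xsd:string" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

<!-- Instance: character data appears alongside the em element -->
<note>Save your work <em>before</em> you close the editor.</note>
```

With mixed left at its default of false, the same instance would be rejected, since only the em children (and whitespace between them) would be allowed.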
Next, choice is used to supply a selection of subelements. If you used sequence instead and just listed the elements, the order would matter (elements must appear in the order that they are declared in the schema). But, by using a repeatable choice, order becomes unimportant. The choice itself may occur any number of times, including not at all (minOccurs="0" and maxOccurs="unbounded" take care of this). This effectively allows any number of any of these elements to appear, in any order. Each element referenced (using ref) must have a definition somewhere else in the schema, and that definition may have its own complexType, referencing other elements.
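The difference between an ordered content model and a repeating choice can be sketched side by side. The name, first, last, para, b, and i elements below are hypothetical and declared locally to keep the fragment short; sequence is the XML Schema construct for children that must appear in declared order:

```xml
<!-- Ordered: exactly one first, followed by exactly one last -->
<xsd:element name="name">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="first" type="xsd:string"/>
      <xsd:element name="last" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

<!-- Unordered and repeatable: any number of b and i children, in any order -->
<xsd:element name="para">
  <xsd:complexType mixed="true">
    <xsd:choice minOccurs="0" maxOccurs="unbounded">
      <xsd:element name="b" type="xsd:string"/>
      <xsd:element name="i" type="xsd:string"/>
    </xsd:choice>
  </xsd:complexType>
</xsd:element>
```

Under these declarations, a name element with last before first is invalid, while a para element may contain i, b, i, or nothing at all.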
<xsd:element name="text-data" type="xsd:string" />
Extending base types
You'll often want the simplicity of a simple type but the flexibility of XML Schema's more advanced constraints. For example, if you were defining a colorname element, you would probably want it as a simple string:
<xsd:element name="colorname" type="xsd:string" />
But, you can use XML Schema's enumeration feature to ensure only certain colors are allowed. In these cases, you derive a new type from the base type; since you're actually restricting the base type of string, rather than expanding on it, you'd use the restriction keyword instead of extension:
<xsd:simpleType name="colorname">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="blue" />
    <xsd:enumeration value="green" />
    <xsd:enumeration value="red" />
  </xsd:restriction>
</xsd:simpleType>
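One way to tie the restricted type to the colorname element is to nest it anonymously inside the element declaration. This is a sketch of that option, wrapped in a minimal schema with no target namespace (an assumption made here for simplicity), followed by one instance a validator should accept and one it should reject:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Same restriction, declared inline rather than as a named type -->
  <xsd:element name="colorname">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="blue"/>
        <xsd:enumeration value="green"/>
        <xsd:enumeration value="red"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>
</xsd:schema>

<!-- Accepted: green is one of the enumerated values -->
<colorname>green</colorname>

<!-- Rejected: purple is not in the enumeration -->
<colorname>purple</colorname>
```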
<xsd:element name="title">
  <xsd:complexType>
    <xsd:simpleContent>
      <xsd:extension base="xsd:string">
        <xsd:attribute name="isbn" type="xsd:string"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>
</xsd:element>
Here, the title element is based on a simple string, but adds an additional attribute (isbn, also a string).
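For reference, instances matching that title declaration might look like the following. The isbn value is a placeholder, not a real ISBN:

```xml
<!-- Text content is the string value; isbn is the added attribute (placeholder value) -->
<title isbn="0-000-00000-0">Java and XML</title>

<!-- Also valid: isbn is optional because the declaration does not say use="required" -->
<title>Java and XML</title>
```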
Although I've barely scratched the surface of XML Schema, this should at least give you a rough idea of the major constructs; it's certainly enough to get you through this book without too much trouble.
Generating XML Schemas from Instance Documents
relaxer -xsd toc.xml
You'll get an XSD file (in this case, toc.xsd). For the Eclipse table of contents, the resulting schema is shown in Figure.
The XML Schema generated by Relaxer automatically assigns a no-URL namespace as the default, if none is specified in the instance document
Compare this to Figure, and you begin to see how verbose XML Schema really is! As in the case of autogeneration of DTDs, the more instance documents you can supply to Relaxer, the more accurate the resulting XML Schema.
Generating XML Schemas from a DTD
As the XML community moves away from DTDs to either XML Schema or RELAX NG, you'll need to convert many of your DTDs to a new constraint model. The DTD2XS tool at http://www.lumrix.net/xmlfreeware.php is perfect for just this use case. Download the tool, and expand it to somewhere easily added to your classpath (like /usr/local/java/dtdxs). On Unix/Linux/Mac OS X:
and on Windows:
Unfortunately, you have to copy the complextype.xsl file, from the DTD2XS distribution, into the directory you're working from (or always convert from the dtdxs directory, which is equally inconvenient).
Now just give the tool a DTD to convert:
[bmclaugh] java dtd2xsd toc.dtd > toc-schema.xsd
dtd2xs: dtdURI file:////Users/bmclaugh/Documents/O'Reilly/Writing/Java and XML 3rd/subs/code/ch02/toc.dtd
dtd2xs: resolveEntities true
dtd2xs: ignoreComments true
dtd2xs: commentLength 100
dtd2xs: commentLanguage null
dtd2xs: conceptHighlight 2
dtd2xs: conceptOccurrence 1
dtd2xs: conceptRelation element attribute
dtd2xs: load DTD ... done
dtd2xs: remove comments from DTD ... done
dtd2xs: DOM translation ... ... done
dtd2xs: complextype.xsl ... done
dtd2xs: add namespace ... done
The resulting XML Schema is shown in Figure.
The output from DTD2XS isn't the prettiest you'll ever see, but it usually gets the job done just fine
Validating XML Against an XML Schema
Finally, you should be able to validate your documents against an XML Schema (without resorting to programming, which is detailed in later chapters). As in "Validating XML Against a DTD," xmllint does the trick. First, you need to reference your schema in your instance document, which works quite a bit differently from using a DOCTYPE definition.
Referencing a schema for nonnamespaced documents
<dw-document xsi:noNamespaceSchemaLocation="dw-document-4.0.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
dw-document is the root element here, and it defines the xsi namespace. You should always use the same URI for this declaration (http://www.w3.org/2001/XMLSchema-instance), as that's what schema-aware parsers are expecting.
Since there is no namespace being constrained, use the noNamespaceSchemaLocation attribute to indicate where to find the XML Schema (again, used to constrain all portions of the document not in a namespace).
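A complete, if tiny, pairing may help here. The file names greeting.xsd and greeting.xml and the greeting element are hypothetical; the sketch assumes both files sit in the same directory so the relative location resolves:

```xml
<!-- greeting.xsd (hypothetical): no targetNamespace, so it constrains no-namespace elements -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="greeting" type="xsd:string"/>
</xsd:schema>

<!-- greeting.xml (hypothetical): points at the schema with a relative location -->
<greeting xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:noNamespaceSchemaLocation="greeting.xsd">Hello</greeting>
```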
Referencing a schema for namespaced documents
<dw-document xmlns="http://www.ibm.com/developerWorks" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ibm.com/developerWorks dw-document-4.0.xsd">
schemaLocation is used, instead of noNamespaceSchemaLocation, and its value is a pair of items separated by whitespace (a space or a line break both work). The first item is the namespace to constrain, and the second is the schema location.
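The namespaced case can be sketched the same way. Everything named here (memo.xsd, memo.xml, the memo and body elements, and the http://example.com/memo URI) is made up for illustration. The key assumption is that the schema's targetNamespace is the same URI that appears first in the schemaLocation value:

```xml
<!-- memo.xsd (hypothetical): targetNamespace matches the first schemaLocation item -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.com/memo"
            elementFormDefault="qualified">
  <xsd:element name="memo">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="body" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

<!-- memo.xml (hypothetical): namespace URI first, then the schema location -->
<memo xmlns="http://example.com/memo"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://example.com/memo memo.xsd">
  <body>Meeting moved to Tuesday.</body>
</memo>
```

This sketch uses elementFormDefault="qualified" so that body, which inherits the default namespace in the instance, matches its local declaration.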
Validating against a schema
[bmclaugh] xmllint --schema dw-document-4.0.xsd index.xml --noout
index.xml validates