March 21, 2011, 11:42 p.m.
posted by pitbull
The Growing Importance of XML
You can see from the list of new data-oriented features in version 2.0 of the .NET Framework that there is considerable emphasis on XML as a data persistence, transmission, and manipulation format. XML was originally seen as a way to expose information that may or may not follow some structured or hierarchical format, and the XML standards allow XML to be used in a huge range of ways for storing information of almost any type.
The Changing XML Landscape
The System.Xml version 1.0 APIs were developed at a time when XML standards were quite different from those that exist today. The predominant standards at the end of 1999 and early 2000, when the initial APIs for version 1.0 were laid down, were XML 1.0, DOM 1.0 and 2.0, XPath 1.0, and XSLT 1.0. Significant standards such as the XML Information Set (October 24, 2001, http://www.w3.org/TR/xml-infoset/), XML Schema (May 2001, http://www.w3.org/TR/xmlschema-1/), and SOAP 1.2 (CR December 2002, http://www.w3.org/TR/SOAP/) subsequently emerged. These standards are having a positive impact across the industry.
In the last two years there has been a significant shift from viewing XML purely as a text-based format to viewing it as a data representation format, epitomized by the XML Information Set specification, which was developed in recognition of the fact that the DOM API did not define the relationships between the information item types found within an XML document. An XML Information Set (often described just as an XML InfoSet) refers to the information items (the data types) that can be found in an XML document after it has been parsed by an XML parser. This shift of emphasis toward XML defined in terms of information items now appears in many of the W3C specifications. For example, the SOAP 1.2 specification defines a SOAP message as an XML InfoSet rather than as serialized text.
What's more, this has been reflected in all of the significant emerging W3C specifications, such as XQuery, XPath 2.0, and XML Schema 1.0. Each of these specifications is dependent on the definition of XML as InfoSet items.
To draw a parallel: When working with a relational database, the way the data is actually written to disk (serialized) is completely hidden from the developer and the database administrator. This is desirable because it means that the method in use can be changed over time to make storage more efficient without breaking existing code. Rather than having to know about tracks and sectors on the disk, the developer or database administrator deals with the database through a level of abstraction that consists of tables, rows, and columns. This is the relational data model.
The XML InfoSet performs the same role of abstraction. The data may or may not be written as serialized XML 1.0 with angle brackets, but either way the data types that are exposed always appear consistent to the processing application. Handling XML InfoSets provides the ability to expose a multitude of data sources as XML, a technique often referred to as virtualization. The XPathNavigator class provides XML virtualization over a data source in System.Xml today, based on the XPath 1.0 data model, which uses the XML InfoSet item types.
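As a small illustration of this abstraction, the following Python sketch (Python's standard xml.etree.ElementTree module is used here purely for demonstration; it is not the System.Xml API discussed in this article) parses two differently serialized documents and shows that the application sees the same element name, attributes, and text in both cases. The sample documents and the items helper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Two different serializations of the same information:
# different quote characters, attribute order, and empty-element syntax.
doc_a = '<country name="UK" continent="Europe"/>'
doc_b = "<country continent='Europe' name='UK'></country>"

def items(xml_text):
    """Return what the parser exposes: element name, attributes, and text."""
    elem = ET.fromstring(xml_text)
    return elem.tag, dict(elem.attrib), (elem.text or "")

# The angle-bracket text differs, but the exposed items are the same.
same_items = items(doc_a) == items(doc_b)
print(same_items)
```

The comparison ignores exactly the details that belong to the serialization and not to the data, which is the role the InfoSet plays.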
Perhaps one of the most significant specifications to emerge recently is the XQuery and XPath 2.0 Data Model (http://www.w3.org/TR/query-datamodel/), which, being based on the XML InfoSet, defines the data model for the XQuery language in the same fashion as the relational data model does for the SQL query language. In many ways this should really be considered the XML data model, since its importance goes beyond the bounds of just the XQuery and XPath 2.0 languages.
XML and Relational Databases
Relational databases such as SQL Server are the origin of much of the data that is interchanged today and then typically stored again in other relational databases. Since relational systems today are overwhelmingly used to manage structured data, most of the XML exchanged also fits the relational model of structured data. Hence it is becoming increasingly common for XML to be used in scenarios that involve structured data storage and transmission, for example, exposing relational data (the traditional "rows and columns" format), and hierarchical data (such as multiple related data tables).
This movement has also tended to simplify the XML content by concentrating mainly on the use of elements and attributes and avoiding what might be considered the more "esoteric" capabilities of XML (such as entities, notations, and so on). This is the domain of XML as a data interchange format rather than as a document markup language (which was the original driving force behind the introduction of XML).
An XML View is a mapping from a data source, typically one in a non-XML format such as a set of relational tables in a database, to an XML document. These XML Views virtualize the data, in that it isn't actually converted into a serialized XML format but is merely shaped and transformed into a structure as if it were XML, based on the XML InfoSet data types. The use of XML Views over relational databases allows XML documents with a hierarchical data structure to be generated, and also enables the use of XML technologies such as XQuery to provide heterogeneous queries over disparate data sources.
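The idea of an XML View can be sketched roughly as follows. This is a minimal Python illustration using an in-memory SQLite database and the standard ElementTree module, not the .NET classes; the Continent and Country tables, their columns, and the build_view function are hypothetical names chosen for the example. Two related tables are shaped into a nested tree and queried with a path expression, and no serialized XML text is ever produced.

```python
import sqlite3
import xml.etree.ElementTree as ET

def build_view(conn):
    """Shape two related tables into a nested element tree (no XML text is written)."""
    root = ET.Element("Continents")
    continents = conn.execute("SELECT Id, Name FROM Continent ORDER BY Id").fetchall()
    for cont_id, cont_name in continents:
        cont = ET.SubElement(root, "Continent", Name=cont_name)
        countries = conn.execute(
            "SELECT Name FROM Country WHERE ContinentId = ? ORDER BY Name",
            (cont_id,))
        for (country_name,) in countries:
            ET.SubElement(cont, "Country", Name=country_name)
    return root

# Hypothetical sample schema and data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Continent (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Country (Name TEXT, ContinentId INTEGER REFERENCES Continent(Id));
    INSERT INTO Continent VALUES (1, 'America');
    INSERT INTO Continent VALUES (2, 'Europe');
    INSERT INTO Country VALUES ('USA', 1);
    INSERT INTO Country VALUES ('UK', 2);
    INSERT INTO Country VALUES ('France', 2);
""")

view = build_view(conn)
# Query the relational data through its hierarchical, XML-shaped view.
european = [c.get("Name") for c in view.findall("./Continent[@Name='Europe']/Country")]
print(european)
```

A real XML View would map the data lazily and not copy it into a tree; the copy here only keeps the sketch short.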
XML is also a more flexible data format than its relational equivalent because it has the ability to express the semi-structured and unstructured format that applies to much of the data lying outside of relational databases. Semi-structured refers, for example, to properties that appear only on a certain number of elements and not in a regular repeating (structured) fashion. Unstructured format is like that of a Microsoft Word document, typically consisting of "marked-up" data such as the content of the paragraphs in the text. XML has both the simplicity and flexibility to easily represent these differing data structures, and as a result XML has established itself as the lingua franca of data interchange for business and applications.
XML in Web Applications
XML as a data format is well suited to the environment in which Web-based applications live, being a platform-independent syntax for which simple and efficient parsers are widely available. In particular, any situation that involves remoting data (moving it across the network to a client or another server, as we describe in more detail later) can benefit from the use of XML. For example, clients or servers that run on disparate operating systems or software platforms can share XML-formatted data easily. Being persisted as Unicode means that even a 7-bit wire format (such as the "text-only" nature of the Web) for the intervening network does not impede simple and efficient data transmission.
In contrast to this, application- or operating system–specific data formats, particularly binary formats, may well cause all kinds of issues when used with disparate clients. As a simple example, some operating systems may treat 16-bit numbers as big-endian (the leftmost byte, at the lower memory address, is the most significant) while others treat them as little-endian (the leftmost byte is the least significant).
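The byte-order problem is easy to reproduce. The short Python sketch below (the value 0x1234 is an arbitrary example) packs the same 16-bit number both ways with the standard struct module and shows what a receiver sees if it assumes the wrong order.

```python
import struct

value = 0x1234  # an arbitrary 16-bit number

big = struct.pack(">H", value)     # big-endian: most significant byte first
little = struct.pack("<H", value)  # little-endian: least significant byte first

# A receiver that assumes little-endian misreads the big-endian bytes.
misread = struct.unpack("<H", big)[0]

print(big.hex(), little.hex(), hex(misread))
```

A text format such as XML sidesteps this, because the number travels as the characters "4660" (or whatever its decimal form is) and not as raw bytes.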
XML is also the ideal format for simply sending data to a client or receiving it from a client. For example, the Web site or application can provide a browser client with an XML document that is downloaded to the user's browser. Then it can be manipulated within the browser to display the data and (if required) edited and submitted back to the server for subsequent processing. Because the data is in a platform- and operating system–independent format, the code or application at either end of the network can be adapted and changed without reference to changes at the other end, as long as the data format remains the same.
Data Description through Schemas
Even more useful is the ability to use an XML schema with the XML data. A schema specifies the structure and allowable content of the XML document. This permits applications to be built where the client or recipient uses the schema to interpret the structure of the data, with the result that this data format can be changed independently as long as an appropriate schema is available.
An example of this approach is found in the new XML-format metabase used by Internet Information Services (IIS) 6.0. Each time the configuration of IIS is changed, an XML file containing the complete configuration of all the installed services is written to disk. However, as you install and remove services or add sites or virtual roots to IIS, the structure of this file changes in subtle ways to reflect the content it will store.
So, as well as writing the XML data file to disk, IIS also writes the current schema to disk. This means that, irrespective of which services are installed and how the current configuration looks, the saved configuration data can be correctly interpreted and restored each time.
Transformation and Presentation through XSLT Stylesheets
The concept of applying presentation information to an XML document has been one of the core aims of W3C almost since the inception of XML. The original proposals concentrated on two aspects: (1) transformations, which can change the content, structure, and ordering of the XML (to produce, for example, HTML output), and (2) the definition of a formatting language used by specialist client applications to apply style and layout information directly to the XML content.
In data management terms, the first of these two scenarios is the interesting one. The ability to apply a transformation that optionally can change the content, structure, and ordering of the elements means that an XSL or XSLT stylesheet can not only generate output that is aimed at presentation (for example, adding style definitions or HTML elements to the content) but also output a different XML document from that applied as the input.
XSLT is increasingly being used to "process" XML documents. Its recursive nature and reasonably simple syntax mean that it's possible to write generic stylesheets that can process different documents, and it has found a home in many areas and applications that need to convert XML data from one format to another (for example, Microsoft BizTalk).
XML Data Querying through XQuery
One of the more recent areas of development for XML is XML Query Language, or XQuery. This is a project aimed at deriving a technique for applying queries to an XML document that (in the broadest sense) more closely resemble SQL statements. While SQL is ideal as the query language for relational data, it is not designed to be a query language for XML data, since a query language should have as little impedance mismatch to the data model it queries as possible. As mentioned earlier, XQuery is to the XML data model what SQL is to the relational data model and is set to become the preferred query language for XML data querying. This will make XML much more approachable to those developers and administrators more used to working in a relational data environment.
Although XSLT can already perform most of the tasks that XQuery is aimed at, it's already becoming clear that XQuery can provide more options, as well as broader and simpler techniques for accessing multiple documents, than XSLT does. However, in the long term the choice between the two is more likely to be based on the developer's familiarity with each technique and on personal preference, much like the choice of programming language available in the .NET Framework today.
In a nutshell, XQuery provides support for:
XML Content Publishing
While Web applications are a major beneficiary of XML, they are not alone. Another arena where XML is growing is in content publishing. This is a wide and difficult area to define exactly, but in general the term applies to application and service integration that is aimed at exposing data or information to users.
Microsoft's own Content Management Server is an example, including features to build and deploy Web sites and Web Services, manage updates, perform workflow management, and integrate with Microsoft Office applications. It uses XML to transport and expose information in conjunction with templates, and to package content objects.
Similarly, SharePoint Server and Microsoft Office 2003 contain many features based on XML. In fact, XML is at the forefront of integration in Office 2003, with the new InfoPath data collection application storing data as XML and exposing it to the other "traditional" Office applications such as Word and Excel through Smart Documents based on XML.
XML in the .NET Framework
XML has been at the heart of the .NET Framework since version 1.0. Many aspects of data persistence, storage, and transmission depend wholly or partly on XML. For example, while the ADO Recordset object provided a custom binary format by default for persisted data, XML is the only data persistence format supported by the ADO.NET DataSet. The GetXml method always returns XML-formatted data, and the WriteXml method always writes it.
Conversions between Relational and XML Data
Figure demonstrates some of the ways that XML is used within the .NET Framework for access to and conversion of relational data. The ADO.NET objects SqlCommand and DataSet can export or expose their content as XML, and the DataSet can also export the matching schema.
You can also see that the XmlDataDocument can be used to expose XML from relational data, using an existing DataSet as the source of the XmlDataDocument instance. What's new and exciting in version 2.0, however, is shown in the lower section of the schematic. These are the new features of the XPathDocument and its associated new classes. An XmlAdapter provides the link between the XPathDocument and a relational database. The XPathChangeNavigator and XPathEditor are used to navigate through and update the XML content, after which the XmlAdapter can push the changes back into the database.
XML Serialization and Remoting
The .NET Framework also includes a class called XmlSerializer (see Figure) that can be used to serialize objects and class instances to XML—including custom classes you declare—and deserialize them from XML. This makes it easy to build applications that interchange data represented as objects, such as a purchase order or invoice object generated from custom classes.
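The round trip that XML serialization performs can be sketched in a few lines. The following Python example is only an analogy for what XmlSerializer does, not its API; the PurchaseOrder class and the to_xml and from_xml helpers are hypothetical, and the sketch assumes every field is a simple int, str, or float.

```python
from dataclasses import dataclass, fields
import xml.etree.ElementTree as ET

@dataclass
class PurchaseOrder:  # hypothetical custom class for illustration
    order_id: int
    customer: str
    total: float

def to_xml(obj):
    """Serialize a dataclass instance: one child element per field."""
    root = ET.Element(type(obj).__name__)
    for f in fields(obj):
        ET.SubElement(root, f.name).text = str(getattr(obj, f.name))
    return ET.tostring(root, encoding="unicode")

def from_xml(cls, xml_text):
    """Rebuild an instance, converting each element's text to the field's type."""
    root = ET.fromstring(xml_text)
    values = {f.name: f.type(root.findtext(f.name)) for f in fields(cls)}
    return cls(**values)

order = PurchaseOrder(order_id=42, customer="Contoso", total=99.5)
xml_text = to_xml(order)
restored = from_xml(PurchaseOrder, xml_text)
print(xml_text)
print(restored == order)
```

Because the class definition drives both directions, the two ends of an exchange only need to agree on the class shape and the element names.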
XML serialization is useful when you need to remote objects (rather than just plain data) to another location, or even just to another tier of your application. However, if you want to remote only the data itself, another useful feature of the XML classes is that the XmlDataDocument can be used as an alternative to a DataSet. It will maintain all of the schema information required to resurrect the data after transmission, including the data types, relations, and so on. The XmlDataDocument can be filled directly with XML of the required format, which can include nested or related data. Alternatively, it can be instantiated from an existing DataSet, as long as a schema is available.
However, as mentioned earlier, the XPathDocument can also be remoted. Because it implements change tracking, the remoting format has to be able to persist the changes as well as the "current" values of nodes in the document in case this information is required to be persisted in an XPathDocument regenerated from the serialized data. Three scenarios are supported.
The XPathDocument serializes values at Node level. In other words, every node in the document gets serialized and is then available when the document is rebuilt. In an XML document, every element and attribute is a node, so the complete data content of the document is maintained.
Serialization is actually useful not only when remoting data but also for supporting off-line data access and editing. After using an XmlAdapter to fill an XPathDocument, you can save the data to a local disk and disconnect from the database, then work with the data without maintaining a connection. Later you can reconnect and resynchronize the data in the database using the changes in the XPathDocument in much the same way as you would in the relational world.
Migration of the SQLXML 3.0 Technology to .NET
One of the aims of version 2.0 of the .NET Framework is to move the existing SQLXML technology to the .NET platform. SQLXML was originally developed as an add-on for SQL Server 7.0. The aim was to make it easy to extract data from SQL Server as XML and to use specially formatted XML updategrams to push data updates back into SQL Server.
The original Technology Preview of SQLXML involved installing a "filter" that sits in SQL Server and responds to queries that contain the FOR XML keywords. SQL Server 2000 has the SQLXML version 3.0 technology built in, and it automatically detects and processes SQL statements that are XML queries. For example, the following SQL statement:
SELECT * FROM Country WHERE Name LIKE 'U%' FOR XML AUTO
returns a series of XML elements that represent each matching row in the database table, with each column represented as an attribute of the row:
<Country Name="USA" Continent="America" Size="Very large" />
<Country Name="UK" Continent="Europe" Size="Quite small" />
... etc ...
These elements can easily be wrapped in a suitable preamble and root element to turn them into an XML document. In .NET, SQLXML queries that return XML elements are handled by an XmlReader returned from a call to the ExecuteXmlReader method of a Command object connected to the database.
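To make the row-to-element mapping and the wrapping step concrete, here is a rough Python emulation. SQLite has no FOR XML clause, so the rows_as_elements helper below is a hypothetical stand-in that builds one element per row with each column as an attribute; the Countries root element name and the extra France row are assumptions added for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

def rows_as_elements(conn, element_name, sql, params=()):
    """Return one element per result row, with each column as an attribute."""
    cur = conn.execute(sql, params)
    columns = [d[0] for d in cur.description]
    return [ET.Element(element_name, dict(zip(columns, map(str, row))))
            for row in cur]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Country (Name TEXT, Continent TEXT, Size TEXT)")
conn.executemany("INSERT INTO Country VALUES (?, ?, ?)", [
    ("USA", "America", "Very large"),
    ("UK", "Europe", "Quite small"),
    ("France", "Europe", "Medium"),   # extra row that the query filters out
])

elements = rows_as_elements(
    conn, "Country",
    "SELECT * FROM Country WHERE Name LIKE ? ORDER BY Name", ("U%",))

# Wrap the row elements in a root element and add a preamble.
root = ET.Element("Countries")
root.extend(elements)
document = '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")
print(document)
```

The result is a well-formed document that any XML parser can load, which is the same step you would perform on the fragment stream read from the XmlReader.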
However, it would be useful to be able to use this type of data query, and return XML, with databases other than SQL Server and to make it more compatible with the techniques used in the Framework. This is achieved in version 2.0 of the .NET Framework through the XmlAdapter and XPathDocument you saw in Figure, though in the current release this is limited to the SQL Server "Yukon" version. Instead of having to learn complex rules for specifying the format of the data returned from the database and all the different options available for updategrams, you use an approach and a syntax that are similar to those used in relational data access through the ADO.NET DataAdapter.
Server-Side Data Binding to XML
In version 2.0, the .NET Framework also supports server-side UI data binding in Web Forms (ASP.NET) and Windows Forms (executable) applications to XML documents stored in the XPathDocument and XmlDocument classes. This provides a fast and efficient technique for displaying data that is persisted as XML in Web pages and Web applications and in your .NET executable programs.