C++ Implementation





MSXML makes the validation task a bit easier in some ways and a bit harder in others. Validation is easier once you've figured out how MSXML does things, but figuring out how it does things can be kind of hard.

One of the benefits of MSXML is that it has a separate validate method in addition to the load method. This is very handy, but it is also a bit confusing because the default behavior of the load method is to validate while parsing (a feature of MSXML that is not very well documented). This behavior can be changed, as we'll see with input validation. That aside, separating loading and validation is handy because when XML is our output, once we have built the DOM Document in memory we can call the validate method to validate the DOM Document against its schema before we save it.

There is, however, a bit of setup involved in validation and there are some "gotchas" lurking out there to bite unwary passersby. Let's look at input validation first because it is simpler.

Input Validation in XMLToCSVBasic.cpp

As I noted, the default behavior of the load method is to validate while parsing. So, if we haven't specified validation on the command line, we want to disable validation while loading. We do that by setting the validateOnParse property of the DOM Document to false.

spDocInput->validateOnParse = VARIANT_FALSE;

But what should we do if we want to validate? Do we perform validation while parsing via the load method, or do we just load without validation and then call validate? The quick answer is "both." We want to validate while parsing and call the validate method. Why both calls? The main reason is that the validate method does not report as much error information in the ParseError object as does the load method. So, we want to use load, with the default behavior of validating while parsing, as our primary validation tool. However, the load method by itself is not sufficient. In my testing I found that load was unable to report an error in cases where either there was no schema specified or there was a problem with the schema. As a result, for the most reliable validation we want to call both methods. Since we don't need to do anything to change the default behavior of load, we only need to add the call to validate. Here's the code to add to XMLToCSVBasic.cpp.

Validation Code in XMLToCSVBasic.cpp
//  Validate the input document
if (boValidate)
{
  spParseError = spDocInput->validate();
  if( spParseError->errorCode != S_OK)
  {
    cerr << "Validation Error" << endl;
    displayParseError(spParseError);
    throw cValidationError;
  }
}

The only real trick here is that the two methods report errors differently. The load operation returns just a status, so you need to explicitly call the Document's getParseError to get the details, whereas the validate method returns an IXMLDOMParseError directly. We pass that to the displayParseError routine and then throw an exception to exit the try block. Errors returned by validate typically carry only the reason text and the error code, not any other useful information.
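The reasoning for making both calls can be sketched without MSXML itself. What follows is a minimal, portable model, not code from XMLToCSVBasic.cpp. ParseErrorInfo, StubDocument, loadAndValidate, and describe are hypothetical names, the error codes 1 and 2 are made up, and the stub only imitates the behavior described above: a validating load that gives position information but stays silent when no schema is found, and a validate call that reports less detail but does notice the missing schema.

```cpp
#include <sstream>
#include <string>

// Simplified, hypothetical stand-in for the details a parse error exposes.
struct ParseErrorInfo {
    long errorCode;       // 0 means success
    std::string reason;
    long line;            // 0 when no position is reported
    long linepos;
};

// Hypothetical stub that imitates the behavior described in the text.
struct StubDocument {
    bool validateOnParse = true;  // validation while parsing is the default
    bool hasSchema = true;        // does the instance name a usable schema?
    bool contentValid = true;     // does the content conform to that schema?

    // Validating load: detailed errors, but silent when no schema is found.
    ParseErrorInfo load(const std::string& /*path*/) const {
        if (validateOnParse && hasSchema && !contentValid)
            return ParseErrorInfo{1, "Content is invalid according to the schema.", 12, 5};
        return ParseErrorInfo{0, "", 0, 0};
    }

    // Separate validation: terse errors, but it notices a missing schema.
    ParseErrorInfo validate() const {
        if (!hasSchema)
            return ParseErrorInfo{2, "The root element has no associated DTD/schema.", 0, 0};
        if (!contentValid)
            return ParseErrorInfo{1, "Content is invalid according to the schema.", 0, 0};
        return ParseErrorInfo{0, "", 0, 0};
    }
};

// Load first (validating only if requested), then call validate as a backstop.
template <typename Doc>
ParseErrorInfo loadAndValidate(Doc& doc, const std::string& path, bool validateRequested) {
    doc.validateOnParse = validateRequested;
    ParseErrorInfo err = doc.load(path);
    if (err.errorCode == 0 && validateRequested)
        err = doc.validate();
    return err;
}

// Format an error for display, adding the position only when one was reported.
std::string describe(const ParseErrorInfo& err) {
    std::ostringstream out;
    out << "Error " << err.errorCode << ": " << err.reason;
    if (err.line > 0)
        out << " (line " << err.line << ", column " << err.linepos << ")";
    return out.str();
}
```

With hasSchema set to false, loadAndValidate returns the error from the stub's validate call even though load succeeded, which is the case a validating load alone would miss. With contentValid set to false, the error comes from load and includes the line and column.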

Output Validation in CSVToXMLBasic.cpp

There were a few idiosyncrasies with validating XML as input, but validating output can be downright tricky. I expected output validation to be a piece of cake, but it took me a couple of hours to figure out what was going on and to make it work correctly. Going through the exercise may help you understand a bit more about how MSXML does things.

The first time I coded the routine I just added a block of code to CSVToXMLBasic.cpp that was almost identical to the input validation snippet shown above in XMLToCSVBasic.cpp. It didn't work. I got a validation failure message indicating that "the root element had no associated DTD/schema."

MSXML stores schemas internally and makes them available through the IXMLDOMSchemaCollection/XMLSchemaCache object. I was already familiar with the "Validating an XML Document against an XML Schema Using C++" example in the MSXML online documentation. The example goes through a somewhat convoluted process of creating a schema collection, associating it with the instance document to be validated, creating and loading the schema document, and finally adding it to the schema collection before calling the instance document's validate method. I guessed that MSXML couldn't identify the schema merely from the root Element's noNamespaceSchemaLocation Attribute. So, I added similar code to my main routine. The document still didn't validate, but I was getting closer to the source of the problem. This time the validation failure message indicated that the noNamespaceSchemaLocation Attribute "is not defined in the DTD/Schema." So, it didn't recognize that Attribute as being part of my instance document's default target namespace. However, it didn't recognize the Attribute as being from the xsi namespace either!

This led me to review how I had added that Attribute. It occurred to me that it might be significant that MSXML does not offer the DOM Level 2 setAttributeNS method on the Element interface. This is one of the rare cases in which MSXML doesn't support the standards as well as Xerces. Since it wasn't offered, I had fallen back to adding the noNamespaceSchemaLocation Attribute using the setAttribute method. However, on further investigation the MSXML documentation advised me that a namespace qualified Attribute could not be added using that method. Instead the Attribute must be added using the Document interface's createNode method. So, I modified the code to add the Attribute, as you see in the next snippet.

Adding the noNamespaceSchemaLocation Attribute
//  Next set the schema location.
//  MSXML requires that namespace qualified Attributes
//  be created as Nodes, then set.
variant_t varType((short)NODE_ATTRIBUTE);
spSchemaLocationAttribute = spDocOutput->createNode(
  varType,"xsi:noNamespaceSchemaLocation",
  "http://www.w3.org/2001/XMLSchema-instance");
spEleRoot->setAttributeNode(spSchemaLocationAttribute);
// We can finally set it now.
spEleRoot->setAttribute("xsi:noNamespaceSchemaLocation",
  cSchemaFileFullPath);

This finally worked: The instance document validated before I called the save method. However, it made me wonder whether or not all the schema cache stuff was really necessary. After all, we didn't need it on input validation. So, I commented out the code that created the schema collection, read the schema document, and so on, then just cut straight to the validate method. It worked. It seems that since MSXML didn't recognize the noNamespaceSchemaLocation Attribute as being in the xsi namespace, it didn't use the value of that Attribute to load the schema document. Once MSXML understood the Attribute properly, MSXML used it. I hacked out the schema cache code, and what you see in CSVToXMLBasic.cpp is almost identical to what you see in XMLToCSVBasic.cpp.
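To see why the namespace association matters, here is a small, portable model of the idea. It is only a sketch and makes no claim about MSXML's internals. ElementModel and its methods are hypothetical names, and the assumption is simply that a schema-aware processor looks Attributes up by namespace URI plus local name rather than by the prefixed string.

```cpp
#include <map>
#include <string>
#include <utility>

const std::string kXsiNamespace = "http://www.w3.org/2001/XMLSchema-instance";

// Hypothetical model: attributes keyed by (namespace URI, local name).
class ElementModel {
public:
    // Namespace-unaware set: the whole string, prefix included, is stored
    // as a plain name that belongs to no namespace.
    void setAttribute(const std::string& name, const std::string& value) {
        attributes_[std::make_pair(std::string(), name)] = value;
    }

    // Namespace-aware set: the prefix is dropped and the URI is recorded.
    void setAttributeNS(const std::string& namespaceUri,
                        const std::string& qualifiedName,
                        const std::string& value) {
        attributes_[std::make_pair(namespaceUri, localName(qualifiedName))] = value;
    }

    // Namespace-aware lookup, the kind this model assumes a validator performs.
    bool find(const std::string& namespaceUri, const std::string& local,
              std::string& valueOut) const {
        std::map<Key, std::string>::const_iterator it =
            attributes_.find(std::make_pair(namespaceUri, local));
        if (it == attributes_.end())
            return false;
        valueOut = it->second;
        return true;
    }

private:
    typedef std::pair<std::string, std::string> Key;

    // Return the part of a qualified name after the colon, if there is one.
    static std::string localName(const std::string& qualifiedName) {
        std::string::size_type colon = qualifiedName.find(':');
        return colon == std::string::npos ? qualifiedName
                                          : qualifiedName.substr(colon + 1);
    }

    std::map<Key, std::string> attributes_;
};
```

In this model, an Attribute stored through setAttribute as "xsi:noNamespaceSchemaLocation" is not found by a lookup for noNamespaceSchemaLocation in the xsi namespace, while one stored through setAttributeNS is. That matches the symptom above: until the Attribute was created with its namespace URI, its value was not used to locate the schema.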

In some cases you may actually need to load and use an MSXML schema cache. Be aware that they exist, and if you have trouble with validation, try using one. However, the code works fine without them as long as you're doing pretty basic stuff such as I show in these utilities. I'm generally not a fan of writing code I don't have to write, so you won't see me using schema caches.

