Describing the File Formats

As discussed in Chapter 6, we're going to use our own XML-based language to describe the formats of our non-XML files. Although the different formats have many characteristics in common, we'll have a different language for each. In addition, because some characteristics, such as those dealing with XML output, apply to only one direction of conversion, we'll have a separate language (and schema) for converting to XML and for converting from XML. In this section I describe the organization, Elements, Attributes, and usage of the file description documents. In the design section we'll review the schemas for the file description documents as well as those for the sample invoice and purchase order.

The CSV file description documents have three major sections, each represented by an Element that is a direct child of the root Element.

  • PhysicalCharacteristics: the CSV file characteristics

  • XMLOutputCharacteristics: XML output characteristics, required only when converting to XML

  • Grammar: CSV file grammar

The language that describes conversion from XML is a subset of the one that describes conversion to XML. Given this, the file description documents and the schemas that specify them are very similar. We'll review them in the upcoming section on high-level design.

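The following three subsections describe these major sections in turn. Each is handled by an Element that is a child of the file description document's root Element, CSVFileDescription (named CSVSourceFileDescription or CSVTargetFileDescription in the example documents at the end of this section, depending on the direction of conversion).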

CSV Physical Characteristics

The CSV file's physical characteristics are described in the PhysicalCharacteristics Element. For CSV formats we need to specify the record terminator and the column and text delimiters.

The table "Child Elements of the PhysicalCharacteristics Element" lists these child Elements. All are required.

XML Output Characteristics

Characteristics governing the output XML documents are described in the XMLOutputCharacteristics Element. This Element is used only when converting from CSV files to XML.

The table "Child Elements of the XMLOutputCharacteristics Element" lists these child Elements.

Child Elements of the PhysicalCharacteristics Element

| Child Element | Attribute | Schema Language Data Type | Description | Allowable Values |
|---|---|---|---|---|
| RecordTerminator | value | union of U, W, and hexBinary | Specifies the physical record terminator as a UNIX-style line feed character, a Windows-style carriage return and line feed pair, or a hexadecimal value | U, W, or a two-character hexadecimal number from 00 through FF representing a single byte |
| ColumnDelimiter | value | union of a single token character and hexBinary | Column delimiter expressed as a literal character or a hexadecimal value | A single nonwhitespace character or a two-character hexadecimal number from 00 through FF representing a single byte |
| TextDelimiter | value | union of a single token character and hexBinary | Text delimiter expressed as a literal character or a hexadecimal value | A single nonwhitespace character or a two-character hexadecimal number from 00 through FF representing a single byte |
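
To make the allowable values concrete, here is a minimal Python sketch of how a processor might turn the value Attributes into actual characters. The function names are hypothetical and are not part of the utilities developed in this book, and mapping a hexadecimal byte value directly to a character assumes a single-byte encoding.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def _hex_to_char(value: str) -> str:
    """Decode a two-character hexadecimal number (00 through FF) as one character."""
    if len(value) != 2 or any(ch not in HEX_DIGITS for ch in value):
        raise ValueError(f"expected two hexadecimal digits, got {value!r}")
    # Assumption: one byte maps to one character (single-byte encoding).
    return chr(int(value, 16))


def resolve_record_terminator(value: str) -> str:
    """Map a RecordTerminator value (U, W, or two hex digits) to the terminator."""
    if value == "U":
        return "\n"      # UNIX-style line feed
    if value == "W":
        return "\r\n"    # Windows-style carriage return and line feed pair
    return _hex_to_char(value)


def resolve_delimiter(value: str) -> str:
    """Map a ColumnDelimiter or TextDelimiter value to a single character."""
    if len(value) == 1 and not value.isspace():
        return value     # literal, nonwhitespace character
    return _hex_to_char(value)
```

Because a literal delimiter is always one character and a hexadecimal value is always two, the length of the Attribute value is enough to tell the two forms apart. A tab delimiter, for example, would be written as 09.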

CSV File Grammar

The data types of the columns and other characteristics of the CSV file (that is, the grammar) are described in the Grammar Element. One important point is a new aspect of functionality beyond that provided in Chapter 3's basic utility: because we are describing the grammar of the CSV file, we can also specify the names of the Elements used in the corresponding XML representation. We are no longer restricted to the "Row" and "ColumnXX" Element names that we used in Chapters 2 and 3. Although this adds a little complexity to the processing, it lets us assign semantically meaningful names to the Elements in the XML document.

The table "CSV Grammar Characteristics in the Grammar Element" shows the Grammar Element and its children. All are required unless noted. Indentation in the Element column shows hierarchical parent/child relationships. The Allowable Child Element column lists the specific details of the hierarchy.

Child Elements of the XMLOutputCharacteristics Element

| Child Element | Attribute | Schema Language Data Type | Description | Allowable Values, Restrictions, or Comments |
|---|---|---|---|---|
| SchemaLocationURL | value | anyURI | URL of the schema file for the output document. Will be written as the value of the root Element's noNamespaceSchemaLocation Attribute. | Optional. If not specified, the noNamespaceSchemaLocation Attribute will not be written. An error will occur if output validation is requested and this Element is not present. |
| DocumentBreakColumn | value | nonNegativeInteger | Number of the column that dictates the start of a new document when its content changes (for example, an invoice number in the second column). | Required. If a value of zero is specified, the entire CSV file will be written to a single XML document in the output directory. |
| PartnerBreakColumn | value | nonNegativeInteger | Number of the column that dictates a different trading partner when its content changes (for example, a customer number in the first column). | Required. If a value of zero is specified, all output documents will be created in the output directory instead of creating a separate subdirectory for each trading partner. If DocumentBreakColumn is zero, this Element is ignored. |
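
The break column rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the utility's actual logic: the function name and the tuple it yields are hypothetical, rows are assumed to be already split into columns, and each document's trading partner is taken from the document's first row.

```python
from typing import Iterator, List, Optional, Sequence, Tuple

Row = Sequence[str]


def split_into_documents(
    rows: Sequence[Row],
    document_break_column: int,
    partner_break_column: int,
) -> Iterator[Tuple[Optional[str], List[Row]]]:
    """Yield (partner_key, document_rows); partner_key is None when unused."""
    if document_break_column == 0:
        # Zero: the whole file is one document; PartnerBreakColumn is ignored.
        if rows:
            yield None, list(rows)
        return

    def partner_of(row: Row) -> Optional[str]:
        # Zero: no per-partner subdirectory, so there is no partner key.
        if partner_break_column == 0:
            return None
        return row[partner_break_column - 1]  # column numbers start at one

    doc_index = document_break_column - 1
    current: List[Row] = []
    for row in rows:
        # A change in the break column's content starts a new document.
        if current and row[doc_index] != current[0][doc_index]:
            yield partner_of(current[0]), current
            current = []
        current.append(row)
    if current:
        yield partner_of(current[0]), current
```

With DocumentBreakColumn set to 2 and PartnerBreakColumn set to 1, as in an invoice file with the customer number first and the invoice number second, each run of rows sharing an invoice number becomes one document, tagged with the customer number that would name its subdirectory.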

CSV Grammar Characteristics in the Grammar Element

| Element | Allowable Child Element | Attribute | Schema Language Data Type | Description | Allowable Values, Restrictions, or Comments |
|---|---|---|---|---|---|
| Grammar | RowDescription | | | Describes the grammar of both the CSV file and the corresponding XML representation | The Grammar Element has only a single RowDescription child Element due to the restriction that all the rows in the CSV file have the same organization. |
| | | ElementName | NMTOKEN | Specifies the name of the document's root Element | When creating XML documents, the specified name is assigned to the document's root Element. When creating a CSV file, the input XML document's root Element must match this name. |
| RowDescription | ColumnDescription | | | Describes a row in a CSV file and the corresponding XML representation | At least one ColumnDescription child Element is required. |
| | | ElementName | NMTOKEN | Specifies the name of the Element representing a row | Maximum length reflects restriction on the length of Element names. |
| ColumnDescription | None | | | Describes a column in a CSV file and the corresponding XML representation | One ColumnDescription Element is required for each column in a CSV file row. All columns must be specified. |
| | | ElementName | NMTOKEN | Specifies the name of the Element representing a column | Maximum length reflects restriction on the length of Element names. |
| | | FieldNumber | positiveInteger | Specifies the number of a column, starting at one | Maximum value reflects restriction on the number of columns per row. |
| | | DataType | token | Specifies the data type of a column in the CSV file | The supported data types developed in this chapter are listed in the CSV File Data Types table. The Grammar data type code values are used. |
| | | DelimitText | boolean | Indicates (for XML to CSV conversion) whether or not to use the TextDelimiter when writing the column | Optional |
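
As a rough illustration of how the grammar drives Element naming, the following Python sketch builds the Element for one CSV row from a list of (FieldNumber, ElementName) pairs. The function and the pair representation are hypothetical, data type conversion is left out, and writing an empty Element for an empty column is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import List, Sequence, Tuple

# (FieldNumber, ElementName) pairs, as they might be read from the
# ColumnDescription Elements of a file description document.
Columns = List[Tuple[int, str]]


def row_to_element(row_name: str, columns: Columns, fields: Sequence[str]) -> ET.Element:
    """Build the Element for one CSV row, naming children from the grammar."""
    # Every column in the row must have a ColumnDescription.
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} columns, got {len(fields)}")
    row = ET.Element(row_name)
    for field_number, element_name in sorted(columns):
        child = ET.SubElement(row, element_name)
        child.text = fields[field_number - 1]  # FieldNumber starts at one
    return row


columns = [(1, "CustomerNumber"), (2, "InvoiceNumber"), (3, "InvoiceDate")]
line = row_to_element("InvoiceLine", columns, ["C100", "INV-7", "2004-03-07"])
print(ET.tostring(line, encoding="unicode"))
```

The row Element and its children take their names from the grammar rather than from fixed "Row" and "ColumnXX" names.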

The CSV File Data Types table lists the data types developed in this chapter. Other types will be added in later chapters. The Enhancements and Alternatives section at the end of this chapter describes how to add new data types. For each type, the table shows the data as it appears in the CSV file, the Grammar data type code, the corresponding schema language data type as used in the XML representation, and comments about the data type.

CSV File Data Types

| CSV Data Type | Grammar Data Type Code | Schema Language Data Type | Comments |
|---|---|---|---|
| Alphanumeric | AN | string | When converting from CSV to XML, leading and trailing spaces, tabs, line feeds, and carriage returns are trimmed. (Actually, any character with an integer value less than or equal to a space character is trimmed.) All other white space within the string is preserved. |
| Real number | R | decimal | When converting from CSV to XML, leading zeroes and whitespace are trimmed. Leading minus signs are preserved, and leading plus signs are removed. The code currently supports only the period character (.) as the decimal point. |
| Date in MM/DD/YYYY format | DMMsDDsYYYY | date | Month and day may be either one or two digits each. |
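
The CSV to XML conversions in the Comments column could be sketched in Python roughly as follows. The function names are hypothetical, and two details are assumptions rather than documented behavior: one zero is kept in front of a bare decimal point, and invalid input raises ValueError.

```python
import re
from datetime import date

# Characters with an integer value less than or equal to the space character.
_TRIM_CHARS = "".join(chr(code) for code in range(0x21))


def convert_alphanumeric(raw: str) -> str:
    """AN: trim leading and trailing characters <= space; keep inner white space."""
    return raw.strip(_TRIM_CHARS)


def convert_real(raw: str) -> str:
    """R: trim white space and leading zeroes, drop a leading plus, keep a minus."""
    text = raw.strip()
    sign = ""
    if text.startswith("-"):
        sign, text = "-", text[1:]
    elif text.startswith("+"):
        text = text[1:]
    text = text.lstrip("0")
    if text == "" or text.startswith("."):
        text = "0" + text  # assumption: keep one zero before the decimal point
    # Only the period is supported as the decimal point.
    if not re.fullmatch(r"[0-9]+(\.[0-9]*)?", text):
        raise ValueError(f"not a real number: {raw!r}")
    return sign + text


def convert_date(raw: str) -> str:
    """DMMsDDsYYYY: convert M/D/YYYY or MM/DD/YYYY to the schema form YYYY-MM-DD."""
    match = re.fullmatch(r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})", raw)
    if match is None:
        raise ValueError(f"not a MM/DD/YYYY date: {raw!r}")
    month, day, year = (int(part) for part in match.groups())
    return date(year, month, day).isoformat()  # date() rejects impossible dates
```

For example, a CSV value of 3/7/2004 in a DMMsDDsYYYY column would be written to the XML document as 2004-03-07, the lexical form that the schema language's date type requires.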

Example File Description Documents

Here are the file description documents for the invoice and purchase order examples.

Sample InvoiceCSVSourceDescription.xml

<?xml version="1.0" encoding="UTF-8"?>
<CSVSourceFileDescription
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="CSVSourceFileDescription.xsd">
  <PhysicalCharacteristics>
    <RecordTerminator value="W"/>
    <ColumnDelimiter value=","/>
    <TextDelimiter value="&quot;"/>
  </PhysicalCharacteristics>
  <XMLOutputCharacteristics>
    <DocumentBreakColumn value="2"/>
    <PartnerBreakColumn value="1"/>
    <SchemaLocationURL value="CSVInvoice.xsd"/>
  </XMLOutputCharacteristics>
  <Grammar ElementName="Invoice">
    <RowDescription ElementName="InvoiceLine">
      <ColumnDescription FieldNumber="1"
          ElementName="CustomerNumber" DataType="AN"/>
      <ColumnDescription FieldNumber="2"
          ElementName="InvoiceNumber" DataType="AN"/>
      <ColumnDescription FieldNumber="3"
          ElementName="InvoiceDate" DataType="DMMsDDsYYYY"/>
      <ColumnDescription FieldNumber="4"
          ElementName="PONumber" DataType="AN"/>
      <ColumnDescription FieldNumber="5"
          ElementName="DueDate" DataType="DMMsDDsYYYY"/>
      <ColumnDescription FieldNumber="6"
          ElementName="ShipToName" DataType="AN"/>
      <ColumnDescription FieldNumber="7"
          ElementName="ShipToStreet1" DataType="AN"/>
      <ColumnDescription FieldNumber="8"
          ElementName="ShipToStreet2" DataType="AN"/>
      <ColumnDescription FieldNumber="9"
          ElementName="ShipToCity" DataType="AN"/>
      <ColumnDescription FieldNumber="10"
          ElementName="ShipToStateOrProvince" DataType="AN"/>
      <ColumnDescription FieldNumber="11"
          ElementName="ShipToPostalCode" DataType="AN"/>
      <ColumnDescription FieldNumber="12"
          ElementName="ShipToCountry" DataType="AN"/>
      <ColumnDescription FieldNumber="13"
          ElementName="ItemID" DataType="AN"/>
      <ColumnDescription FieldNumber="14"
          ElementName="ItemQuantity" DataType="R"/>
      <ColumnDescription FieldNumber="15"
          ElementName="UnitPrice" DataType="R"/>
      <ColumnDescription FieldNumber="16"
          ElementName="ItemDescription" DataType="AN"/>
      <ColumnDescription FieldNumber="17"
          ElementName="ExtendedPrice" DataType="R"/>
    </RowDescription>
  </Grammar>
</CSVSourceFileDescription>

Sample PurchaseOrderCSVTargetDescription.xml

<?xml version="1.0" encoding="UTF-8"?>
<CSVTargetFileDescription
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="CSVTargetFileDescription.xsd">
  <PhysicalCharacteristics>
    <RecordTerminator value="U"/>
    <ColumnDelimiter value=","/>
    <TextDelimiter value="&quot;"/>
  </PhysicalCharacteristics>
  <Grammar ElementName="PurchaseOrder">
    <RowDescription ElementName="POLine">
      <ColumnDescription FieldNumber="1"
          ElementName="CustomerNumber" DataType="AN"/>
      <ColumnDescription FieldNumber="2"
          ElementName="PONumber" DataType="AN"/>
      <ColumnDescription FieldNumber="3"
          ElementName="PODate" DataType="DMMsDDsYYYY"/>
      <ColumnDescription FieldNumber="4"
          ElementName="RequestedDeliveryDate"
          DataType="DMMsDDsYYYY"/>
      <ColumnDescription FieldNumber="5"
          ElementName="ShipToName" DataType="AN"
          DelimitText="true"/>
      <ColumnDescription FieldNumber="6"
          ElementName="ShipToStreet1" DataType="AN"
          DelimitText="true"/>
      <ColumnDescription FieldNumber="7"
          ElementName="ShipToStreet2" DataType="AN"
          DelimitText="true"/>
      <ColumnDescription FieldNumber="8"
          ElementName="ShipToCity" DataType="AN"
          DelimitText="true"/>
      <ColumnDescription FieldNumber="9"
          ElementName="ShipToStateOrProvince" DataType="AN"/>
      <ColumnDescription FieldNumber="10"
          ElementName="ShipToPostalCode" DataType="AN"/>
      <ColumnDescription FieldNumber="11"
          ElementName="ShipToCountry" DataType="AN"/>
      <ColumnDescription FieldNumber="12"
          ElementName="ItemID" DataType="AN"/>
      <ColumnDescription FieldNumber="13"
          ElementName="OrderedQty" DataType="R"/>
      <ColumnDescription FieldNumber="14"
          ElementName="UnitPrice" DataType="R"/>
      <ColumnDescription FieldNumber="15"
          ElementName="ItemDescription" DataType="AN"
          DelimitText="true"/>
    </RowDescription>
  </Grammar>
</CSVTargetFileDescription>
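
To show how a program might read documents like these, here is a small Python sketch that uses the standard library's ElementTree. It is not the loader developed in this book: the function name and the dictionary layout are hypothetical, the embedded document is an abbreviated two-column illustration rather than the full sample, and treating a missing DelimitText Attribute as false is an assumption.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical description document used only for this sketch.
SAMPLE = """
<CSVTargetFileDescription>
  <PhysicalCharacteristics>
    <RecordTerminator value="U"/>
    <ColumnDelimiter value=","/>
    <TextDelimiter value="&quot;"/>
  </PhysicalCharacteristics>
  <Grammar ElementName="PurchaseOrder">
    <RowDescription ElementName="POLine">
      <ColumnDescription FieldNumber="1"
          ElementName="CustomerNumber" DataType="AN"/>
      <ColumnDescription FieldNumber="2"
          ElementName="ShipToName" DataType="AN"
          DelimitText="true"/>
    </RowDescription>
  </Grammar>
</CSVTargetFileDescription>
"""


def load_description(xml_text: str) -> dict:
    """Read the three major sections of a file description document."""
    root = ET.fromstring(xml_text)
    physical = {c.tag: c.get("value") for c in root.find("PhysicalCharacteristics")}
    # XMLOutputCharacteristics is present only when converting to XML.
    output_node = root.find("XMLOutputCharacteristics")
    output = {} if output_node is None else {c.tag: c.get("value") for c in output_node}
    grammar = root.find("Grammar")
    row = grammar.find("RowDescription")
    columns = [
        {
            "field_number": int(col.get("FieldNumber")),
            "element_name": col.get("ElementName"),
            "data_type": col.get("DataType"),
            # Assumption: a missing optional DelimitText means false.
            "delimit_text": col.get("DelimitText", "false") in ("true", "1"),
        }
        for col in row.findall("ColumnDescription")
    ]
    return {
        "root_name": grammar.get("ElementName"),
        "row_name": row.get("ElementName"),
        "physical": physical,
        "output": output,
        "columns": columns,
    }


description = load_description(SAMPLE)
print(description["root_name"], description["row_name"], len(description["columns"]))
```

The sketch does no validation of its own. A real processor would first validate the description document against its schema, which is what makes it safe to assume that the required Elements and Attributes are present.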

Note that in these documents and in the associated schemas, the URLs for the schema locations are all relative: they specify only the file name, not the full path. A processor therefore expects the schemas to reside in the same location as the documents that reference them. I'll follow this convention in this and the next two chapters.
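
As a brief illustration of what a relative schema location implies, the Python sketch below resolves a schema file name against the URL of the document that references it. The function name and the example.com URLs are hypothetical.

```python
from urllib.parse import urljoin


def resolve_schema_location(document_url: str, schema_location: str) -> str:
    """Resolve a (possibly relative) schema location against the document's URL."""
    return urljoin(document_url, schema_location)


# Hypothetical location of an output document and its relative schema reference.
resolved = resolve_schema_location(
    "http://example.com/formats/Invoice0001.xml", "CSVInvoice.xsd"
)
print(resolved)
```

Because the schema location is only a file name, the resolved URL always points into the same directory as the referencing document.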

