CSV to XML: Functionality and Operation





CSV to XML: Functionality and Operation

Requirements

The general requirement for this utility is to convert a CSV file containing one or more logical documents to one or more XML instance documents, each representing a single logical document. This is a significant improvement over the utility presented in Chapter 3, which was restricted to handling a single logical document. In addition, many of the restrictions on both the CSV file format and the grammar of the resulting XML document are removed. Here's a summary of the required functionality.

  • Inputs: A CSV file in uniform Row/Column organization, that is, each row has the same column entries. Records are of variable length, and columns are separated by a specified terminator. Column contents may be delimited by a text delimiter character in the first and last characters of the column. The input CSV file may consist of more than one logical document. The second input is an XML file description document (as discussed in Chapter 6) describing the CSV file and the grammar of the XML document to be produced.

  • Processing: Each input CSV row is written to an Element whose name is derived from the file description document. Each column in the row is written to a child Element, with the name also derived from the file description document. Column content is converted to Schema data types as specified in the file description document. Empty columns do not create Elements in the output document. Processing breaks occur on a new logical document and optionally on a new trading partner.

  • Outputs: One or more XML instance documents, each in a single file. The root Element name is derived from the grammar in the file description document. The file name is formed by appending a three-digit sequence number to the root Element name and adding the extension .xml. If break on trading partner has been specified, the documents for each trading partner are placed in a separate subdirectory. The subdirectories are named according to the trading partner IDs in the Partner Break column.

It may be fairly obvious why we want to create a different XML document for each logical document in the input CSV file. However, it might be helpful to review why we might want to segregate the XML documents for different trading partners into different directories. Despite the proliferation of "standards" (remember the old line, "I love standards, there are so many to choose from"), it is still customary for different trading partners to require equivalent business data to be processed using different formats. Even when they choose, for example, the same version of a UBL Order, they may use it differently. Given that, we might want to apply different XSLT transformations for the documents we send to different trading partners. Even if the documents are exactly the same format (which is unlikely), we might still want to segregate instance documents by destination since each trading partner has a unique network destination and perhaps uses different file transfer protocols, security measures, and so on. Separate directories facilitate this segregation.

Running the Utility

This section provides instructions for running the revised CSV to XML Converter utility from the command line.

For Java:

java CSVToXML InputFile.CSV OutputDirectory FileDescription.XML

or

java CSVToXML -h

For C++ on Win32:

CSVToXML InputFile.CSV OutputDirectory FileDescription.XML

or

CSVToXML -h

Options follow the parameters except for the help option, which may be specified by itself.

Parameters:

  • First: File specification of the input CSV file (required). The specification may include the full or relative path name. If no path name is specified, the file is assumed to reside in the current working directory. The full file name must be specified, but there is no restriction on the extension name.

  • Second: Path specification of the output directory (required). The directory must exist. Either a relative or an absolute path name may be specified. The trailing directory separator character is optional. If no break on trading partner is specified, all the created XML files are placed in this directory. If break on trading partner has been specified, then a subdirectory for each trading partner is created beneath this directory.

  • Third: File specification of the File Description XML instance document (required). If no path name is specified, the file is assumed to reside in the current working directory. The full file name must be specified, but there is no restriction on the extension name.

Options:

  • -v (Validate): Validate the created XML documents before writing them to disk. The documents are validated against the schema specified in the file description document.

  • -h (Help): Display a help message and exit without further processing.

Restrictions:

Unless otherwise noted, all numeric limits may be modified by changing parameters in the program source and appropriate type definitions in the file description document schemas.

  • Column delimiter: Any character, not just a comma, may be specified as the column delimiter. However, the selected column delimiter may not appear in a column's contents unless the column's contents are delimited by the specified text delimiter.

  • Text delimiter: Any character, not just a quotation mark, may be specified as the text delimiter. When used the text delimiter must be the first and last characters in the column. In all other cases it is treated as column content. If a character other than the column delimiter appears after the second text delimiter in a column, a parsing error occurs.

  • A column may have a maximum of 1,023 characters.

  • A row may be no longer than 16,383 characters.

  • A maximum of 99 columns per row is supported.

  • There is no absolute limit on the number of rows; the number is only practically limited by system memory.

  • Each column must be assigned a unique Element name.

  • Column Grammar Elements must be specified in ascending order by column number.

  • Element names are limited to 127 characters.

  • Path lengths for complete file specifications are limited to 127 characters.

  • Schema location URIs are limited to 127 characters.

  • A maximum of 999 output XML documents from an input CSV file is supported.

  • A maximum of 100 different trading partner destinations in an input CSV file is supported.

Sample Input and Output: Invoice

For our sample CSV file in this chapter we use a document commonly exported in electronic commerce situations from small businesses to larger customers that mandate that documents be exchanged electronically. We use a simple invoice. In this example, each row exported from a desktop bookkeeping and order management system has the same format. Each row represents one item to be invoiced, and we want all the rows for a specific invoice number to be grouped together into one invoice document. In addition, we want the invoices destined for different trading partners to be separated from each other, based on the customer ID column.

Our sample business is Big Daddy's Gourmet Cocoa, a specialty manufacturer of instant hot chocolate beverages that sells primarily to grocery chains. All the columns in our example invoice (Figure) are required except for the second line of the Ship to Street address (column 8) and the Ship to Country (column 12), which is assumed to be the same as Big Daddy's country (in this case, the United States).

Shown below is the sample input invoice that corresponds to the CSV format in Figure. (Note: Line breaks and indentation have been inserted for readability.)

Sample Input CSV File (Invoices.csv)
BQ003,2002041,11/12/2002,AZ999345,12/12/2002,
 "Yazoo Grocers - NE Distribution Center",
 "12 Industrial Parkway, NW","",Portland,ME,04101,,
 HCVAN,12,2.59,"Instant Hot Cocoa Mix - Vanilla flavor",31.08
BQ003,2002041,11/12/2002,AZ999345,12/12/2002,
 "Yazoo Grocers - NE Distribution Center",
 "12 Industrial Parkway, NW","",Portland,ME,04101,,
 HCMIN,24,2.53,"Instant Hot Cocoa Mix - Mint flavor",60.72
BQ003,2002042,11/12/2002,AW999346,12/12/2002,
 "Yazoo Grocers - SE Distribution Center",
 "Dock 37","3975 Hwy 75",Atoka,OK,74525,,
 HCVAN,36,2.59,"Instant Hot Cocoa Mix - Vanilla flavor",93.24
BQ003,2002042,11/12/2002,AW999346,12/12/2002,
 "Yazoo Grocers - SE Distribution Center",
 "Dock 37","3975 Hwy 75",Atoka,OK,74525,,
 HCMIN,72,2.53,"Instant Hot Cocoa Mix - Mint flavor",182.16
AY001,2002043,11/12/2002,2002-0967,12/12/2002,
 "Corner Drug and Sundries",
 "14 Main Street","",Wichita,KS,67201,,
 HCVAN,24,2.59,"Instant Hot Cocoa Mix - Vanilla flavor",62.16
BR095,2002044,11/12/2002,4397-0498,12/12/2002,
 "Big Box Discounters - Store # 97",
 "37 MegaMall","",Azusa,CA,91702,,
 HCMIN,120,2.53,"Instant Hot Cocoa Mix - Mint flavor",303.60
BR095,2002044,11/12/2002,4397-0498,12/12/2002,
 "Big Box Discounters - Store # 97",
 "37 MegaMall","",Azusa,CA,91702,,
 HCVAN,360,2.59,"Instant Hot Cocoa Mix - Vanilla flavor",932.40
BR095,2002044,11/12/2002,4397-0498,12/12/2002,
 "Big Box Discounters - Store # 97",
 "37 MegaMall","",Azusa,CA,91702,,
 HCDUC,240,2.59,"Instant Hot Cocoa Mix - Dutch Chocolate flavor",621.60
BR095,2002045,11/12/2002,4345-0498,12/12/2002,
 "Big Box Discounters - Store # 45",
 "45 Highway 76","",Branson,MO,65615,,
 HCMIN,72,2.53,"Instant Hot Cocoa Mix - Mint flavor",182.16
BR095,2002045,11/12/2002,4345-0498,12/12/2002,
 "Big Box Discounters - Store # 45",
 "45 Highway 76","",Branson,MO,65615,,
 HCDUC,96,2.59,"Instant Hot Cocoa Mix - Dutch Chocolate flavor",248.64
DQ349,2002046,11/12/2002,987-43671,12/12/2002,
 "Maple Leaf Grocers - DC #1",
 "987 Yorkland Blvd","",Willowdale,ON,M2J 4Y8,CAN,
 HCMOC,360,2.69,"Instant Hot Cocoa Mix - Mocha flavor",968.40

Logical Layout for the Invoice

Column Number

Column Name

Data Type

Description

1

Customer Number

Alphanumeric

Identifier we have assigned to the customer in our system

2

Invoice Number

Alphanumeric

System-assigned invoice number

3

Invoice Date

Date

Date of invoice, formatted MM/DD/YYYY

4

PO Number

Alphanumeric

Customer purchase order number

5

Due Date

Date

Date that the invoice amount is due for payment

6

Ship to Name

Alphanumeric

Name of the receiving location for the shipped order

7

Ship to Street 1

Alphanumeric

First address line of the receiving location

8

Ship to Street 2

Alphanumeric

Second address line of the receiving location

9

Ship to City

Alphanumeric

City of the receiving location

10

Ship to State or Province

Alphanumeric

State or province of the receiving location

11

Ship to Postal Code

Alphanumeric

Postal code of the receiving location

12

Ship to Country

Alphanumeric

Country of the receiving location

13

Item ID

Alphanumeric

Our identifier for the ordered item

14

Item Invoiced Quantity

Decimal number

Number of units shipped

15

Item Unit Price

Decimal number

Unit price in U.S. dollars

16

Item Description

Alphanumeric

Description of the ordered item

17

Extended Amount Due

Decimal number

Total amount due for the invoiced item, in U.S. dollars (unit price multiplied by the quantity invoiced)

Successful processing should produce XML documents that look like the following. (I show only the first three.)

Sample Output Documents
<?xml version="1.0" encoding="UTF-8"?>
<Invoice xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="CSVInvoice.xsd">
  <InvoiceLine>
    <CustomerNumber>BQ003</CustomerNumber>
    <InvoiceNumber>2002041</InvoiceNumber>
    <InvoiceDate>2002-11-12</InvoiceDate>
    <PONumber>AZ999345</PONumber>
    <DueDate>2002-12-12</DueDate>
    <ShipToName>Yazoo Grocers - NE Distribution Center
        </ShipToName>
    <ShipToStreet1>12 Industrial Parkway, NW</ShipToStreet1>
    <ShipToCity>Portland</ShipToCity>
    <ShipToStateOrProvince>ME</ShipToStateOrProvince>
    <ShipToPostalCode>04101</ShipToPostalCode>
    <ItemID>HCVAN</ItemID>
    <ItemQuantity>12</ItemQuantity>
    <UnitPrice>2.59</UnitPrice>
    <ItemDescription>Instant Hot Cocoa Mix - Vanilla flavor
        </ItemDescription>
    <ExtendedPrice>31.08</ExtendedPrice>
  </InvoiceLine>
  <InvoiceLine>
    <CustomerNumber>BQ003</CustomerNumber>
    <InvoiceNumber>2002041</InvoiceNumber>
    <InvoiceDate>2002-11-12</InvoiceDate>
    <PONumber>AZ999345</PONumber>
    <DueDate>2002-12-12</DueDate>
    <ShipToName>Yazoo Grocers - NE Distribution Center
        </ShipToName>
    <ShipToStreet1>12 Industrial Parkway, NW</ShipToStreet1>
    <ShipToCity>Portland</ShipToCity>
    <ShipToStateOrProvince>ME</ShipToStateOrProvince>
    <ShipToPostalCode>04101</ShipToPostalCode>
    <ItemID>HCMIN</ItemID>
    <ItemQuantity>24</ItemQuantity>
    <UnitPrice>2.53</UnitPrice>
    <ItemDescription>Instant Hot Cocoa Mix - Mint flavor
        </ItemDescription>
    <ExtendedPrice>60.72</ExtendedPrice>
  </InvoiceLine>
</Invoice>

<?xml version="1.0" encoding="UTF-8"?>
<Invoice xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="CSVInvoice.xsd">
  <InvoiceLine>
    <CustomerNumber>BQ003</CustomerNumber>
    <InvoiceNumber>2002042</InvoiceNumber>
    <InvoiceDate>2002-11-12</InvoiceDate>
    <PONumber>AW999346</PONumber>
    <DueDate>2002-12-12</DueDate>
    <ShipToName>Yazoo Grocers - SE Distribution Center
        </ShipToName>
    <ShipToStreet1>Dock 37</ShipToStreet1>
    <ShipToStreet2>3975 Hwy 75</ShipToStreet2>
    <ShipToCity>Atoka</ShipToCity>
    <ShipToStateOrProvince>OK</ShipToStateOrProvince>
    <ShipToPostalCode>74525</ShipToPostalCode>
    <ItemID>HCVAN</ItemID>
    <ItemQuantity>36</ItemQuantity>
    <UnitPrice>2.59</UnitPrice>
    <ItemDescription>Instant Hot Cocoa Mix - Vanilla flavor
        </ItemDescription>
    <ExtendedPrice>93.24</ExtendedPrice>
  </InvoiceLine>
  <InvoiceLine>
    <CustomerNumber>BQ003</CustomerNumber>
    <InvoiceNumber>2002042</InvoiceNumber>
    <InvoiceDate>2002-11-12</InvoiceDate>
    <PONumber>AW999346</PONumber>
    <DueDate>2002-12-12</DueDate>
    <ShipToName>Yazoo Grocers - SE Distribution Center
        </ShipToName>
    <ShipToStreet1>Dock 37</ShipToStreet1>
    <ShipToStreet2>3975 Hwy 75</ShipToStreet2>
    <ShipToCity>Atoka</ShipToCity>
    <ShipToStateOrProvince>OK</ShipToStateOrProvince>
    <ShipToPostalCode>74525</ShipToPostalCode>
    <ItemID>HCMIN</ItemID>
    <ItemQuantity>72</ItemQuantity>
    <UnitPrice>2.53</UnitPrice>
    <ItemDescription>Instant Hot Cocoa Mix - Mint flavor
        </ItemDescription>
    <ExtendedPrice>182.16</ExtendedPrice>
  </InvoiceLine>
</Invoice>

<?xml version="1.0" encoding="UTF-8"?>
<Invoice xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="CSVInvoice.xsd">
  <InvoiceLine>
    <CustomerNumber>AY001</CustomerNumber>
    <InvoiceNumber>2002043</InvoiceNumber>
    <InvoiceDate>2002-11-12</InvoiceDate>
    <PONumber>2002-0967</PONumber>
    <DueDate>2002-12-12</DueDate>
    <ShipToName>Corner Drug and Sundries</ShipToName>
    <ShipToStreet1>14 Main Street</ShipToStreet1>
    <ShipToCity>Wichita</ShipToCity>
    <ShipToStateOrProvince>KS</ShipToStateOrProvince>
    <ShipToPostalCode>67201</ShipToPostalCode>
    <ItemID>HCVAN</ItemID>
    <ItemQuantity>24</ItemQuantity>
    <UnitPrice>2.59</UnitPrice>
    <ItemDescription>Instant Hot Cocoa Mix - Vanilla flavor
        </ItemDescription>
    <ExtendedPrice>62.16</ExtendedPrice>
  </InvoiceLine>
</Invoice>

     Python   SQL   Java   php   Perl 
     game development   web development   internet   *nix   graphics   hardware 
     telecommunications   C++ 
     Flash   Active Directory   Windows