Quoted Printable

The quoted printable encoding was designed for e-mail, but it works surprisingly well in XML. It can encode any ASCII or UTF-8 text, including characters that are not normally allowed in XML documents, such as form feed, vertical tab, and null, as well as the reserved XML characters less than sign (<), greater than sign (>), and ampersand (&). Best of all, most ASCII text passes through intact and can still be read in a normal browser. Only the problematic characters need to be escaped.

The basic algorithm uses just five rules.

  1. Each byte can be encoded as an equals sign (=) followed by two hexadecimal digits; for example, a less than sign can be represented as =3C. Only capital letters are used for the hexadecimal digits A–F.

  2. The bytes with values 33 through 60, inclusive, and 62 through 126, inclusive, may be represented as the corresponding ASCII characters. (Byte 61 is the equals sign itself, which must always be written as =3D.) However, for use with XML we need to modify that a bit and always encode 60 (the less than sign) and 62 (the greater than sign) as =3C and =3E, respectively. The ampersand must be encoded as =26. Note that it is always acceptable to encode a character in the equals sign form even if you don't have to.

  3. The ASCII tab (0x09) and ASCII space (0x20) can be written literally except at the end of a line, where they must be escaped as =09 and =20, respectively.

  4. All line breaks are replaced by carriage return–line feed pairs.

  5. Each line can be no longer than 76 characters. You can indicate that a line should be continued on the next line by adding a single equals sign at the very end of the line. If you do, nothing can follow it, not even an extra space. The equals sign must be the very last character on the line.
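
The five rules translate fairly directly into code. What follows is only a minimal sketch in Python, not a reference implementation: the function name qp_encode_for_xml and the max_len parameter are my own inventions for illustration, and it assumes the input is text, so bare carriage returns and line feeds are treated as line breaks rather than as data.

```python
# Bytes that the XML variant of the rules always escapes: <, >, and &
XML_RESERVED = {0x3C, 0x3E, 0x26}


def qp_encode_for_xml(data: bytes, max_len: int = 76) -> str:
    """Encode bytes as quoted printable, also escaping <, >, and &."""
    out_lines = []
    # Rule 4: treat CRLF, CR, and LF alike; lines are rejoined with CRLF below.
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    for raw in normalized.split(b"\n"):
        tokens = []
        for i, b in enumerate(raw):
            at_line_end = i == len(raw) - 1
            if b in (0x09, 0x20):
                # Rule 3: tab and space are literal except at the end of a line.
                tokens.append("=%02X" % b if at_line_end else chr(b))
            elif 33 <= b <= 126 and b != 0x3D and b not in XML_RESERVED:
                # Rule 2: printable ASCII other than =, <, >, and & stays literal.
                tokens.append(chr(b))
            else:
                # Rule 1: everything else becomes = plus two uppercase hex digits.
                tokens.append("=%02X" % b)
        # Rule 5: insert a soft line break (a trailing =) before a line would
        # exceed max_len characters. An =XX token is never split in two.
        line = ""
        for tok in tokens:
            if len(line) + len(tok) > max_len - 1:
                out_lines.append(line + "=")
                line = ""
            line += tok
        out_lines.append(line)
    return "\r\n".join(out_lines)


print(qp_encode_for_xml(b"a<b>&c\x00\x0c"))  # a=3Cb=3E=26c=00=0C
```

Building each line from whole tokens keeps a soft line break from landing in the middle of an escape such as =3C, which a decoder would not be able to reassemble.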

For example, suppose you're extracting text from a database that allows nulls, form feeds, and other C0 controls in the strings. You might, for instance, have a field like this:

"Wong, Gao Yin (ACF)"\00<[email protected]>\0C

(Naturally, I can't print the nonprinting control characters in this book, so they're represented above as a backslash followed by two hexadecimal digits.)

Leaving aside for a second the fact that this field violates just about every normalization rule in the book, it could easily be encoded in quoted printable like this when exported to XML:

"Wong, Gao Yin (ACF)"=00=3C[email protected]=3E=0C
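
One way to sanity-check an encoding like this is to run it back through a decoder. Here is a small sketch using the decodestring function in Python's standard quopri module. The address user@example.com is a placeholder I made up for the example; it is not the one in the field above.

```python
import quopri

# An encoded field of the same shape, with a placeholder e-mail address.
encoded = b'"Wong, Gao Yin (ACF)"=00=3Cuser@example.com=3E=0C'

# Decoding turns each =XX escape back into the byte it stands for.
decoded = quopri.decodestring(encoded)
print(decoded)  # b'"Wong, Gao Yin (ACF)"\x00<user@example.com>\x0c'
```

The null, the angle brackets, and the form feed all come back exactly as they went in.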

Of course, most of your data will not need to be so encoded. The advantage to quoted printable encoding is that it leaves most characters intact most of the time. The result may not be perfectly legible as ASCII, but it's normally possible to make sense out of it.

Most major environments have quoted printable encoders and decoders hiding somewhere, often in mail APIs, so you don't have to roll your own. For instance, in Java the javax.mail.internet.MimeUtility class from the JavaMail API, a standard extension to Java, can encode and decode quoted printable strings, as well as several other encodings. Perl has the MIME::QuotedPrint module. Python has the quopri module with encode and decode functions (the older mimetools module, which offered the same functions, is gone from Python 3). Microsoft hasn't added this functionality to the standard .NET library yet, but several third-party libraries are available that implement this algorithm. If you use these instead of writing your own (which isn't hard), you'll also need to further escape the less than, greater than, and ampersand characters because the standard encoders likely won't do this for you. (The standard decoders should have no trouble reversing the process though.)
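
As a rough illustration of that extra escaping step, here is a sketch that wraps Python's standard quopri module. The helper name qp_encode_xml_safe and the sample data are hypothetical. One caveat I'm assuming you can live with: replacing a one-character symbol with a three-character escape after the fact can push a line slightly past the 76-character limit.

```python
import quopri


def qp_encode_xml_safe(data: bytes) -> bytes:
    """Quoted printable encode, then escape the XML reserved characters."""
    encoded = quopri.encodestring(data)
    # The library has already turned every literal = into =3D, and none of the
    # replacement strings contains &, <, or >, so the order does not matter.
    return (encoded.replace(b"&", b"=26")
                   .replace(b"<", b"=3C")
                   .replace(b">", b"=3E"))


raw = b'"Wong"\x00<user@example.com>\x0c & more'  # placeholder data
safe = qp_encode_xml_safe(raw)

# An ordinary decoder reverses both steps at once.
assert quopri.decodestring(safe) == raw
```

Because the extra escapes are ordinary quoted printable, nothing special is needed on the decoding side.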

When should you not use quoted printable? I can think of two situations.

  1. When the only problematic characters are the reserved characters (less than sign, greater than sign, and ampersand). In these cases you're better off using entity references like &lt; or CDATA sections instead. Quoted printable becomes important when the data is likely to contain control characters like null and vertical tab.

  2. When the data is pure binary rather than text, such as a JPEG image. In this case, Base64 will be smaller and no more opaque.
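
The size difference on binary data is easy to measure. This sketch uses Python's standard quopri and base64 modules on a synthetic 1,024-byte sample containing every byte value equally often; the sample is a stand-in I chose for illustration, not real image data.

```python
import base64
import quopri

# Synthetic "binary" sample: every byte value from 0 to 255, four times over.
data = bytes(range(256)) * 4

qp = quopri.encodestring(data)
b64 = base64.b64encode(data)

# Base64 always spends four characters per three bytes. Quoted printable
# spends three characters on every byte outside the printable ASCII range.
print(len(data), len(qp), len(b64))
```

On data like this, where well over half the bytes need escaping, the quoted printable output comes out considerably larger than the Base64 output.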
