Jan. 14, 2011, 3:15 a.m.
posted by randomuser
The quoted printable encoding was designed for e-mail, but it works surprisingly well in XML. It can encode any ASCII or UTF-8 text, including characters that are not normally allowed in XML documents, such as form feed, vertical tab, and null, as well as the reserved XML characters less than sign (<), greater than sign (>), and ampersand (&). Most importantly, most ASCII text stays intact and can be read in a normal browser. Only the problematic characters need to be escaped.
The basic algorithm uses just five rules.
For example, suppose you're extracting text from a database that allows nulls, form feeds, and other C0 controls in the strings. You might, for instance, have a field like this:
"Wong, Gao Yin (ACF)"\00<[email protected]>\0C
(Naturally, I can't print the nonprinting control characters in this book, so they're represented above as a backslash followed by two hexadecimal digits.)
Leaving aside for a second the fact that this field violates just about every normalization rule in the book, it could easily be encoded in quoted printable like this when exported to XML:
"Wong, Gao Yin (ACF)"=00=3C[email protected]=3E=0C
Of course, most of your data will not need to be encoded this way. The advantage of quoted printable encoding is that it leaves most characters intact most of the time. The result may not be perfectly legible as ASCII, but it's normally possible to make sense of it.
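To make the example concrete, here is a minimal hand-rolled sketch in Python. It is not the full algorithm described in the text: it simply writes every byte outside printable ASCII, plus the equals sign and the three reserved XML characters, as an equals sign followed by two uppercase hexadecimal digits. It also ignores the line-length limits and soft line breaks that mail-oriented encoders apply. The function name qp_escape_for_xml and the address user@example.com are placeholders invented for illustration.

```python
# '=' is the escape character itself; '<', '>' and '&' are reserved by XML.
RESERVED = b"=<>&"


def qp_escape_for_xml(text: str) -> str:
    """Escape only the problematic bytes of text as =XX (uppercase hex).

    Simplifying assumption: every byte outside printable ASCII
    (0x20 through 0x7E) is escaped, including tabs and line breaks.
    """
    out = []
    for byte in text.encode("utf-8"):
        printable = 0x20 <= byte <= 0x7E
        if printable and byte not in RESERVED:
            out.append(chr(byte))          # ordinary text passes through as-is
        else:
            out.append("=%02X" % byte)     # e.g. NUL -> =00, '<' -> =3C
    return "".join(out)


# Placeholder field modelled on the example above (NUL and form feed included).
field = '"Wong, Gao Yin (ACF)"\x00<user@example.com>\x0c'
print(qp_escape_for_xml(field))
# prints: "Wong, Gao Yin (ACF)"=00=3Cuser@example.com=3E=0C
```

Note how the name, the quotation marks, and the address all survive unchanged; only the four problematic characters turn into escapes.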
Most major environments have quoted printable encoders and decoders hiding somewhere, often in mail APIs, so you don't have to roll your own. For instance, in Java the javax.mail.internet.MimeUtility class from the JavaMail API, a standard extension to Java, can encode and decode quoted printable strings, as well as several other encodings. Perl has the MIME::QuotedPrint module. Python has the quopri module with encode and decode functions (the older mimetools module offered the same thing but was removed in Python 3). Microsoft hasn't added this functionality to the standard .NET library yet, but several third-party libraries are available that implement this algorithm. If you use these instead of writing your own (which isn't hard), you'll also need to further escape the less than, greater than, and ampersand characters because the standard encoders likely won't do this for you. (The standard decoders should have no trouble reversing the process though.)
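As a rough illustration of that last point, the following sketch uses Python's standard quopri module and then escapes the three reserved XML characters in a second pass. The helper names encode_for_xml and decode_from_xml and the sample address are hypothetical. It assumes the text is encoded as UTF-8 first, and it leaves alone whatever quopri leaves alone, such as tabs and line breaks.

```python
import quopri

# Reserved XML characters that quopri does not escape, mapped to =XX form.
XML_RESERVED = {b"&": b"=26", b"<": b"=3C", b">": b"=3E"}


def encode_for_xml(text: str) -> str:
    """Quoted-printable encode text, then escape &, < and > as well."""
    encoded = quopri.encodestring(text.encode("utf-8"))
    # Safe to do afterwards: every literal '=' is already '=3D' by now,
    # so these replacements cannot collide with existing escapes.
    for raw, escaped in XML_RESERVED.items():
        encoded = encoded.replace(raw, escaped)
    return encoded.decode("ascii")


def decode_from_xml(encoded: str) -> str:
    """Reverse encode_for_xml using the stock decoder, with no extra step."""
    return quopri.decodestring(encoded.encode("ascii")).decode("utf-8")


field = '"Wong, Gao Yin (ACF)"\x00<user@example.com>\x0c'
safe = encode_for_xml(field)
print(safe)
print(decode_from_xml(safe) == field)
```

The decoder needs no special handling because =26, =3C, and =3E are ordinary quoted printable escapes; only the encoding side needs the extra pass.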
When should you not use quoted printable? I can think of two situations.