Guessing a Document's Encoding

Credit: Mauro Cicio

Problem

You want to know the character encoding of a document that doesn't declare it explicitly.

Solution

Use the Ruby bindings to the libcharguess library. Once it's installed, using libcharguess is very simple.

Here's an XML document written in Italian, with no explicit encoding:

	doc = %{<?xml version="1.0"?>
	<menu tipo="specialità" giorno="venerdì">
	  <primo_piatto>spaghetti al ragù</primo_piatto>
	  <bevanda>frappè</bevanda>
	</menu>}

Let's find its encoding:

	require 'charguess'

	CharGuess::guess doc
	# => "windows-1252"

This is a pretty good guess: the XML is written in the ISO-8859-1 encoding, and many web browsers treat ISO-8859-1 as Windows-1252.
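The two labels are interchangeable for this document because ISO-8859-1 and Windows-1252 differ only in the byte range 0x80 to 0x9F, and the accented Italian letters all fall outside it. The following is a minimal sketch of that point using only Ruby's built-in String encoding support (Ruby 1.9 or later); the helper names `same_latin_bytes?` and `decode_as` are made up for this illustration and are not part of libcharguess.

```ruby
# Returns true when +text+ encodes to the same bytes in ISO-8859-1
# and in Windows-1252.
def same_latin_bytes?(text)
  text.encode("ISO-8859-1").bytes == text.encode("Windows-1252").bytes
end

# Interprets raw bytes under the given encoding label and returns UTF-8 text.
def decode_as(raw_bytes, label)
  raw_bytes.dup.force_encoding(label).encode("UTF-8")
end

word = "specialit\u00E0"             # "specialità"
raw  = word.encode("ISO-8859-1").b   # raw bytes, as if read from a file

puts same_latin_bytes?(word)                  # prints true
puts decode_as(raw, "Windows-1252") == word   # prints true
```

Decoding the bytes under the guessed label recovers the original text, which is all a guess needs to achieve in practice.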

Discussion

In XML, the character-encoding indication is optional, and may be provided as an attribute of the XML declaration in the first line of the document:

	<?xml version="1.0" encoding="utf-8"?>

If this is missing, you must guess the document encoding to process the document. You can assume the lowest common denominator for your community (usually this means assuming that everything is either UTF-8 or ISO-8859-1), or you can use a library that examines the document and uses heuristics to guess the encoding.
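The "lowest common denominator" approach can be sketched without any external library: look for a declared encoding first, and if there is none, assume UTF-8 when the bytes are valid UTF-8 and ISO-8859-1 otherwise. The method names below (`declared_encoding`, `assumed_encoding`, `document_encoding`) are hypothetical, and the regular expression is a simplification that only handles a declaration at the very start of the document.

```ruby
# Returns the encoding named in the XML declaration, or nil if none is given.
def declared_encoding(doc)
  head = doc.b[0, 200]   # the declaration, if present, opens the document
  match = head.match(/\A<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z0-9._-]+)["']/)
  match && match[1]
end

# Fallback assumption: UTF-8 if the bytes are valid UTF-8, else ISO-8859-1.
def assumed_encoding(doc)
  doc.dup.force_encoding("UTF-8").valid_encoding? ? "UTF-8" : "ISO-8859-1"
end

# Prefers the declared encoding and falls back to the assumption.
def document_encoding(doc)
  declared_encoding(doc) || assumed_encoding(doc)
end

puts document_encoding('<?xml version="1.0" encoding="utf-8"?><menu/>')
puts document_encoding("<?xml version=\"1.0\"?><p>rag\xF9</p>".b)
```

This fallback never fails, because every byte sequence is valid ISO-8859-1, but for the same reason it cannot tell ISO-8859-1 apart from other single-byte encodings. That gap is what a heuristic library is meant to fill.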

At the time of writing, there are no pure Ruby libraries for guessing the encoding of a document. Fortunately, there is a small Ruby wrapper around the libcharguess library. This library can guess with 95% accuracy the encoding of any text whose charset is one of the following: BIG5, HZ, JIS, SJIS, EUC-JP, EUC-KR, EUC-TW, GB2312, Bulgarian, Cyrillic, Greek, Hungarian, Thai, Latin1, and UTF8.

Note that libcharguess is not XML- or HTML-specific. In fact, it can guess the encoding of an arbitrary string:

	CharGuess::guess("\xA4\xCF")               # => "EUC-JP"
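A guess is needed here because the same two bytes are well-formed in more than one encoding. The sketch below does not reproduce libcharguess's heuristics; it only uses Ruby's built-in `valid_encoding?` check to show which candidate labels the bytes could legally carry. The helper name `plausible_encodings` and the candidate list are chosen for this illustration.

```ruby
# Returns the candidate encodings under which +bytes+ is well-formed.
def plausible_encodings(bytes, candidates)
  candidates.select { |name| bytes.dup.force_encoding(name).valid_encoding? }
end

bytes = "\xA4\xCF"

p plausible_encodings(bytes, %w[UTF-8 EUC-JP ISO-8859-1])
# prints ["EUC-JP", "ISO-8859-1"]

# The same bytes decode to different text under each plausible label.
as_euc   = bytes.dup.force_encoding("EUC-JP").encode("UTF-8")      # hiragana "ha"
as_latin = bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")  # two Latin-1 symbols
puts as_euc == "\u306F"
puts as_latin == "\u00A4\u00CF"
```

Validity checks alone can rule out UTF-8 here, but choosing between the remaining candidates takes statistical knowledge of which byte patterns are likely in real text.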

It's fairly easy to install libcharguess, since the library is written in portable C++. Unfortunately, it doesn't take care to put its header files in a standard location. This makes it a little tricky to compile the Ruby bindings, which depend on the charguess.h header. When you run extconf.rb to prepare the bindings, you must explicitly tell the script where to find libcharguess's headers. Here's how you might compile the Ruby bindings to libcharguess:

	$ ruby extconf.rb --with-charguess-include=/location/of/charguess.h
	$ make
	$ make install
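Because the bindings depend on a compiled library that may not be present on every machine, code that uses them may want to degrade gracefully. Here is one possible sketch: it rescues the `LoadError` raised when the `charguess` extension is missing and falls back to a caller-supplied default. The method `guess_or_default` is hypothetical, and the `|| default` guard assumes, without confirmation from the library's documentation, that `CharGuess::guess` could return nil when it has no answer.

```ruby
begin
  require 'charguess'
rescue LoadError
  # The bindings are not installed; guess_or_default will use its default.
end

# Returns libcharguess's guess when the bindings are available,
# otherwise the supplied default encoding name.
def guess_or_default(doc, default = "ISO-8859-1")
  return default unless defined?(CharGuess)
  CharGuess.guess(doc) || default   # assumption: guess may return nil
end

puts guess_or_default("<menu>frapp\xE8</menu>")
```

On a machine without the bindings this prints the default, so the rest of the program can keep working with the fallback assumption.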
