Converting from One Encoding to Another

Credit: Mauro Cicio

Problem

You want to convert a document to a given charset encoding (probably UTF-8).

Solution

If you don't know the document's current encoding, you can guess at it using the Charguess library described in the previous recipe. Once you know the current encoding, you can convert the document to another encoding using Ruby's standard iconv library.

Here's an XML document written in Italian, with no explicit encoding:

	doc = %{<?xml version="1.0"?>
	<menu tipo="specialità" giorno="venerdì">
	  <primo_piatto>spaghetti al ragù</primo_piatto>
	  <bevanda>frappè</bevanda>
	</menu>}

Let's figure out its encoding and convert it to UTF-8:

	require 'iconv'
	require 'charguess' # not necessary if input encoding is known

	input_encoding = CharGuess::guess(doc)      # => "windows-1252"
	output_encoding = 'utf-8'

	# Iconv.new takes the target encoding first, then the source encoding.
	converted_doc = Iconv.new(output_encoding, input_encoding).iconv(doc)

	CharGuess::guess(converted_doc)             # => "UTF-8"

Discussion

The heart of the iconv library is the Iconv class, a wrapper for the UNIX 95 iconv() family of functions. These functions translate strings between encodings. Iconv shipped with the Ruby standard library through Ruby 1.9, so on those versions it should already be available on your system; from Ruby 2.0 onward it is distributed as the separate iconv gem.
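If iconv is not installed, the same kind of conversion can be sketched with String#encode, which is built into Ruby itself. This is an alternative to the recipe's Iconv call, not part of it, and it assumes Ruby 2.0 or later (for String#b). The sample bytes are an illustration: the word "ragù" as it would be stored in Windows-1252.

```ruby
# Sketch assuming Ruby 2.0 or later; String#encode stands in for Iconv here.
# Bytes of "ragù" in Windows-1252; the final byte 0xF9 is "ù".
latin_bytes = "rag\xF9".b

# encode takes (target, source), the same order as Iconv.new.
converted = latin_bytes.encode('UTF-8', 'Windows-1252')

puts converted.encoding           # prints UTF-8
puts converted == "rag\u00F9"     # prints true
```

The argument order matches Iconv.new(output_encoding, input_encoding), so the guessed encoding from Charguess can be passed as the second argument.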

Iconv works well in conjunction with Charguess: even when Charguess guesses slightly wrong (such as guessing Windows-1252 for an ISO-8859-1 document), its guess is usually close enough for iconv to convert the document correctly.
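To see why a Windows-1252 guess is harmless for an ISO-8859-1 document like this one, compare the two decodings directly. The following sketch uses String#encode rather than Iconv (an assumption made so that it runs on a Ruby without the iconv gem) and sample bytes chosen for illustration. The accented letters decode identically; the two encodings disagree only on bytes 0x80 through 0x9F.

```ruby
# "venerdì" as single-byte Western European bytes (0xEC is "ì").
raw = "venerd\xEC".b

from_cp1252 = raw.encode('UTF-8', 'Windows-1252')
from_latin1 = raw.encode('UTF-8', 'ISO-8859-1')
puts from_cp1252 == from_latin1      # prints true

# The two encodings disagree only on bytes 0x80..0x9F.
byte = "\x80".b
euro    = byte.encode('UTF-8', 'Windows-1252')   # the euro sign, U+20AC
control = byte.encode('UTF-8', 'ISO-8859-1')     # a C1 control character, U+0080
puts euro == control                 # prints false
```

A wrong guess between these two encodings therefore matters only when the document contains characters such as the euro sign or curly quotes.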

Like Charguess, the Iconv library is not XML- or HTML-specific. You can use libcharguess and iconv together to convert an arbitrary string to a given encoding.
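For arbitrary strings, a guessed encoding may not fit every byte, so it can help to decide in advance what happens to bytes that cannot be converted. Here is a minimal sketch of such a helper. The name to_utf8 is hypothetical, and it relies on String#encode (Ruby 2.0 or later) instead of Iconv; unconvertible bytes are replaced with "?" rather than raising an exception.

```ruby
# Hypothetical helper: convert any string to UTF-8, given its source encoding.
# Bytes that cannot be converted become "?" instead of raising an exception.
def to_utf8(text, source_encoding)
  text.encode('UTF-8', source_encoding,
              invalid: :replace, undef: :replace, replace: '?')
end

plain = to_utf8("frapp\xE8".b, 'ISO-8859-1')
puts plain == "frapp\u00E8"          # prints true

# 0xE8 is not a legal US-ASCII byte, so it is replaced rather than converted.
damaged = to_utf8("frapp\xE8".b, 'US-ASCII')
puts damaged.valid_encoding?         # prints true
```

Replacing bad bytes loses information, so a stricter caller might prefer to drop the replacement options and rescue the resulting Encoding exceptions instead.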

See Also


