April 28, 2011, 12:04 p.m.
posted by blackhat
The SQL Standard provides four data types for columns that contain character string data: CHAR, VARCHAR, NATIONAL CHARACTER (or NCHAR), and NATIONAL CHARACTER VARYING (or NCHAR VARYING). DBMSs that support these data types may set their own maximum lengths for each. Figure shows the SQL Standard requirements and the level of support (data type and maximum size in characters or bytes) the Big Eight have for character data types, while Figure shows how the SQL Standard and the DBMSs treat trailing spaces and data truncation for character columns.
Notes on Figure:
Notes on Figure:
Because we're trying to determine whether it's best to define columns with a fixed-size data type or a variable-length data type, we'll ignore the distinctions between, say, CHAR and NCHAR and concentrate on the question—Should CHAR or VARCHAR be used to define columns? The main difference between the two is that CHAR is fixed-size while VARCHAR is variable-length—a consideration we've already dealt with. But there are five other differences to consider as well.
One: Changes to maximum size
Until recently, VARCHAR's maximum length was often shorter than CHAR's. For example, in Microsoft 6.5, a VARCHAR column couldn't be longer than 255 characters. If you're upgrading an old application, it's time to see whether more columns should be VARCHAR now.
Two: Trailing spaces
If you insert this string:
into a CHAR(5) column, the DBMS will pad with four trailing spaces and store this string:
On the other hand, if you insert:
into a VARCHAR(5) column, an ANSI SQL DBMS will keep the trailing spaces and store:
That is, a VARCHAR column does not lose meaningful information about trailing spaces. However, several DBMSs are not SQL Standard-compliant (see Figure). With those DBMSs, it's impossible to determine how many spaces were inserted originally.
Because of the space padding requirement, concatenating CHAR columns involves extra work. For example, suppose you have two columns, forename and surname, each defined as CHAR(20). The concatenation of forename with surname, for example:
SELECT forename || surname FROM Table1 ...
would result in a string that looks like this (except with MySQL and Sybase):
'Paul Larue '
Probably not the result you're looking for! To get the DBMS to return the logical result:
you'd need to amend your query to:
SELECT TRIM(forename) || ' ' || TRIM(surname) FROM Table1 ...
Four: Data type conversion
CHAR and VARCHAR are two different data types, so expressions of the form char_column = varchar_column involve data type conversions. Some DBMSs won't use indexes if conversions are necessary, so be consistent when defining columns that might be compared or joined.
If you make a mistake and insert ABCDE into a CHAR(4) column, don't expect to see an error or warning message. Some DBMSs will silently truncate your string to ABCD and you'll lose possibly vital data (see Figure).
Whether you choose to use CHAR or VARCHAR, you'll also have to decide on a length specification for the column. When defining length, most people err on the side of caution. For example, they find the longest value in the original input data and then double that value for the resulting column length. Here are some length-specification hints, relevant to columns that contain names:
Variant Character Sets
You can use CHARACTER SET clauses to support variant character sets, but it's more convenient to use this convention:
The obvious alternate NCHAR/NCHAR VARYING set is Unicode, and that's what you'll always get with Microsoft, Oracle, and Sybase. Unicode is the Java default, so conversions to/from Java host variables are easy, but for most ODBC and ActiveX Data Objects (ADO) drivers, there's extra overhead. Here are some other Unicode considerations:
Speaking of collation—the fastest collation is binary. A binary collation is usually managed by either defining a column with a FOR BIT DATA/BINARY attribute or by forcing a binary collation in some other way—preferably with the ANSI SQL COLLATE clause. (For a discussion of collations, see Chapter 3, "ORDER BY.") You should not, however, use a binary collation if there are Unicode characters and the storage is big-endian.
If you do use a national-language character set, you'll be happy to know that all DBMSs except InterBase support the LOWER and UPPER functions correctly despite a strange SQL-92 requirement that only "simple Latin letters" should be converted.
The Bottom Line: Characters
Some DBMSs won't give you an error or warning message if you insert a too-large value into a character column. They'll just silently truncate, and you'll lose possibly vital data.
Use CHAR/VARCHAR for the basic Western European character set (ISO 8859-1 or Windows 1252). Use NCHAR/NCHAR VARYING for the alternate national-language set, which will often be Unicode.
In the past, VARCHAR's maximum length was often shorter than CHAR's. If you're upgrading an old application, it's time to see whether more columns should be VARCHAR now.
An SQL-compliant DBMS will pad with trailing spaces, if necessary, when inserting into a CHAR column. It will not strip trailing spaces when inserting into a VARCHAR column, so a VARCHAR column does not lose meaningful information about trailing spaces.
Because of the space padding, concatenating CHAR columns involves extra work.
Expressions of the form char_column = varchar_column involve data type conversions. Be data type consistent when defining columns that might be compared or joined.
Recommendation: Prefer VARCHAR/NCHAR VARYING for character string data.