Characters





Characters

The SQL Standard provides four data types for columns that contain character string data: CHAR, VARCHAR, NATIONAL CHARACTER (or NCHAR), and NATIONAL CHARACTER VARYING (or NCHAR VARYING). DBMSs that support these data types may set their own maximum lengths for each. Figure shows the SQL Standard requirements and the level of support (data type and maximum size in characters or bytes) the Big Eight have for character data types, while Figure shows how the SQL Standard and the DBMSs treat trailing spaces and data truncation for character columns.

Notes on Figure:

  • The figures in Figure show the maximum-possible defined length of an SQL character column. For example, the largest CHAR column possible with Ingres is CHAR(2000), which lets you insert strings up to 2,000 characters long, the largest possible VARCHAR column with MySQL is VARCHAR(255), which lets you insert strings up to 255 characters long, and so on. The maximum length often depends on the page size, as with Sybase. The SQL Standard lets vendors set the maximum sizes of each data type.

    ANSI/DBMS Support for Character Data Types
      CHAR VARCHAR NCHAR NCHAR VARYING Other
    ANSI SQL Yes Yes Yes Yes No
    IBM 254 bytes 32KB No No No
    Informix 32KB 255 bytes 32KB 255 bytes 2KB
    Ingres 2KB 2KB No No No
    InterBase 32KB 32KB 32KB 32KB No
    Microsoft 8KB 8KB 8KB 8KB No
    MySQL 255 bytes 255 bytes 255 bytes 255 bytes 64KB
    Oracle 2KB 4KB 2KB 4KB No
    Sybase 16KB 16KB 16KB 16KB No

  • The defined length and the length in bytes is the same if—and only if—the character size is one byte. This, of course, depends on the character set being used.

  • Informix calls the NCHAR VARYING data type NVARCHAR. Informix also supports LVARCHAR, for variable-length character data up to 2KB long.

  • MySQL also supports BLOB (for case-sensitive character values larger than the CHAR/VARCHAR maximum) and TEXT (for case-insensitive character values larger than the CHAR/VARCHAR maximum). BLOB and TEXT maxima are 65,535 characters.

Notes on Figure:

  • Strips Trailing Space from VARCHAR column

    This column is "No" if the DBMS follows the SQL Standard and does not strip trailing spaces from a string assigned to a variable-length column.

    ANSI/DBMS Support for Character String Operations
      Strips Trailing Space from VARCHAR Warning on Truncate
    ANSI SQL No Yes
    IBM No Yes
    Informix No No
    Ingres No Yes
    InterBase No Yes
    Microsoft No Yes
    MySQL Yes No
    Oracle No Yes
    Sybase Yes No

    • Informix does not strip trailing spaces on INSERT, but it does strip them when you SELECT from a VARCHAR column; not a very useful trait.

  • Warning on Truncate column

    This column is "Yes" if the DBMS follows the SQL Standard and returns a truncation warning or error if you assign a too-long string to a character column.

    • Sybase will provide a truncation error if SET STRING_RTRUNCATION is on.

Because we're trying to determine whether it's best to define columns with a fixed-size data type or a variable-length data type, we'll ignore the distinctions between, say, CHAR and NCHAR and concentrate on the question—Should CHAR or VARCHAR be used to define columns? The main difference between the two is that CHAR is fixed-size while VARCHAR is variable-length—a consideration we've already dealt with. But there are five other differences to consider as well.

One: Changes to maximum size

Until recently, VARCHAR's maximum length was often shorter than CHAR's. For example, in Microsoft 6.5, a VARCHAR column couldn't be longer than 255 characters. If you're upgrading an old application, it's time to see whether more columns should be VARCHAR now.

Two: Trailing spaces

If you insert this string:


'X'

into a CHAR(5) column, the DBMS will pad with four trailing spaces and store this string:


'X    '

On the other hand, if you insert:


'X  '

into a VARCHAR(5) column, an ANSI SQL DBMS will keep the trailing spaces and store:


'X  '

That is, a VARCHAR column does not lose meaningful information about trailing spaces. However, several DBMSs are not SQL Standard-compliant (see Figure). With those DBMSs, it's impossible to determine how many spaces were inserted originally.

Three: Concatenation

Because of the space padding requirement, concatenating CHAR columns involves extra work. For example, suppose you have two columns, forename and surname, each defined as CHAR(20). The concatenation of forename with surname, for example:


SELECT forename || surname

  FROM Table1 ...

would result in a string that looks like this (except with MySQL and Sybase):


'Paul             Larue        '

Probably not the result you're looking for! To get the DBMS to return the logical result:


'Paul Larue'

you'd need to amend your query to:


SELECT TRIM(forename) || ' ' || TRIM(surname)

   FROM Table1 ...

Four: Data type conversion

CHAR and VARCHAR are two different data types, so expressions of the form char_column = varchar_column involve data type conversions. Some DBMSs won't use indexes if conversions are necessary, so be consistent when defining columns that might be compared or joined.

Five: Truncation

If you make a mistake and insert ABCDE into a CHAR(4) column, don't expect to see an error or warning message. Some DBMSs will silently truncate your string to ABCD and you'll lose possibly vital data (see Figure).

Length Specification

Whether you choose to use CHAR or VARCHAR, you'll also have to decide on a length specification for the column. When defining length, most people err on the side of caution. For example, they find the longest value in the original input data and then double that value for the resulting column length. Here are some length-specification hints, relevant to columns that contain names:

  • The longest name in the Bible is Maher-shalal-hash-baz (21 characters; Isaiah 8:1).

  • The longest legal URL is 63 characters.

  • The longest taxonomic name, Archaeosphaerodiniopsidaceae, has 28 characters. The famous long words antidisestablishmentarianism and floccinaucinihilipilification have 28 and 29 characters, respectively.

  • The names of American states and Canadian provinces can be abbreviated to two-letter codes, and the names of countries can be abbreviated using the three-letter ISO codes for nations. (It's best not to use the ISO two-letter nation codes because of duplication that could cause confusion. For example, the two-letter ISO code for the Cayman Islands is KY, which is also the abbreviation for Kentucky.)

  • That famous long Welsh name, Llanfairpwllgwyngyllgogerychwyrndrobwll Llantysiliogogogoch, can be abbreviated to Llanfair PG.

Variant Character Sets

You can use CHARACTER SET clauses to support variant character sets, but it's more convenient to use this convention:

  • Use CHAR/VARCHAR for the basic Western European character set (variously called ISO 8859-1 or Windows 1252 or Latin1).

  • Use NCHAR/NCHAR VARYING for the alternate national-language set.

The obvious alternate NCHAR/NCHAR VARYING set is Unicode, and that's what you'll always get with Microsoft, Oracle, and Sybase. Unicode is the Java default, so conversions to/from Java host variables are easy, but for most ODBC and ActiveX Data Objects (ADO) drivers, there's extra overhead. Here are some other Unicode considerations:

  • There are actually two different Unicode sets. One uses two bytes for each character; the other uses up to three bytes per character.

  • With Intel processors, 16-bit operations are penalized. For example it takes about three times longer to compare two 16-bit values than to compare two 8-bit values.

  • Responsible DBMSs will use Unicode for system table names.

  • There is no guarantee that you can use collation options with Unicode.

Speaking of collation—the fastest collation is binary. A binary collation is usually managed by either defining a column with a FOR BIT DATA/BINARY attribute or by forcing a binary collation in some other way—preferably with the ANSI SQL COLLATE clause. (For a discussion of collations, see Chapter 3, "ORDER BY.") You should not, however, use a binary collation if there are Unicode characters and the storage is big-endian.

If you do use a national-language character set, you'll be happy to know that all DBMSs except InterBase support the LOWER and UPPER functions correctly despite a strange SQL-92 requirement that only "simple Latin letters" should be converted.

The Bottom Line: Characters

Some DBMSs won't give you an error or warning message if you insert a too-large value into a character column. They'll just silently truncate, and you'll lose possibly vital data.

Use CHAR/VARCHAR for the basic Western European character set (ISO 8859-1 or Windows 1252). Use NCHAR/NCHAR VARYING for the alternate national-language set, which will often be Unicode.

In the past, VARCHAR's maximum length was often shorter than CHAR's. If you're upgrading an old application, it's time to see whether more columns should be VARCHAR now.

An SQL-compliant DBMS will pad with trailing spaces, if necessary, when inserting into a CHAR column. It will not strip trailing spaces when inserting into a VARCHAR column, so a VARCHAR column does not lose meaningful information about trailing spaces.

Because of the space padding, concatenating CHAR columns involves extra work.

Expressions of the form char_column = varchar_column involve data type conversions. Be data type consistent when defining columns that might be compared or joined.

Recommendation: Prefer VARCHAR/NCHAR VARYING for character string data.


     Python   SQL   Java   php   Perl 
     game development   web development   internet   *nix   graphics   hardware 
     telecommunications   C++ 
     Flash   Active Directory   Windows