General Sort Considerations





General Sort Considerations

Sorting has been a subject of research for decades, and it is not difficult to find a reasonable recipe nowadays. What we expect, at a minimum, is that each of the Big Eight act this way:

  • If physical records are large and ORDER BY columns are small, DBMSs will start by extracting the essential information, loading it into memory, then performing a "tag sort" which compares only the extracted snippets. This procedure makes it unnecessary to reread a whole physical record every time a comparison is performed. The DBMS may have to do large dynamic-memory allocations during the process.

  • DBMSs will use one of the better algorithms, such as tournament sort. These algorithms often have a counter-intuitive flaw: If all the rows are already in order or nearly in order, it won't help much because the speed depends very little on the order of the input data, and very much on other factors.

All right, so what does affect sort speed? Here's the answer, in order of importance:

  • Number of rows selected

  • Number of columns in the ORDER BY clause

  • Length of the columns in the ORDER BY clause

Figures 3-1, 3-2, and 3-3 show the effect, on sort speed, of increasing each of these variables.

Effect on sort speed of increasing row count; Test #1

graphics/03fig01.gif

Effect on sort speed of increasing column count; Test #2

graphics/03fig02.gif

Effect on sort speed of increasing column length; Test #3

graphics/03fig03.gif

Testing ORDER BY speed

To test a single factor X, it's necessary to keep all other factors equal while rerunning SQL statements with different values of X. We tried. First we created a table containing CHAR columns. We populated the table using a randomizer that generated equal amounts of the letters A through G, with the result that some columns contained duplicate values or at least began with the same letters. Filler columns were added as needed to ensure that total row length was the same in all tests. Then we ran these three tests:

Test #1, result shown in Figure 3-1

Number of rows: Varying from 1,000 to 10,000

Number of columns: Fixed at one

Column length: Fixed at CHAR(10)

SQL statements: Fixed at one:


SELECT * FROM Table1 ORDER BY column1

Test #2, result shown in Figure 3-2

Number of rows: Fixed at 1,000

Number of columns: Varying from one to ten

Column length: Fixed at CHAR(10)

SQL statements: Varying from one to ten:


SELECT * FROM Table1 ORDER BY column1

SELECT * FROM Table1 ORDER BY column1, column2

SELECT * FROM Table1 ORDER BY column1, column2, column3

etc.

Test #3, result shown in Figure 3-3

Number of rows: Fixed at 1,000

Number of columns: Fixed at one

Column length: Varying from CHAR(1) to CHAR(10)

SQL statements: Fixed at one:


SELECT * FROM Table1 ORDER BY column1

The timing numbers shown in Figures 3-1, 3-2, and 3-3 are the result and are averages for the Big Eight. Tests were run on a single-processor machine using whatever memory the DBMS chose for its default configuration.

It's no surprise that an increase in the row count has a geometric effect—if you increase the number of rows by a factor of ten, the job takes about twenty times as long to complete. We had hoped for a better result. That's achievable by adding more processors and more memory, because DBMSs can sort groups of tags in parallel threads and will thresh less if all tags are in memory.

Surprisingly, an increase in the column count has a more drastic effect than an increase in the column length. Remember, total number of bytes was the same in both cases, so the effect is more than illusion. The likely explanation is that our DBMSs are using multipass algorithms, comparing only one column at a time.

The bottom line: Take drastic action to reduce row count—for example, by selecting only parts of a table at a time. Take severe action to reduce column count—for example, by concatenating two columns instead of specifying them separately. Take moderate action to reduce column length—for example, by using the SUBSTRING function. It's also slightly helpful if values are presorted and unique on the first few characters, but Christmas comes only one day a year.

We make several more statements about what happens with a sort in the following sections.

Partial duplicates slow sorts

Suppose you have a table with a CHAR column. Populate the column with data in which the first five characters are duplicates. Now SELECT:


SELECT column1 FROM Table1

  ORDER BY column1

Delete all rows and repopulate, this time with random data. SELECT again:


SELECT column1 FROM Table1

  ORDER BY column1

GAIN: 5/8

Skipping the duplicated characters made the sort faster for five DBMSs.

Presorting speeds sorts

Go back to your table with a CHAR column. Populate the column with random data and SELECT:


SELECT column1 FROM Table1

  ORDER BY column1

Delete all rows and repopulate, this time with data that's in alphabetic order and SELECT:


SELECT column1 FROM Table1

  ORDER BY column1

GAIN: 4/8

Having the rows already in order made the sort faster for half of the Big Eight.

It's the defined length that matters

Remember that a variable-length column has a defined length and an actual length. For example, a VARCHAR(30) column may contain ABC, in which case its defined length is 30 characters and its actual length is three characters. For sorting, it's the 30-character definition that matters! We tested this by varying the defined length of a VARCHAR column, while keeping the actual data length the same. The result was that Scenario #1 is slower than Scenario #2 even when the contents are the same in both cases:


Scenario #1:

CREATE TABLE Table1 (

   column1 VARCHAR(100))



SELECT column1 FROM Table1

  ORDER BY column1



Scenario #2:

CREATE TABLE Table2 (

   column1 VARCHAR(10))



SELECT column1 FROM Table2

   ORDER BY column1

GAIN: 6/8

Moral

Defining VARCHAR columns with "room to grow" degrades sorts. The gain might seem surprising, but it's consistent with the way that most sort algorithms work: They allot a fixed memory buffer in advance, and the size of the buffer depends on the maximum anticipated key size.


INTEGERs beat SMALLINTs

On Windows machines, where an INTEGER requires 32 bits and a SMALLINT requires 16 bits, the tendency may be to think that SMALLINT sorting is faster than sorting INTEGER columns. But comparisons of integers are usually faster because 32 bits is your computer's word size and DBMSs are able to take advantage of the fact. So Scenario #1 is slower than Scenario #2:


Scenario #1:

CREATE TABLE Table1 (

   column1 SMALLINT)



SELECT column1 FROM Table1

  ORDER BY column1



Scenario #2:

CREATE TABLE Table2 (

   column1 INTEGER)



SELECT column1 FROM Table2

   ORDER BY column1

GAIN: 5/8

INTEGERs beat CHARs

We sorted a CHAR(4) column, then we sorted an INTEGER column, both with random data. The INTEGER sort was often faster, because a character string cannot be compared four bytes at a time—but this is done for an integer. Thus Scenario #1 is slower than Scenario #2:


Scenario #1:

CREATE TABLE Table1 (

   column1 CHAR(4))



SELECT column1 FROM Table1

  ORDER BY column1

Scenario #2:

CREATE TABLE Table2 (

   column1 INTEGER)



SELECT column1 FROM Table2

   ORDER BY column1

GAIN: 4/8

Sets beat multisets

Technically, a multiset differs from a set because it contains duplicates. Testing against two tables with the same number of rows, but with significant numbers of duplicates in the first table, we found, once again, that sorting is faster when there are no duplicates.

Conclusion

The fastest sort is an ascending sort of an integer with unique values, presorted.

The ORDER BY Clause

A simplified description of the SELECT ... ORDER BY syntax looks like this:


SELECT <column list>                   /* the select list */

  FROM <Table list>

  ORDER BY <column expression> [ASC | DESC] [,...]

This syntax has a new wrinkle. In SQL-92, you had to ORDER BY <column name>, and the column named had to be in the select list. In SQL:1999, you're allowed to ORDER BY <column expression>. For example, to sort numbers in backwards order, you've now got these two choices:


With New SQL:1999 Wrinkle:

SELECT numeric_column

   FROM Table1

   ORDER BY numeric_column * -1



Without New Wrinkle:

SELECT numeric_column, numeric_column * -1 AS num

   FROM Table1

   ORDER BY num

Two incidental remarks about this example:

  • In theory, an SQL:1999 DBMS silently adds numeric_column * -1 to the select list just before sorting and takes it out just after sorting.

  • The result of ORDER BY numeric_column * -1 can differ slightly from the result of ORDER BY numeric_column DESC. The difference, as usual, is caused by the presence of NULLs.

Figure shows the SQL Standard requirements and the level of support the Big Eight have for ORDER BY.

Notes on Figure:

  • Max Columns column

    Shows how many columns may be listed in the ORDER BY clause.

    • For Sybase, our tests showed it was possible to sort up to 31 columns. This differs from Sybase's response to JDBC's getMaxColumnsIn Order By call, which returns 16.

  • Max Bytes column

    Shows the maximum allowed length, in bytes, of an ORDER BY clause.

    This column shows a significant number. If column1 is defined as CHAR(500) and Max Bytes shows a value of "500" then ORDER BY column1, column2 is ineffective because column2 doesn't get into the tag.

    ANSI/DBMS ORDER BY Support
      Max Columns Max Bytes NULLs Sort Sort LOBs ORDER BY Expression COLLATE Clause
    ANSI SQL N/S N/S Low or High No Yes Yes
    IBM 500 253 High No Yes No
    Informix >=1000 >=1000 Low No No No
    Ingres 300 >=1000 High No Yes No
    InterBase 254 254 At End Yes No Yes
    Microsoft >=1000 >=1000 Low No Yes Yes
    MySQL >=1000 >=1000 Low Yes Yes No
    Oracle 254 >=1000 High No Yes Yes
    Sybase 31 >=1000 Low No Yes No

  • NULLs Sort column

    Shows where the DBMS places NULLs in a sorted list. This column is "Low" if NULLs are considered less than all other values, "High" if NULLs are considered greater than all other values, and "At End" if the DBMS places NULLs at the end of a sorted list both when you ORDER BY...ASC and when you ORDER BY...DESC.

    • For InterBase, our tests showed that NULLs sort At End. This differs from InterBase's response to (a) JDBC's NullsAreSortedAtEnd call, which returns false and (b) JDBC's NullsAreSortedHigh call, which returns true.

    • For MySQL and Sybase, our tests showed that NULLs sort Low. In the first case, this differs from MySQL's response to (a) JDBC's NullsAreSortedLow call, which returns false and (b) JDBC's NullsAreSortedAtStart call, which returns true. In the second case, this differs from Sybase's response to (a) JDBC's NullsAreSortedLow call, which returns false and (b) JDBC's NullsAreSortedHigh call, which returns true.

    • For Oracle, our tests showed that NULLs sort High. This differs from Oracle's response to (a) JDBC's NullsAreSortedLow call, which returns true and (b) JDBC's NullsAreSortedHigh call, which returns false.

  • Sort LOBs column

    This column is "Yes" if the DBMS allows large object data (i.e., BLOB, CLOB, NCLOB, or the nonstandard SQL extensions TEXT, IMAGE, BINARY, GRAPHIC, etc.) to be sorted.

  • ORDER BY Expression column

    Shows whether the DBMS allows ORDER BY to contain columns or expressions that are not in the select list. This column is "Yes" if the DBMS supports SQL:1999-style statements like either of these two:

    
    SELECT column1 FROM Table1 ORDER BY column2
    
    SELECT column1 FROM Table1 ORDER BY <function>(column1)
    
    
  • COLLATE Clause column

    This column is "Yes" if the DBMS supports SQL Standard-style COLLATE clauses, or Oracle-style NLSSORT() function calls, or a CAST to a different character set with a different collation in ORDER BY, like this:

    
    SELECT column1, column2 FROM Table1
    
      ORDER BY column1 COLLATE SQL_Latin1_General
    
    

For many DBMSs, NULLs sort Low—that is, NULLs are considered to be less than the smallest non-NULL value. For IBM, Ingres, and Oracle, NULLs sort High—that is, NULLs are considered to be greater than the largest non-NULL value. For InterBase and PostgreSQL, NULLs sort At the End—whether you're sorting in ascending or in descending order. That means the set of values {-1, +1, NULL} sorts four different ways depending on DBMS and depending on your sort order:

  • {NULL, –1, +1}

    /* result after ORDER BY column1 for many DBMSs, including Informix, Microsoft, MySQL, and Sybase */

  • {–1, +1, NULL}

    /* result after ORDER BY column1 for IBM, Ingres, InterBase, Oracle, PostgreSQL */

  • {+1, –1, NULL}

    /* result after ORDER BY column1 DESC for Informix, InterBase, Microsoft, MySQL, PostgreSQL, and Sybase */

  • {NULL, +1, –1}

    /* result after ORDER BY column1 DESC for IBM, Ingres, Oracle */

The result is that ORDER BY column1 * -1 and ORDER BY column1 DESC return different results (with NULL in a different spot) unless you use InterBase or PostgreSQL.

The use of expressions in ORDER BY is not 100% portable, but some useful expressions can help speed or clarity:

  • ORDER BY LOWER(column1)

    Useful if case-insensitive sorting is unavailable.

  • ORDER BY SUBSTRING(column1 FROM 1 FOR 6)

    Rough sorts are faster because tags are small.

  • ORDER BY CAST(column1 AS CHAR...)

    Useful if column1's data type is not sortable.

Such functions would be unacceptable if the DBMS called them multiple times. However, the DBMS evaluates the sort-key value only once per input row, at the time it forms the tags. To make sure of this, we tested ORDER BY with a user-defined function that simply counted how many times the function was called. Six of the Big Eight called the function only once per row (GAIN: 6/6).

Portability

Informix and InterBase don't allow expressions in ORDER BY. The gain shown is for only six DBMSs.


To Sort or Not to Sort

"Do not use ORDER BY if the query has a DISTINCT or GROUP BY on the same set of terms, because they have the side effect of ordering rows."

—Kevin Kline et al., Transact-SQL Programming, O'Reilly & Associates

Sometimes there is a temptation to follow such advice and skip ORDER BY if it's certain that the rows are in order anyway. Our tests showed this about such assumptions:

  • SELECT column1 FROM Table1

    is returned in order by column1 if Table1 is clustered and column1 is the cluster key (see Chapter 9, "Indexes") or is otherwise preordered.

  • SELECT column1 FROM Table1 WHERE column1 > -32768

    is returned in order by column1 if column1 is indexed (in ASC order) and the DBMS makes use of the index.

  • SELECT DISTINCT column1 FROM Table1

    is returned in order by column1 if column1 is not unique.

If you add ORDER BY column1 in any of these three cases, the SELECT is always slower (AVERAGE GAIN: 5/8 without ORDER BY). This suggests that DBMSs will not remove unnecessary ORDER BY clauses automatically. However, our tests also showed that the effect is unpredictable with more complex statements that contain any of the following: joins, unions, multiple columns in the select list, long columns, columns indexed with a descending index, or columns requiring secondary sorts (we'll talk about secondary sorts later in this chapter). Finally, we failed to find that any such side effect was documented in any vendor manual. So we urge caution.

The Bottom Line: General Sorts

The three variables that affect sort speed are, in order of importance:

  • The number of rows you select

  • The number of columns you put in the ORDER BY clause

  • The defined length of the columns you put in the ORDER BY clause

An increase in the row count has a geometric effect on sort speed. If you multiply the number of rows by ten, the job takes twenty times as long. Take drastic action to reduce the number of rows you sort.

Take severe action to reduce the number of sorted columns.

Take moderate action to reduce the length of sorted columns.

The fastest sort is an ascending sort of a presorted integer with unique values.

Partial duplicates slow sorts.

Presorting speeds sorts.

It's the defined length that matters.

Some DBMSs sort NULL high. Some DBMSs sort NULL low. Some DBMSs sort NULL at the end of a list. Since there's no standard way of sorting NULL; don't write code that depends on the DBMS putting all the NULLs in a specific place.

The use of expressions in ORDER BY is not 100% portable. But using expressions like ORDER BY LOWER(column1), ORDER BY SUBSTRING(column1 FROM 1 FOR 6), and ORDER BY CAST(column1 AS CHAR...) can help speed or clarity.


SELECT column1 FROM Table1

returns a result set in order by column1 if Table1.column1 is clustered or otherwise preordered.


SELECT column1 FROM Table1 WHERE column1 > -32768

returns a result set in order by column1 if column1 is indexed and the DBMS makes use of the index.


SELECT DISTINCT column1 FROM Table1

returns a result set in order by column1 if column1 is not unique.

"Omit ORDER BY" is a popular tip, and it actually has an effect—but the effect is unreliable. This is fine if the ordering is for cosmetic purposes, or if the number of rows is tiny, but in other cases… . Judge the tip's merits for yourself.


     Python   SQL   Java   php   Perl 
     game development   web development   internet   *nix   graphics   hardware 
     telecommunications   C++ 
     Flash   Active Directory   Windows