July 9, 2011, 9:49 p.m.
posted by blackhat
Suppose you have an unsorted list and a sorted list, as shown in Figure.
Consider 'Belgrade' in Figure's unsorted list. How does the DBMS know if there is a duplicate? Answer: The DBMS must compare 'Belgrade' with both 'Sofia' and 'Budapest' because it can't be sure about duplicates until all the entries in the list have been checked.
Now consider the same question for 'Belgrade' in Figure's sorted list. This time the answer is that the DBMS must compare 'Belgrade' with 'Budapest' alone—as soon as it has compared [n] to [n + 1], the DBMS is done. This means that grouping is fast if the list is sorted. A good strategy, then, is to sort the list for grouping purposes.
Another good strategy is to make a hash list so that equality can be determined with a single hash lookup. We looked at grouped output from the Big Eight and saw that seven of them returned the results in sorted order—a clear indication that they used sorts for grouping. The eighth DBMS—Informix—used hashing.
Because most DBMSs do sort before grouping, you can help the process by listing GROUP BY and ORDER BY columns in the same order. For example, don't write your query like this:
SELECT * FROM Table1 GROUP BY column1, column2 ORDER BY column1
Write it like this instead:
SELECT * FROM Table1 GROUP BY column1, column2 ORDER BY column1, column2 GAIN: 2/7
Don't do this for Informix; it shows a loss. The gain shown is for only seven DBMSs.
Given that it helps GROUP BY to work on a sorted list, let's consider whether it's advantageous to have the list sorted in advance. In other words, let's consider indexes.
Most DBMSs will use indexes for grouping, but the GROUP BY clause has a lower priority than JOIN or WHERE clauses. For example, this SQL statement does not use an index to speed up the GROUP BY clause:
SELECT column1 FROM Table1 WHERE column2 = 55 GROUP BY column1
The reason no index is used should be clear: The DBMS must process the WHERE clause first, and it does so via an index on column2. But after the DBMS has done that, there is no point in looking at the index on column1 because the column1 index is on the whole table—not on the filtered result of the WHERE clause. Because of this, indexes are useful only for very simple grouping. Luckily, a significant number of SQL statements are that simple, so DBMSs have introduced optimizations for the following cases.
GROUP BY alone
GROUP BY itself will use an index if the grouping columns appear in the same order as the index columns. Indexes with DESC columns won't be useful. For example, this SELECT is faster if column1 is indexed:
SELECT column1 FROM Table1 GROUP BY column1 GAIN: 4/8 if column1 indexed
A MIN or MAX function will use an index if the function is alone in the select list and there is nothing after the FROM clause. If there's an index, MIN() can work by getting the first value in the index, and MAX() can work by getting the last value in the index (or by getting the first value in a descending index, in the case of IBM). For example, DBMSs work very quickly with:
SELECT MIN(column1) FROM Table1 GAIN: 6/8 if column1 indexed
Alas, this statement is much harder and slower:
SELECT MIN(column1), MAX(column1) FROM Table1
For queries like this, it's better either to split the statement in two with UNION or use two separate SELECTs, like this:
SELECT MIN(column1) FROM Table1 SELECT MAX(column1) FROM Table1 GAIN: 8/8
A COUNT function will use an index if only one column (or *) is referenced in the query and no clauses follow the FROM clause, for example:
SELECT COUNT(*) FROM Table1 GAIN: 2/8 if indexed
On average the gain is around 2/8. (Recall that this doesn't mean that DBMSs are 2/8 faster; it means that two of the Big Eight are faster, and the rest show no change.) One of the DBMSs that shows no change for this example is Oracle. The lack of a gain is because Oracle indexes don't include NULLs. Thus, for any query whose results might differ if NULLs are present, Oracle can't use an index. This is one of the prices that Oracle pays in return for smaller indexes and faster updates. It can help if you specify that all your grouping columns are NOT NULL. In theory, it would also help if you use COUNT(column) instead of COUNT(*) because COUNT(*) includes NULLs—but in practice it won't help, because COUNT(*) has a special optimization of its own, which we'll look at later in this chapter. (By the way, COUNT(*) usually works far better if you've updated statistics recently.)
A SUM or AVG function will use an index if you make sure that only one column is referenced in the query and that the column is indexed, as in this example:
SELECT SUM(column1) FROM Table1 WHERE column1 > 5
Note that the search condition here is what's important for a covering index (described in Chapter 9, "Indexes"). But SUM and AVG functions don't show a gain if the DBMS uses indexes. In fact, there will often be a loss. This is unfortunate, but what you lose with SUM and AVG will be more than made up by the gains that you'll see with COUNT, MAX, and MIN.
The Bottom Line: Sorting
A good grouping strategy is to sort the list first. Grouping is fast if the list is sorted.
Most DBMSs sort before grouping. You can help the process by listing GROUP BY and ORDER BY columns in the same order.
Most DBMSs will use indexes for grouping. But the GROUP BY clause has a lower priority than JOIN or WHERE clauses so indexes are useful only for very simple grouping.
GROUP BY will use an index if the grouping columns appear in the same order as the index columns.
Set functions will use an index if the function is alone in the select list and there is nothing after the FROM clause. For COUNT, MAX, and MIN you'll see a gain if an index is used. For AVG and SUM, you'll sometimes see poorer performance with an index, but the gains on the other set functions outweigh the loss.