Types of Indexes

In this section, we'll talk about the various kinds of indexes—compound indexes, covering indexes, unique indexes, clustered indexes—you might encounter, and how you can take advantage of them.

Compound Indexes

A compound index is an index whose keys contain values derived from more than one data column. The terms compound index, composite index, and multicolumn index all mean the same thing and are equally valid. Each of the Big Eight supports compound indexes.

The figure below shows some keys from an index that is based on two columns: surname and given_name. Notice that the keys are in order by the left-most column (surname), so a search on given_name alone gets no benefit from the index.

An extract from a compound index

graphics/09fig07.gif

Similarly, a telephone book contains both surnames and given names, but we would be foolish to look up Peter or Trudy in a phone book. It follows, then, that a compound index is good when you want to search for, or sort by, (a) the left-most column or (b) the left-most column and any of the other column(s)—and under no other circumstance.
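The left-most rule is easy to check on any DBMS that can display a query plan. Here's a minimal sketch using SQLite's EXPLAIN QUERY PLAN (the table and index names are invented for the illustration, and SQLite's plan wording varies slightly between versions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Table1 (surname TEXT, given_name TEXT, phone TEXT)")
con.execute("CREATE INDEX Index_name ON Table1 (surname, given_name)")

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN output is the "detail" text
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Left-most column: the compound index is usable
p1 = plan("SELECT * FROM Table1 WHERE surname = 'smith'")

# Second column alone: the compound index is useless; expect a full table scan
p2 = plan("SELECT * FROM Table1 WHERE given_name = 'willy'")

print(p1)  # mentions USING INDEX Index_name
print(p2)  # a plain SCAN of Table1, no index
```

The same experiment on another DBMS (with its own plan-display tool) should show the same pattern.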

Why would you want to make a compound index instead of making two indexes—one on surname and the other on given_name—and letting the DBMS merge the results? That is, why don't we have this scenario:

  1. There are two indexes, one on surname and one on given_name.

  2. The search is:

    
    SELECT * FROM Table1
    
      WHERE surname = 'smith'
    
        AND given_name = 'willy'
    
    
  3. The DBMS searches the first index and comes up with, for example, these five matches:

    
    7, 35, 102, 448, 930
    
    
  4. The DBMS searches the second index and comes up with, for example, these three matches:

    
    16, 102, 137
    
    

    Notice that the two lists of matches are in order, which is a natural result of the fact that in a B-tree keys are sorted by key value first, row locator second.

  5. The DBMS merges the two ordered sets and comes up with the matching value:

    
    102
    
    

Apparently this method is so obvious that people simply assume that it will happen. In fact it normally doesn't[3]. What really happens at step 4 is that, instead of reading the second index, the DBMS reads each of the data rows—{7, 35, 102, 448, 930}. Each time it reads a row, the DBMS looks at the value in the given_name column to see if it is willy.

[3] Some versions of IBM and Sybase actually do use this method, but here we're talking about the normal case where merges don't happen.

The error of thinking that the DBMS will take the sensible route and make full use of all available indexes is probably what leads to the phenomenon of "over-indexing" (making indexes for every column that appears in queries). Over-indexing is useless: most DBMSs pick only one index as a driver.

The way to make use of an index on a second column is to put that column in a compound index. In that case, the DBMS takes this approach:

  1. There is a compound index on (surname, given_name).

  2. The search is:

    
    SELECT * FROM Table1
    
      WHERE surname = 'smith'
    
        AND given_name = 'willy'
    
    
  3. The DBMS searches the compound index and comes up with one match:

    
    102
    
    GAIN: 7/8
    
    

The general recommendations that apply for all DBMSs, then, are:

  • Use compound indexes if queries often contain the same columns, joined with AND.

  • Use compound indexes if you have any key based on a one-character column.

  • Use up to 5 columns in a compound index. You can be sure that the DBMS will allow at least 16, but 5 is regarded as the reasonable maximum by some experts.

  • The left-most column should be the one that occurs in queries most often. This should also be the column that is the most selective.

And one more recommendation that applies to an obsolete DBMS but does no harm now:

  • The order of columns in the WHERE clause should be the same as the order in the compound index.

Covering Indexes

Suppose you have a compound index (we'll call it Index_name_sex) on Table1's columns (name, sex). You execute this SQL statement:


SELECT sex FROM Table1

   WHERE name = 'SAM'

When you do so, the DBMS will use Index_name_sex for the search because name is the left-most column in the index key. But once it has found the key {SAM, M}, the DBMS says, "Whoa, there's no need to get the data row," because sex is right there in the key. So the DBMS is able to return M directly from the index. When, as in this case, everything in the select list is also in the index that will be used, the index is called a covering index.

Ordinarily, using a covering index will save one disk read (GAIN: 6/8). The gain is automatic but only if the columns in the select list exactly match the columns in the covering index. You lose the gain if you select functions or literals, or if you put the column names in a different order.
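SQLite conveniently announces when an index covers a query, which makes the effect easy to see. A sketch, reusing the column names from the example above (exact plan wording is version-dependent):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Table1 (name TEXT, sex TEXT, notes TEXT)")
con.execute("CREATE INDEX Index_name_sex ON Table1 (name, sex)")
con.execute("INSERT INTO Table1 VALUES ('SAM', 'M', 'some notes')")

def plan(sql):
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Everything in the select list is in the index: no data-row read is needed
covered = plan("SELECT sex FROM Table1 WHERE name = 'SAM'")

# notes is not in the index, so the data row must be fetched as well
not_covered = plan("SELECT notes FROM Table1 WHERE name = 'SAM'")

print(covered)      # ... USING COVERING INDEX Index_name_sex ...
print(not_covered)  # ... USING INDEX Index_name_sex ... (not covering)
```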

Affairs get even better if you can persuade the DBMS to use a covering index regardless of whether it's searching for anything. For example, suppose you search Table1 without a WHERE clause, as in:


SELECT name FROM Table1

  ORDER BY name

Will the DBMS do the "smart" thing and scan the covering index, instead of scanning the table and then sorting? Answer: Sometimes. But Cloudscape won't use covering indexes unless it needs them for the WHERE clause anyway, and Oracle won't use an index if the result might involve NULLs. So there's a gain if you assume that NULL names don't matter and you change the search to:


SELECT name FROM Table1

  WHERE name > ''

  ORDER BY name

GAIN: 7/8

Even if you don't care about either selecting or ordering, a gain might still be achieved using a covering index. If the rows in the table are large, and the index keys are small, then it stands to reason that a scan of the entire index will be quicker than a scan of all the rows in the table. For example, if you want to list the sex column, don't use Query #1. Use Query #2 instead.


Query #1:

SELECT sex FROM Table1

 /* this causes a table scan */



Query #2:

SELECT name, sex FROM Table1

 /* this causes an index scan */

GAIN: 2/8

By using Query #2 even though you don't care about name, you speed up the search by selecting an unnecessary column. In this example, you're really using the index, not as a tool to enhance access to the table, but as a substitute for the table. Thus it's a cheap way to get some vertical partitioning.

The benefits of "treating the index as a table" are often significant, provided you keep a few sobering facts in mind:

  • DBMSs never use covering indexes when there are joins or groupings.

  • The index selection can't be used for UPDATE statements.

  • Adding frequently changed columns to indexes strictly to make them covering indexes violates the principle that indexes should not be volatile.

Unique Indexes

Recall that index selectivity is calculated as follows:

  Selectivity = number of distinct keys / total number of rows


Selectivity can be less than one if NULLs are indexed or if duplicate keys are allowed. When selectivity equals one, you have perfect selectivity—and a unique index. The name is misleading—it's not the index that's unique; it's the keys in the index that are unique. But nobody's confused about the main thing: Unique indexes are good, and optimizers love 'em. Here are some good reasons to make your indexes unique:

  • The Microsoft and Sybase optimizers give more points to indexes that have a selectivity equal to one or close to one. (Actually the cost-based optimizer depends on statistics to detect selectivity, but the rule-based optimizer just looks at the index's UNIQUE flag.) In fact, these optimizers won't even use an index that has a selectivity less than 0.1.

  • If the DBMS knows in advance that a value can have only one occurrence, it can exit the loop quickly after it has found that value. It won't bother reading and checking again.
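In code, the selectivity formula is one line. A sketch with made-up column values:

```python
def selectivity(column_values):
    # Selectivity = number of distinct keys / total number of rows
    return len(set(column_values)) / len(column_values)

sex_column = ['M', 'F', 'M', 'M', 'F', 'M', 'F', 'F']  # 2 distinct / 8 rows
key_column = [1, 2, 3, 4, 5, 6, 7, 8]                  # 8 distinct / 8 rows

print(selectivity(sex_column))  # 0.25 -- a poor index candidate
print(selectivity(key_column))  # 1.0  -- perfect selectivity: a unique index
```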

So, when the keys really are unique, should you always make the index with CREATE UNIQUE INDEX instead of just CREATE INDEX? Yes—with two exceptions.

The first exception is that you don't want to create an index at all if the DBMS will create one automatically due to a UNIQUE or PRIMARY KEY clause in the table constraints. Most DBMSs will make unique indexes automatically to enforce constraints, so you don't have to. The table below shows which DBMSs automatically create indexes when you define a constraint and whether redundant index creations are blocked.

The second exception is that you don't want to enforce uniqueness if the database is in an early stage and not all values are known yet. You might think that you could just do this:


INSERT INTO Table1 (unique_column)

  VALUES (NULL)

But—surprise!—IBM, Informix, Microsoft, and Sybase won't let you have two NULL keys in a unique index, while Ingres and InterBase won't accept any NULLs at all (see Chapter 10, "Constraints"). Get used to it: Portable programs can't use both NULLs and unique indexes.
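A sketch of both behaviors using SQLite, which happens to accept any number of NULLs in a unique index—demonstrating exactly why the construct isn't portable:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Table1 (unique_column INTEGER)")
con.execute("CREATE UNIQUE INDEX Index1 ON Table1 (unique_column)")

# Duplicate non-NULL keys are rejected by every DBMS
con.execute("INSERT INTO Table1 VALUES (1)")
try:
    con.execute("INSERT INTO Table1 VALUES (1)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

# NULL keys are where DBMSs disagree; SQLite takes as many as you like
con.execute("INSERT INTO Table1 VALUES (NULL)")
con.execute("INSERT INTO Table1 VALUES (NULL)")
null_count = con.execute(
    "SELECT COUNT(*) FROM Table1 WHERE unique_column IS NULL").fetchone()[0]

print(duplicate_allowed)  # False
print(null_count)         # 2 -- two NULLs in a unique index; not portable
```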

Notes on the table:

  • UNIQUE Index column

    This column is "Yes" if the DBMS automatically creates an index when you define a UNIQUE constraint.

  • PKEY Index column

    This column is "Yes" if the DBMS automatically creates an index when you define a PRIMARY KEY constraint.

  • FKEY Index column

    This column is "Yes" if the DBMS automatically creates an index when you define a FOREIGN KEY constraint.

  • Stops Redundant CREATE INDEX column

    This column is "Yes" if the DBMS stops you when you try to create a redundant index.

DBMSs and Auto-Creation of Indexes
            UNIQUE  PKEY    FKEY    Stops Redundant
            Index   Index   Index   CREATE INDEX
IBM         Yes     Yes     No      No
Informix    Yes     Yes     Yes     Yes
Ingres      Yes     Yes     Yes     No
InterBase   Yes     Yes     Yes     No
Microsoft   Yes     Yes     No      No
MySQL       Yes     Yes     No      No
Oracle      Yes     Yes     No      Yes
Sybase      Yes     Yes     No      No

Redundant Indexes

Before creating an index, check to see if a similar index is already there. Most DBMSs create an index silently and automatically when you define a PRIMARY KEY or UNIQUE constraint. Some do the same for a FOREIGN KEY constraint.

The auto-creation is sometimes a mistake. For example, suppose you have Table2 with an existing compound index using (column1, column2). Now suppose you do this:


ALTER TABLE Table2 ADD CONSTRAINT Constraint1

   FOREIGN KEY (column1) REFERENCES Table1

With some DBMSs, this SQL statement results in a new index even though an index is already on (column1, column2).

Redundant indexes will, of course, take up extra disk space and slow down UPDATEs, but the real worry is that the optimizer will get confused when there are several equivalent possible paths. So be prepared to drop redundant indexes.
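The prefix check is mechanical enough to script. Here's a sketch against SQLite's catalog (table and index names invented): an index whose column list is a left-most prefix of another index's column list is redundant and can be dropped.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Table2 (column1 INTEGER, column2 INTEGER)")
con.execute("CREATE INDEX Index_a ON Table2 (column1, column2)")
con.execute("CREATE INDEX Index_b ON Table2 (column1)")  # redundant!

# Read each index's columns, left to right, from the catalog
indexes = {}
for (name,) in con.execute("SELECT name FROM sqlite_master WHERE type = 'index'"):
    indexes[name] = [row[2] for row in con.execute(f"PRAGMA index_info('{name}')")]

# An index is redundant if its columns are a left-most prefix of another's
redundant = [name for name, cols in indexes.items()
             if any(other != name and indexes[other][:len(cols)] == cols
                    for other in indexes)]

print(redundant)  # ['Index_b']
for name in redundant:
    con.execute(f"DROP INDEX {name}")
```

Other DBMSs expose the same information through their own catalog views, so the idea ports even though the queries don't.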


Unique indexes can cause optimizer confusion. Here's a way to confuse a simple optimizer:

  • Define two indexes as follows:

    
    CREATE UNIQUE INDEX Index1
    
       ON Table1 (sex, social_security)
    
    
    
    CREATE INDEX Index2
    
       ON Table1 (surname)
    
    
  • Now search for:

    
    SELECT * FROM Table1
    
      WHERE surname = 'JONES'
    
        AND sex = 'M'
    
    

See the problem? Because sex is associated with a unique index, Index1 will be chosen as the driver even though Index2 is doubtless more selective. Luckily, none of the Big Eight are so stupid—provided they're in cost-based rather than rule-based mode—and none of the world's database programmers are so stupid either because they know and follow this maxim—When defining a compound index, put the most selective key on the left.

Clustered Indexes

Clustered indexes have been around since SQL was in its cradle, and six of the Big Eight—IBM, Informix, Ingres, Microsoft, Oracle, and Sybase—support them. There are two varieties, which we will call by the nonstandard names "weak" and "strong." In both varieties, the objective of clustering is to store data rows near one another if they have similar values. For example, suppose you have a one-column table—call it Table1—that has these values in column1:


'Jones'

'Smith'

'Jones'

'Abbot'

'Smith'

'Jones'

This is disorganized. If the rows could be reorganized so that this was the order:


'Abbot'

'Jones'

'Jones'

'Jones'

'Smith'

'Smith'

then there would be a better chance that all 'Jones' rows are on the same page. If all 'Jones' rows are on the same page, this query is faster because there are fewer pages to fetch:


SELECT * FROM Table1

  WHERE column1 = 'Jones'

This query is also faster, because of presorting:


SELECT * FROM Table1

  ORDER BY column1

In a cluster, data rows are lined up the way we want them for a range search, an ORDER BY or a GROUP BY. Not only is this good for selections, it also reduces the number of "hot spots"—a potential problem that we'll discuss in Chapter 15, "Locks."

Clustering speeds up queries even if rows aren't 100% in order.

Provided there is some ordering, queries will be faster on average—as long as the cluster key (column1 in our earlier example) is the one that matters for your queries. The DBMS can use a clustered index to find out what the row order should be (an index is a handy structure for the purpose because indexes are ordered). You just have to tell the DBMS that you want a clustered index and that you want column1 to be the cluster key. There can, of course, be only one clustered index per table because the data rows can be sorted in only one order.

In contrast, when a table has only a nonclustered index, pages and data rows are not stored in an order based on the index key, but rather in a heap—like the pile of papers on an untidy person's desk. Usually a new row is just dumped wherever it can fit in the file—that's why it's called a heap.

The difference between a weak-clustered index and a strong-clustered index is the fierceness with which the DBMS maintains cluster-key order. Most DBMSs have weak-clustered indexes, and they only guarantee to make a good-faith attempt to maintain order. Two of the Big Eight—Microsoft and Sybase—have strong-clustered indexes, and they maintain order forcibly, even when this requires a lot of reorganizing of both index keys and data rows. (Oracle has both weak-clustered indexes and index-organized tables; index-organized tables are in fact strong-clustered indexes, but Oracle doesn't call them that.) Because their effect is major, we will examine strong-clustered indexes more closely.

Strong-clustered indexes

The concept is simple: The leaf level of a strong-clustered index doesn't contain mere index keys. Instead, the leaf level contains the data rows themselves. The two figures below illustrate this by showing a conventional "data heap and separate index" structure and a radical "clustered index with integrated data" structure containing the same data.

A heap-and-index structure

graphics/09fig08.gif

A clustered index

graphics/09fig09.gif

The pointer from an index key to a data row is called a row locator. The structure of the row locator depends on whether the data pages are stored in a heap or are clustered. For a heap, a row locator is a pointer to the row. For a table with a strong-clustered index, the row locator is the cluster key.

Strong clustering certainly looks like a winning idea. Assume you have two tables based on the two structures shown above and execute this query on each:


SELECT * ...

  WHERE column1 = 'SUSAN'

Which query is faster? Well, the DBMS will need three I/Os if it has a heap (two I/Os to scan the index and one more to get the data row)—but only two I/Os if it has a cluster (two I/Os to scan the index and zero needed to get the data row). It doesn't really matter even if the data rows in the leaf page are likely to be twice as big as the index keys, because—remember the earlier talk about the number of layers being the important criterion?—doubling the key size doesn't double the access time.

This advantage appears for strong-clustered indexes only. With either a weak- or a strong-clustered index, data rows are in order. But the fact that the expression WHERE column1 = 'SUSAN' works with one less I/O is an additional effect that appears only with strong-clustered indexes.

To sum it up, both varieties of clustered indexes are good for GROUP BY, ORDER BY, and WHERE clauses. Strong clusters are also good for long keys (because a strong-clustered index on a long key takes no extra space for the leaf level). An extra benefit if the DBMS is Microsoft is automatic garbage collection (Microsoft doesn't have automatic garbage collection for ordinary heap tables).

INSERTs are harder with strong clusters because you have to make room between existing rows to fit in the new value according to cluster key order. But two subsequent INSERTs probably won't go to the same place, so clusters reduce contention.

UPDATEs are harder with strong clusters because any change in the cluster key means the row order can change too. But you are less dependent on ROWID with clusters; access using the cluster key is almost as quick.

There is no question that any changes that affect the cluster key will be huge and slow, because when the key value changes, the whole row must move in order to stay in order. However, there is a simple remedy: Never update the cluster key. In fact, under the hood, there is no such thing as an UPDATE of a cluster key—the DBMS translates such an UPDATE to a DELETE followed by an INSERT.
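SQLite's WITHOUT ROWID tables are the same idea as Oracle's index-organized tables—the rows live in the leaf pages of the primary-key B-tree—so they make a convenient sketch of a strong cluster (names invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The rows themselves are stored in the primary-key B-tree:
# SQLite's analogue of a strong-clustered index
con.execute("""
    CREATE TABLE Table1 (
      column1 TEXT PRIMARY KEY,
      column2 TEXT
    ) WITHOUT ROWID
""")
con.executemany("INSERT INTO Table1 VALUES (?, ?)",
                [('SUSAN', 'x'), ('ANNE', 'y'), ('MARY', 'z')])

# A cluster-key lookup goes straight down the primary-key tree --
# there is no separate fetch of a data row
plan = " ".join(row[3] for row in con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Table1 WHERE column1 = 'SUSAN'"))
print(plan)  # mentions SEARCH ... USING PRIMARY KEY
```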

Choice of clustered key

The conventional advice is that you should have a clustered index for almost any table that's not tiny or temporary. This means you have to pick a set of columns for the cluster key—and it better be a good pick, because it is extremely inconvenient to change the cluster key after the table is full of data. By definition, there can only be one cluster key because the rows can only have one order. So what columns should form the cluster key? Good news. Many have asked that question before, and a body of sage counsel can guide the picking. Unfortunately, there's also bad news—priorities differ depending on whether the DBMS has strong or weak clustering.

When clustered indexes are strong, the primary key can be the cluster key. This saves time and confusion, and is often the default behavior, so we'll just defer this piece of sage counsel to Chapter 10, "Constraints." In situations where the primary key isn't there or isn't an obvious choice for a cluster key, you should explicitly state that the primary key is nonclustered. Then choose a cluster key with these characteristics:

  • It shouldn't be volatile.

  • It shouldn't be a "serial" data type—that is, the key should not auto-increment.

  • It should be short (preferably an integer).

  • It should be unique. (Some DBMSs—for example, Microsoft—will add a uniquifier integer if the key is not unique, but that's worth avoiding.)

  • It should be the column on which you'll most often want to do range searches.

  • It might as well have the order in which you'll want to get results.

However, when clustered indexes are weak, the rule "the primary key can be the cluster key" is weak too. Here are two rules for choosing the cluster key for a weak-clustered index, especially popular among IBM experts:

  • Base the cluster key on a foreign key. The effect of clustering is that most rows with the same value for the foreign key will be together. Therefore a search like this:

    
    SELECT ... WHERE foreign_key_column = 70
    
    

    has a good chance of picking up all the matching rows with one page read. Primary-key/foreign-key joins will be faster for a similar reason. Notice that this advantage exists precisely because foreign keys, unlike primary keys, are not unique.

  • Base the cluster key on the column with which you most often will do range searches or sequential passes. For example, if you will frequently do this type of search:

    
    SELECT ... WHERE column1 BETWEEN 70 AND 150
    
      ORDER BY column1
    
    

    then column1 is a good choice for the cluster key.

Beware, though, if the possible cluster key is a monotonically sequential, or serial key—for example, an integer whose values always go up and always go up by the same amount, when you INSERT. The problem with monotonics (which are sometimes hidden behind names like "identity" or "timestamp" columns—see Chapter 7, "Columns") is that the nth row and the (n + 1)th row will both fit in the same page—so there might be contention for the same resource. Usually, in multiuser systems, you want rows to be dispersed.

This above all—The benefits of clustered indexing depend on a monster assumption; namely, that you want to retrieve and sort by the cluster key far, far more often than with any other key.

Secondary indexes to a strong-clustered index

Consider this scenario. You've defined Table1 with a strong-clustered index—the cluster key is column1—and inserted the following data:

Table1.column1 Table1.column2
1 Willy
10 Margaret
100 Fiona

Now you want to add a secondary index—a nonclustered index on column2. Here's the problem. Because each row of the table is in an index leaf page, it can move within the page whenever another row is inserted or deleted above it, and it can even move to a different page if there's an index split. If that happens, the column2 pointers are useless because the pointer part of an index key (aka row locator or bookmark) is normally a row identifier (aka ROWID or RID). Recall from Chapter 5, "Joins," that a row identifier is a physical address, such as {File number, Page number, Row number within page}. But the physical address of the row can change—so a ROWID is unreliable.

One solution to this problem—the old Oracle8 solution—is to disallow the creation of any nonclustered indexes to a strong-clustered table.[4] This has the unfortunate effect of limiting the usefulness of strong-clustered indexes, and in fact they're rarer in Oracle shops than in Microsoft shops. You can write your own schemes for doing "manual joins," and so on, but that's no fun.

[4] The latest Oracle incarnation—Oracle9i—now allows secondary indexes to clustered tables. Oracle uses the term "index-organized tables" where we use "strong-clustered index" and insists that the cluster key be the table's primary key—but the differences are at a detailed level, and we won't go into them here.

Another solution—the Microsoft solution—is to make the row locator be the cluster key. That way, when the DBMS looks up column2, it ends up with the cluster key value that uniquely identifies the row: the value of column1. For example, suppose you ask for information about Fiona:


SELECT * FROM Table1

   WHERE column2 = 'Fiona'

Now the DBMS can look up the data row in the clustered index, by searching for the column1 value—an elegant solution. If the clustered index and the secondary index both have three levels, the DBMS needs (3 + 3) six I/Os to reach the data row, given the column2 value.

The search would be faster if there were a ROWID. In fact, a heap is "as good" (same I/O) if you're always looking for a unique value. So let's suppose this—You have a table with only two unique or nearly unique indexes, and access is equally likely via either index. In that case, you're better off with a heap.

This may sound counter-intuitive, so let's examine the situation more closely. Here, once again, are our assumptions:

  • Both indexes are unique, or nearly so.

  • Both indexes have three levels.

  • Both indexes are equally likely to be chosen.

Now, if you're looking up in a heap and the number of index levels is always three, then the average I/O count for the lookup is four—three I/Os to access via either index and one I/O to access the heap page, given the ROWID. On the other hand, if you're looking up in a cluster, the time to look up with the cluster key is three I/Os, and the time to look up via the secondary index is six as shown earlier. The average when either index is equally likely is thus 4.5 I/Os—((3 I/Os + 6 I/Os)/2). This is more than the four I/Os needed for a heap, and so a heap is better if access is equally likely with either index. Still, Microsoft recommends clustered indexes "always."
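The arithmetic, as a sketch:

```python
# Assumptions from the text: both indexes have three levels, keys are
# unique, and either index is equally likely to drive the lookup
index_levels = 3

# Heap: 3 index I/Os, then 1 I/O to fetch the row via its ROWID,
# whichever index drives
heap_avg = index_levels + 1

# Strong cluster: 3 I/Os via the cluster key, but 3 + 3 via the secondary
# index (it yields a cluster key, which must then be looked up in turn)
via_cluster_key = index_levels
via_secondary = index_levels + index_levels
cluster_avg = (via_cluster_key + via_secondary) / 2

print(heap_avg)     # 4
print(cluster_avg)  # 4.5
```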

While we're on the subject of strong-clustered indexes, here are some tips on their use:

  • Remember that a nonclustered index on a table points to the leaf page of the table's clustered index (because the clustered index's leaf page is the same as the data page). It is therefore a duplication if you have a column in the nonclustered index that's also in the clustered index. For example, suppose you have a strong-clustered index on (emp_id, sex) and are considering adding another index on (surname, sex). Don't. Since sex is already "covered" in the clustered index, make the nonclustered index on (surname) alone.

  • A covering index is even more important because you're not just saving one I/O, you're saving three or four.

  • If you want quick access to a table's cluster key columns alone, then make a nonclustered index of the same columns. The index must be a covering index, because you don't want to access the actual data rows. The index must also have a short key, because the advantage lies in the fact that the nonclustered index should have fewer layers than the clustered index. This tip will work, but if the clustered index is so big that it has too many levels, you should solve that problem with normalizing or partitioning, which we discussed in Chapter 8, "Tables."

The Bottom Line: Types of Indexes

A compound index is an index whose keys contain values derived from more than one data column. A compound index is good when you want to search for, or sort by (a) the left-most column or (b) the left-most column and any of the other columns.

Over-indexing is useless because the DBMS often picks only one index as a driver. The way to make use of an index on a second column is to put the column in a compound index.

Use compound indexes if queries often contain the same columns, joined with AND.

Use compound indexes if you have any key based on a one-character column.

The left-most column in a compound index should be the one that occurs in queries most often. It should also be the most selective column.

When everything in the select list is in the index that will be used, the index is called a covering index. Ordinarily, using a covering index will save one disk read, but only if the columns in the select list exactly match the columns in the covering index. You lose the gain if you select functions or literals, or if you put the column names in a different order.

Whenever possible, persuade the DBMS to use a covering index regardless of whether it's searching for anything.

If the rows in the table are large, and the index keys are small, persuade the DBMS to use a covering index.

The benefits of "treating the index as a table" are often significant. Just remember that (a) DBMSs never use covering indexes when there are joins or groupings, (b) the index selection can't be used for UPDATE statements, and (c) adding frequently changed columns to indexes strictly to make them covering indexes violates the principle that indexes should not be volatile.

Index selectivity is calculated as:

  Selectivity = number of distinct keys / total number of rows


Selectivity can be less than one if NULLs are indexed, or it can be less than one if duplicate keys are allowed. When selectivity equals one, you have perfect selectivity—and a unique index.

If the DBMS knows in advance a value can occur only once, it can exit the loop quickly after it has found that value. It won't bother reading and checking again. So unique indexes can save time.

Be prepared to drop redundant indexes.

When defining a compound index, put the most selective key on the left.

There are two main types of clustered indexes: "strong" clusters and "weak" clusters.

Strong clusters have these characteristics:

  • The leaf level of a strong-clustered index doesn't contain mere index keys. Instead, the leaf level contains the data rows themselves.

  • All data rows are in order.

  • Strong clusters are good for GROUP BY, ORDER BY, and WHERE clauses, and for long keys.

  • INSERTs are harder with strong clusters because you have to make room between existing rows to fit in the new value according to cluster key order. But two subsequent INSERTs probably won't go to the same place, so clusters reduce contention.

  • UPDATEs are harder with strong clusters because any change in the cluster key means the row order can change too. But you are less dependent on ROWID with clusters; access using the cluster key is almost as quick.

  • Updates that affect the cluster key will be huge and slow.

  • The primary rule for choosing a strong cluster key is—Let the cluster key be the primary key.

  • The secondary rule for choosing a strong cluster key (for situations where the primary key isn't an obvious choice) is—Choose a cluster key with these characteristics: (a) not volatile, (b) short and preferably an integer, (c) unique, (d) on which you'll most often want to do range searches, and (e) in the order in which you'll want to get results.

  • If you want quick access to a table's strong cluster-key columns alone, make a nonclustered index of the same columns. The index must be a covering index and must have a short key.

Weak clusters have these characteristics:

  • Data rows may be in order, but the DBMS does not guarantee this.

  • Weak clusters are good for GROUP BY, ORDER BY, and WHERE clauses.

  • The primary rule for choosing a weak cluster key is—Base the cluster key on a foreign key or base the cluster key on the column with which you most often will do range searches or sequential passes.

Beware of the cluster key that is monotonically sequential. The problem with monotonics (aka serials) is that the nth row and the (n + 1)th row will both fit in the same page—so there might be contention for the same resource. Usually, in multiuser systems, you want rows to be dispersed.

The benefits of clustered indexing depend on a monster assumption—that you want to retrieve and sort by the cluster key far, far more often than with any other key. Choose your cluster keys accordingly.

