The Storage Hierarchy

You can doubtless think of many examples of storage hierarchies in ordinary life. For example, people live in neighborhoods, which are in towns, which are in regions, countries, continents, and so on up the line. The relations are generally many-to-one, although there are occasional one-to-one correspondences (e.g., Australia is both a country and a continent), and occasional exceptions (e.g., a person can straddle a city boundary).

Figure shows the storage hierarchy—the physical constructs of a database. The hierarchy of physical objects suggests that—with occasional one-to-one correspondences or exceptions—data rows live in pages, which are in extents, which are in files, tablespaces, and databases. There is a reason for each level of grouping. To see what the reason is, we'll go through each of those objects in order, up the line.

Physical constructs; the storage hierarchy



Pages

Depending on the DBMS, a page is also called a data block, a block, a blocking unit, a control interval, and a row group.

A page is a fixed-size hopper that stores rows of data. Pages have four common characteristics, which are not true by definition but are always true in practice. They are:

  • All pages in a file have the same size. Indeed for some DBMSs, it is true that all pages in all files have the same size, but the usual case is that you have a choice when making a new object.

  • The choice of page sizes is restricted to certain multiples of 1024 (1KB), in a range between 1024 and 65536—that is, between 1KB and 64KB.

  • The optimum page size is related to the disk system's attributes. Smaller page sizes such as 2KB were once the norm, but disk capacities keep growing, so today 8KB is reasonable and 16KB will be the next step up.

  • Pages contain an integral number of rows. Even for the rare DBMSs that allow large rows to overflow into later pages, the very strong recommendation is that you should avoid overflow.

One other thing about pages is usually true in practice and is always recommended—All rows in a page are rows for the same table. It is true that IBM and Microsoft allow mixing rows from different tables into the same page, while Oracle allows mixing in a special condition called clustering, but these exceptions have advantages in special cases only. Figure shows what the Big Eight do with pages. The most interesting fact is how slight the differences are.

Notes on Figure:

  • Default Page Size column

    Shows the DBMS's default page size, in KB.

  • Other Allowed Page Sizes column

    If the DBMS allows you to change the default page size when creating an object, this column shows the page size options (in KB) available. This column is "No" if the DBMS doesn't allow you to change the page size.

  • Overflow Allowed column

    This column is "Yes" if the DBMS allows a too-large row to span more than one page.

  • Mixing Allowed column

    This column is "Yes" if the DBMS allows rows from different tables to be on the same page, is "Clu" if that's possible but only for a clustered file, and "No" otherwise.

DBMSs and Pages

           Default    Other Allowed        Overflow   Mixing
           Page Size  Page Sizes (KB)      Allowed    Allowed
IBM        4KB        4, 8, 16, 32         No         Yes
Informix   4KB        No                   No         No
Ingres     2KB        2, 4, 8, 16, 32, 64  No         No
InterBase  1KB        2, 4, 8              Yes        No
Microsoft  8KB        No                   No         Yes
MySQL      1KB        No                   No         No
Oracle     8KB        2, 4, 8, 16, 32      Yes        Clu
Sybase     2KB        2, 4, 8, 16          No         No

Now that we know what a page is, let's look at what a page is for:

  • A page is a minimal unit for disk I/O.

    That doesn't mean that DBMSs read and write one page at a time—in fact most DBMSs will read several pages at once—but a page is the smallest amount that a DBMS could read, and the smallest amount that a DBMS usually writes. There is no such thing as a disk read of a mere row! The objective is to make a page I/O synonymous with an I/O of the minimal unit that the disk drive uses. For example, on MS Windows NT with more than 64MB RAM, a common sector size is 512 bytes, there is one sector per cluster, and 8KB is an optimum read size (assuming the NT File System, NTFS). On a hand-held machine, a 1KB page is better for footprint reasons.

  • A page is a unit for locking.

    It's no longer the minimal unit for locking, because most DBMSs allow locking at the row level too. But page locking is traditional and convenient. We'll talk more about page locks in Chapter 15, "Locks."

  • A page is a unit for caching.

    A read is a transfer from disk to memory. Once the DBMS has a page in memory, it will want to keep it there for a while. It keeps the page in a buffer, or buffer pool—an in-memory copy of a bunch of pages, with a fixed size. The DBMS maintains the buffer pool itself (it does not rely on the operating system, though any caching by the operating system is a nice bonus); it generally reads several pages at once into the buffer and throws out individual pages according to a Least-Recently-Used (LRU) algorithm. (If the cache holds 100 pages, and you have a 101-page table, then repeated full-table scans will defeat the LRU algorithm. There are nonstandard ways to tweak the cache size or to lock tables in cache.)

  • A page is a unit for administration.

    By that we mean that DBMSs store a significant amount of housekeeping information in a page, outside the row data. This housekeeping information will include a header (which may contain row offsets, overflow pointers, identifications, last modified date, NULL bits, some indication of size, etc.), some free space, and sometimes a trailer.
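The cache-defeating scan mentioned above is easy to simulate. Here is a deliberately tiny model in Python (no real DBMS buffer pool is this simple) showing a 100-page LRU cache scoring zero hits on repeated scans of a 101-page table:

```python
from collections import OrderedDict

class PageCache:
    """A minimal LRU page cache: holds up to `capacity` pages."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> page contents (ignored here)
        self.hits = 0
        self.misses = 0

    def read(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)     # mark most-recently-used
            self.hits += 1
        else:
            self.misses += 1
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict least-recently-used
            self.pages[page_id] = None

# Two full scans of a 101-page table through a 100-page cache:
cache = PageCache(capacity=100)
for _ in range(2):
    for page in range(101):
        cache.read(page)

print(cache.hits, cache.misses)  # 0 202
```

Every read misses: by the time a page is needed again, LRU has already evicted it. That is why DBMSs offer nonstandard ways to tweak the cache size or pin tables in cache.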

Because a page is a minimum unit, when you ask for a single row, the DBMS will access a page. The only thing you'll see is the single row, but now you know what's really going on. And you can use that knowledge. Here's what it implies:

  • Big rows waste space.

    Given that the size of the header and trailer is usually about 100 bytes, a 2KB page has about 1,950 bytes for storing row data. So if your row size is 1,000 bytes, you can fit one row per page, and there will always be 950 bytes of free space. But halve the row size to 500 bytes, and you can fit three rows per page (500 * 3 = 1500) and still have 450 bytes of free space left over. Halve the row size again, and even less space will be wasted.[1]

    [1] The precise overhead is different with each DBMS. For example, Ingres requires 38 bytes per 2KB page plus two bytes per row, while Oracle requires between 84 and 107 bytes per page.

  • Two get in for the price of one.

    When you access two rows on the same page, the second access is free. Well, maybe not free, but the point is you've already paid for it—the DBMS has the whole page in the cache. You can't specify which page a row goes into, but you can improve your luck by doing two INSERTs together, by fetching in the DBMS's default order instead of asking for ORDER BY, or by sequencing (that is, forcing rows to be close together by using SERIAL data types or IDENTITY columns or auto-incrementing objects). Most obviously, you could just shorten rows so that more are in a page.[2]

    [2] Or you can make your pages larger. For some DBMSs, it's only an option at install time (for example, with Oracle it's an init.ora parameter). Therefore we do no more than mention the possibility.

Both of these implications suggest that making rows smaller is a good thing. But there is a limit to all good things, and the limit is 512 rows per page (a housekeeping matter, having to do with the way some DBMSs store page header information in a fixed space). To sum it up, your aim should be to have at least 10 rows per page, but fewer than 512 rows.
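The space arithmetic behind these rules can be checked in a few lines. This sketch assumes, as above, a 2KB page with roughly 100 bytes of housekeeping overhead and fixed-size rows that never span pages:

```python
def page_usage(page_size, overhead, row_size):
    """Return (rows_per_page, wasted_bytes) for fixed-size rows,
    assuming rows never overflow onto another page."""
    usable = page_size - overhead
    rows = usable // row_size
    return rows, usable - rows * row_size

# 2KB page, ~100 bytes of housekeeping, as in the text:
for row_size in (1000, 500, 250):
    rows, waste = page_usage(2048, 98, row_size)
    print(f"{row_size}-byte rows: {rows} per page, {waste} bytes wasted")
```

Halving the row size each time (1,000 to 500 to 250 bytes) cuts the per-page waste from 950 bytes to 450 to 200, matching the text's figures.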

LOB Pages

An exceptional situation arises when a column is defined as a BLOB or another LOB data type such as TEXT, IMAGE, or CLOB. Recall (from Chapter 7, "Columns") that BLOB columns are rarely on the same page as the other columns of a table—they get pages of their own. With many DBMSs, BLOBs are even automatically in a file of their own—the best method.

With most DBMSs (MySQL is an exception), BLOB pages can't be shared. You can't store two BLOBs on one page, and you can't store BLOB and non-BLOB columns on one page. Therefore any BLOB that isn't NULL will involve a lot of overhead space: the pointer from the main page (about 10 bytes), the page header (about 80 bytes per BLOB page), and the unused space at the end of the last BLOB page (a few thousand bytes on average).
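Those overhead figures can be turned into a rough estimator. The sketch below uses the text's approximate numbers (10-byte pointer, 80-byte page header, unshared BLOB pages); real DBMSs differ in the details:

```python
import math

def blob_overhead(blob_size, page_size=8192, header=80, pointer=10):
    """Rough overhead estimate for one non-NULL BLOB, assuming BLOB
    pages can't be shared (figures are the text's rough averages)."""
    usable = page_size - header
    pages = math.ceil(blob_size / usable)
    slack = pages * usable - blob_size      # unused space on the last page
    return pages, pointer + pages * header + slack

pages, overhead = blob_overhead(10000)
print(pages, overhead)  # 2 pages; 6394 bytes of overhead for a 10,000-byte BLOB
```

Note that the slack at the end of the last page dominates, which is why the text warns that the unused space averages a few thousand bytes.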

If a table has LOBs in it, you should avoid table scans, and you should avoid SELECT * ... statements.


Extents

An extent is a group of contiguous pages. Extents exist to solve the allocation problem. The allocation problem is that, when a file gets full, the DBMS must increase its size. If the file size increases by only one page at a time, waste occurs because:

  • The operating system must update the file allocation tables. The amount of updating is about the same whether the addition is one page or eight pages.

  • If file A grows, then file B grows, then file A grows again, and so on; the operating system will have to maintain a succession of short (one page) chains: ABABABAB. This is a type of fragmentation. As we've hinted before, fragmentation is bad; we'll discuss it later in this chapter.

  • The DBMS must update the data dictionary every time a file grows.

Suppose, though, that the DBMS adds eight pages when the file gets full, instead of only one page. That solves the allocation problem. Call that eight-page amount an extent, and you now have a definition of what an extent is, in its primary meaning. Notice that in the primary meaning, extents are units of allocation, not units of I/O as pages are.
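The saving is easy to quantify: growing by extents divides the number of file-growth operations (and the attendant file-allocation-table and data-dictionary updates) by the extent size. A minimal sketch:

```python
import math

def allocations(total_pages, extent_pages):
    """How many times the file must grow to reach total_pages,
    growing extent_pages at a time."""
    return math.ceil(total_pages / extent_pages)

# Growing a 160-page file one page at a time vs. eight-page extents:
print(allocations(160, 1))   # 160 file-growth operations
print(allocations(160, 8))   # 20 file-growth operations
```

The eight-page extent also keeps each growth step contiguous, which reduces the ABABABAB fragmentation described above.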

Now suppose that, in the CREATE TABLE statement, you were able to define two things: (a) an initial extent size (how much to allocate during CREATE) and (b) a "next" extent size (how much to allocate each subsequent time when the file gets full). Well, you can. Here's an example using Informix's nonstandard SQL extension syntax:


CREATE TABLE Table1 (
   column1 INTEGER,
   column2 VARCHAR(15),
   column3 FLOAT)
   EXTENT SIZE 20
   NEXT SIZE 16;


Depending on the DBMS, you can usually define the initial and next extent sizes, but most people make no specification and just use default values. Typical default values are: 16x4KB pages (IBM), 8x8KB pages (Microsoft)—or 1MB for certain tablespaces (Oracle). Clearly Oracle is different. Oracle believes that it's a good idea to preallocate a large amount of file space. This makes the allocation problem a rare phenomenon.

To see extents in action, we used Ingres to create a table with ten rows per page and 16 pages per extent. Then we timed INSERT speed. We found that INSERT speed was consistent except for a small hiccup (7% slower) on every 10th row (i.e., at each page end) and a slightly larger hiccup (20% slower) on every 160th row (i.e., at each extent end). It is possible to predict an average INSERT time, but the worst case is much worse than the average. All because of extents.
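The measurement can be modeled to show why the worst case diverges from the average. The relative costs below come from the observed slowdowns (7% at page ends, 20% at extent ends); the absolute numbers are illustrative only:

```python
def insert_cost(row_number):
    """Relative INSERT cost: 20% slower at each extent end (every 160th
    row), 7% slower at each page end (every 10th row), 1.0 otherwise."""
    if row_number % 160 == 0:
        return 1.20
    if row_number % 10 == 0:
        return 1.07
    return 1.00

costs = [insert_cost(i) for i in range(1, 161)]
average = sum(costs) / len(costs)
worst = max(costs)
print(round(average, 4), worst)  # the average stays near 1.0; the worst case is 20% slower
```

The hiccups are rare enough that they barely move the average, which is exactly why an average-based prediction understates the worst case.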

Read groups

A read group is a group of contiguous pages. For many DBMSs, a read group is the same thing as an extent, but there's a logical difference. A read group is the number of pages that are read together, while an extent is the number of pages that are allocated together. Figure shows the usual name and default size of a read group for the Big Eight.

Consider IBM. Figure tells us that IBM's default page size is 4KB, and the default read group size is 16 pages. This SQL statement forces IBM to perform a table scan:

SELECT * FROM Table1
  WHERE column1 LIKE '%XX%'

DBMSs and Read Groups

           Vendor's Term   Usual or Recommended Size
IBM        Extent          16 x 4KB pages
Informix   Extent          16 x 4KB pages
Ingres     Read Group      8 x 2KB pages
InterBase  None            4KB
Microsoft  Extent          8 x 8KB pages
MySQL      Read Group      16 x 1KB pages
Oracle     Extent          8 x 8KB pages
Sybase     Extent          8 x 2KB pages

When executing this SELECT, IBM reads 16 pages at a time into the buffer. This is faster than reading 16 pages, one at a time. It's the right thing to do because all pages must be examined in order to resolve the query. It would be wrong to keep all the pages in cache, though, because pages read during a table scan are unlikely to be needed again soon.

So a read group is a unit of I/O, but not the minimal unit—the page is the minimal unit, and the page is also the administrative unit for caches. It comes down to this: If most of your SELECTs involve table scans, then you want the read group size to be double the default amount. If the only way to accomplish this is to double the extent size, so be it. Increasing the read group size will also mean that more space is needed in memory for the buffer pool.
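The effect of read group size on a table scan is simple arithmetic: the number of multi-page reads is the table size divided by the read group size, rounded up. A sketch:

```python
import math

def scan_reads(table_pages, read_group_pages):
    """Number of multi-page read operations needed to scan a table."""
    return math.ceil(table_pages / read_group_pages)

# A 160-page table scanned with different read group sizes:
print(scan_reads(160, 8))    # 20 reads
print(scan_reads(160, 16))   # 10 reads: doubling the read group halves them
```

The trade-off, as noted above, is that each larger read needs correspondingly more buffer pool memory.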


If the extent size is not the same as the read group size, then it should be an exact multiple of the read group size.


Files

A file is a group of contiguous extents. And that's about it.

Surprisingly, a file is not a physical representation of a table. It could be, but usually it isn't because of one of the following:

  • Most DBMSs allow mixing of extents. That is, the first extent in a file can contain pages for Table1, and the second extent in the same file can contain pages for Table2. This is true even for those DBMSs that do not allow mixing within a page.

How Important Is It to Be in Cache?

Consider the famous Oracle cost-based optimizer I/O formula:

Cost = (1 x Disk I/O) + (0.001 x CPU I/O) + (1.5 x Net I/O)

In this formula, "Disk I/O" means "access to the disk drive" and has a weight of one. "CPU I/O" means "access to a cache copy" and has a weight of 0.001. "Net I/O" means "access over the network" and has a weight of 1.5. In Chapter 17, "Cost-Based Optimizers," we try to figure out what this formula is good for. Right now, we only need to know that Oracle considers a disk access to be 1,000 times more expensive than a cached access.

  • A few DBMSs, as an option, don't bother with files at all. Usually this is a Unix option. The idea is that by performing I/O on the operating system's "nonblock raw device" one can avoid the overhead of the file system's features, such as caching. (After all, if the DBMS does its own caching well, the operating system's cache is redundant.) Many people are unaware that Windows NT allows raw I/O without files; Sybase tested the option once and found that raw I/O is about 2% faster than I/O via files.

  • Files can be split across more than one drive, either because the file's simply too big or to enhance the advantage of partitioning. (Partitioning is the process of splitting a database object—usually a tablespace, table, or index—into two or more physical locations, or partitions; see the following section.)


Partitions

A partition is a group of contiguous extents. Often a partition is a file, but it doesn't have to be.[3] Suppose you have four extents, numbered {1, 2, 3, 4}. Figure shows what a two-partition system could look like.

[3] Informix calls a partition a fragment. We think that is a horrible choice of words, and we will never use the word "fragmentation" when we mean "partitioning."

A two-partition system


Partitions are bad if there are few extents and few database users. Why? Because the fundamental principle is that rows should be crammed together. That helps caching and reduces the number of page writes.

Partitions are good if—and only if—there are many extents and many database users. Why? Two reasons, actually. The first is that in a multiuser environment, partitioning reduces contention. The second is that in a multidisk-drive environment, partitioning increases parallelism.

How does partitioning reduce contention? Suppose that User AMY and User BOB are both doing an INSERT. User AMY arrives first and locks the final page of the final extent. (In practice the DBMS starts by looking for free space within the file, but this example is realistic enough.) Now User BOB arrives and—here's the trick—locks the final page of the final extent in the other partition. BOB is not locked out because AMY and BOB do not have to contend for the same end page.

How does partitioning increase parallelism? Suppose that the partitions are on different independent disk drives (Disk #1 and Disk #2) and that the DBMS must read a page in Extent #1 (stored on Disk #1) and also a page in Extent #2 (stored on Disk #2). In this case, the DBMS can issue simultaneous "seek and read" commands for the two pages. Because the drives work independently, the time to perform both commands is the same as the time to perform only one command. In fact you can realize some of partitioning's advantages with any multithreaded operating system even if the partitions are areas within a single file on a single disk drive, but if you have the hardware to do it, you should keep partitions as separated as they can be.

Balanced partitions are best. If there is a 50/50 chance that a user will want Partition #1 instead of Partition #2, and there is a 50/50 chance that a desirable page is in Partition #1 instead of Partition #2, then there's balance. Obviously there would be more contention if User AMY and User BOB were more likely to go for the same partition, or if Page #1 and Page #2 were more likely to be on the same partition.

In other words, it is wrong to say—Let's separate the much-used records from the little-used records. That's fine if the little-used records are being shuffled off to a slow or remote storage medium, but we're talking about two storage areas on the same disk, or two disk drives of equal quality. Balanced partitions are best.

IBM, Informix, Ingres, Microsoft, Oracle, and Sybase support partitioning with non-standard extensions to the CREATE TABLE or ALTER TABLE statements (InterBase and MySQL don't support partitioning). You should be able to specify under what conditions an INSERT will head for Partition #1 rather than Partition #2. For example, with Informix's non-standard SQL-extension syntax, you can specify conditions like:


   FRAGMENT BY EXPRESSION
      branch_number < 7 IN Partition1,
      transaction_time < TIME '11:00:00' IN Partition2

This attempt to use logic is vain, because there is no guarantee that the conditions are changeless. If they were, you would probably use separate tables instead of separate partitions. A simple hash algorithm that ensures random or round-robin distribution of rows is all you really need.
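A round-robin scheme of the kind recommended here is trivial to sketch. With a monotonically increasing row identifier, simple modulo arithmetic yields perfectly balanced partitions (hashing the key works similarly for non-sequential keys):

```python
def round_robin_partition(row_id, partitions=2):
    """Round-robin: consecutive rows alternate between partitions."""
    return row_id % partitions

# Distribute 1,000 consecutive row ids across two partitions:
counts = [0, 0]
for row_id in range(1000):
    counts[round_robin_partition(row_id)] += 1

print(counts)  # [500, 500] -- balanced regardless of what the data means
```

Unlike a condition on branch numbers or timestamps, this assignment stays balanced no matter how the data's meaning changes over time.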

Partitioning is normal in all big shops except data warehouses. Partitioning is unpopular for data warehouses because:

  • Reducing locks is unimportant when transactions are read-only.

  • Reports are sorted, but partition algorithms can destroy order.

  • Sequential processing is faster if all rows can be accessed with one scan.

Naturally, even data warehouses do some distribution in order to encourage parallel queries (putting joined tables and their indexes on different drives, say). Real partitioning, though, is for OLTP. People over-generalize and say that sequential queriers put similars together while ad-hoc queriers distribute them randomly, but that's an aide-memoire rather than a true observation.

Manual Partitioning

Even if your DBMS doesn't support partitioning, you can simulate it by splitting your table into multiple tables with duplicate row definitions. Typical criteria are geographic (e.g., "Branch #1 in this table and other branches in that table") or temporal (e.g., "1999 transactions in this table and later transactions in that table").

This sort of thing works as long as the primary key contains a code that tells you what table to look in, and you're willing to UNION searches with conditions that span the partitioning criteria. The advantage is that you can do most of your work with smaller tables. The disadvantage is that rows might have to move from one table to another when a value changes. Eventually such systems fall apart and are replaced by automatically partitioned systems.
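Manual partitioning can be demonstrated with any SQL engine. The sketch below uses Python's built-in sqlite3 purely for illustration; the table and column names are invented, and the UNION ALL shows the cross-partition search described above:

```python
import sqlite3

# Two tables with duplicate row definitions act as manual "partitions,"
# split temporally: 1999 transactions in one, later ones in the other.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Transactions_1999 (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE Transactions_2000 (id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO Transactions_1999 VALUES (1, 10.0), (2, 20.0);
    INSERT INTO Transactions_2000 VALUES (3, 30.0);
""")

# A search whose condition spans the partitioning criterion must
# UNION the per-partition queries:
rows = sorted(conn.execute("""
    SELECT id, amount FROM Transactions_1999 WHERE amount > 15
    UNION ALL
    SELECT id, amount FROM Transactions_2000 WHERE amount > 15
""").fetchall())
print(rows)  # [(2, 20.0), (3, 30.0)]
```

Queries confined to one year touch only the smaller table; queries that span years pay the UNION cost, which is the trade-off the text describes.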


Tablespaces

A tablespace (also called a dbspace by some DBMSs, e.g., Informix) is a file, or a group of files, that contains data. For example, this non-standard Oracle SQL-extension statement creates and associates a tablespace with a 10MB file:


CREATE TABLESPACE Tbspace1
  DATAFILE '\disk01\tb.dbs' SIZE 10M;

Here's another example, using IBM's syntax:



CREATE TABLESPACE Tbspace1
  MANAGED BY SYSTEM
  USING ('d:\data1')

This IBM tablespace has no preset size limitation. It can grow up to the size allotted for the file named d:\data1.

A tablespace can contain a single table, or it can be shared either between one table and its indexes or between multiple tables. In other words, it's possible to mix extents from different objects in the same tablespace. Mixing brings some advantages for administration but has no great performance effect.

The typical reasons for dividing databases into tablespaces are:

  • So that different users will have access to different physical areas

  • So that users' data will be separate from system data

  • So that extents with different sizes will be separate

  • So that partition schemes can be tailored to table or application

  • So that backups or maintenance can be done on only part of a database

  • So that database backups and restores can be run in parallel

The Bottom Line: Storage Hierarchy

A page is a fixed-size hopper that stores rows of data. A page is a minimal unit for disk I/O, a unit for locking, a unit for caching, and a unit for administration.

All pages in a file have the same size. The choice of page sizes is restricted to certain multiples of 1024, between 1KB and 64KB. The optimum page size is related to the disk drive's cluster size; 8KB is the current usual page size.

Pages contain an integral number of rows. All rows in a page are rows for the same table.

When you ask for rows, the DBMS gets pages.

Big rows waste space. Keep row size small—but not too small. Aim for a minimum of 10 rows and a maximum of 511 rows per page.

When you access two rows on the same page, the second access is free—that is, access to Row #1 is free if the page is still in the buffer pool since you accessed it for Row #2. Use this fact to help performance: do two INSERTs together, fetch in the DBMS's default order, make rows smaller or pages larger.

LOB columns are rarely on the same page as the other columns of a table—they get pages of their own. If a table has LOBs in it, avoid table scans and SELECT * ... statements.

An extent is a group of contiguous pages that are allocated together.

A read group is a group of contiguous pages that are read together.

If the extent size is not the same as the read group size, then it should be an exact multiple of the read group size.

A file is a group of contiguous extents.

A partition is also a group of contiguous extents. Often a partition is a file, but it doesn't have to be.

Partitions are bad if there are few extents and few database users. Partitions are good if there are many extents and many database users because partitioning reduces contention and increases parallelism.

If you have the hardware to do it, keep partitions physically separated.

Don't make assumptions that are only true for not-yet-partitioned databases. For example, ignore the observation that some tables have rows in order by date of insertion.

Start partitioning before you have 100 users or one million rows or two disk drives. Don't worry about tables that are small or are used by only a small group of users. Strive for balance. Balanced partitions are best.

A tablespace is a group of files.

It's possible to mix extents from different objects in the same tablespace. Mixing brings some advantages for administration, but has no great performance effect.
