Extract Lots of Rows






Extract Lots of Rows

You want to run a query on the database. The query returns at least 100,000 rows, and you know that you really don't want all of them. Try as you might, you can't cook up a filter statement to reduce the row count!

In some cases, you can't employ filtering [Hack #8] to reduce your database server stress and you need to slurp in an amazing number of rows.

What if you are looking through your database for IP addresses that are also keys to a hash table called %IP? In theory, you could extract those keys from the hash table and code them into the WHERE clause of your SELECT statement. The result would be a huge SQL statement with thousands of OR statements. It is possible that the query would be too large to parse, and even if it was possible, the OR statement approach would have poor performance in comparison to having the IP information in an index.

To make things even more complicated, suppose that you have read-only access to the database for security reasons (so that you won't be able to create a temporary table that contains all the IP addresses).


In this hack, suppose you have the table weblog with 100,000 rows, and the program contains IP, a hash table with 1,000 entries.

The examples shown here are against an Oracle database from Perl on Linux, but you could adapt this to any combination of database and programming language. The database server is on the same machine as the Perl program. Here's a program that grabs all the rows from weblog and checks them until it finds an IP address that's also in the hash table:

my $sql = "SELECT ip from weblog";
my $sth = $dbh->prepare($sql);
my $rsh = $sth->execute(  );
my $match;
while ( $sth->fetchrow_arrayref(  )) {
  if (exists $IP{$_->[0]}) {
    # found, so end query
    $match = $_->[0];
    $rsh->finish;
    last;
  }
}

This takes about 2.5 seconds in our test environment. The test environment is a worst-case data set, where none of the hash values is in the data set! The implementation of how each row is retrieved (fetchrow in Perl's DBI) is driver dependent. It might get just one row at a time or 1,000 at a time and cache them until they're needed.

If the driver gets only one row at a time, this query will spend too much time communicating with the database. If it gets thousands of rows at a time it might get data that will never be used.

Use a Big Buffer

There is no easy way to balance this, but in such circumstances, it is likely that seeking bigger blocks and caching them in memory until needed is probably going to give the least number of round-trip times, and the highest throughput. It will use up more memory, but you can always buy more. You do need to minimize data copying, so you should use memory references rather than copies, and you should try to get as close to the buffered data as possible. The DBD online manual (http://search.cpan.org/~timb/DBI/DBI.pm) suggests something such as the following:

my $sql = "SELECT ip from weblog";
my $sth = $dbh->prepare($sql);
my $rsh = $sth->execute(  );
my $cache = [];
my $match;
while(  shift(@$cache) || # Use cache if it has >0 rows, otherwise query db
        shift(@{$cache=$sth->fetchall_arrayref(undef,10_000)||[]}) ) {
  if (exists $IP{$_->[0]}) {
    # found, so end query
    $match = $_->[0];
    $rsh->finish;
    last;
  }
}
$cache = undef;

This takes less than 1.5 seconds in our test environment.

This example asks for 10,000 rows at a time, and hopefully you have enough memory to hold that much! You can, of course, tune this number to your particular data set. You will be working very close to the limits of the database drivers, but because this is recommended practice, it seems reasonable to assume that as drivers develop they will be trying to improve performance with this approach in mind. Note that you can, if you want, actually specify the starting row and the number of rows from that point which you want your query to return (see "Extract a Subset of the Results" [Hack #72]).

When using this technique, you may find that memory utilization is too excessive, so the next best way is to use variable binding and a normal loop with fetch.

Use Variable Binding

In order to obtain good performance with the following approach, you will be placing your trust in the driver's caching strategy:

my $sql = "SELECT ip from weblog";
my $sth = $dbh->prepare($sql);
$sth->execute(  );
my $match;
my $col;
$sth->bind_columns(\$col);
while ( $sth->fetch) {
  if (exists $IP{$col}) {
    # found, so end query
    $match = $col;
    $sth->finish;
    last;
  }
}

In our test environment, this runs in 1.9 seconds.

Make a Series of Round Trips

The preceding examples offer the best performance if you really insist on downloading the entire table with a simple SELECT and performing all the searching in the programming language. But in certain cases, where the number of different search conditions is small enough and the size of the table being queried is large enough, it may be best just to run the query thousands of times (one for each search condition) using placeholders [Hack #8] (which give you some optimization for free); in the example used in this hack, the following program turns out to be the fastest. It is also the most memory efficient:

my $sql = "SELECT 1 from weblog where ip = ?";
my $sth = $dbh->prepare($sql);
my $match;

foreach my $ip (keys %IP) {
  $sth->execute($ip);
  if ($sth->fetch) {
    $match = $ip;
    $sth->finish;
    last;
  }
  $sth->finish;
}

In our test environment, this runs in 0.6 seconds!

You might want to page through a result set, and perhaps return the first 10 rows and then the next 10 rows, and so on. You can use the LIMIT and OFFSET constructs to achieve this (see "Extract a Subset of the Results" [Hack #72]).

No matter what programming language, database system, or interface libraries you are using in your applications, you should take the time to run queries in a variety of querying styles. This will allow you to maximize your application's performance. If you don't, the advantages gained from writing efficient queries may be offset by poor library interface routine selection.

The difference between what is stored in the database and what you actually want in your result set also is important. If your queries tend to want huge chunks of your data, you will need a different approach than queries, which produce only a few rows. Almost always you will want to aim for queries that produce small result sets.



 Python   SQL   Java   php   Perl 
 game development   web development   internet   *nix   graphics   hardware 
 telecommunications   C++ 
 Flash   Active Directory   Windows