Feb. 6, 2011, 6:16 a.m.
posted by telumel
How you restrict your result set is one of the most critical factors in determining which tactics to apply when writing an SQL statement. The collective criteria that filter the data are often seen as a motley assortment of conditions assembled in the where clause. However, you should very closely examine the various where-clause (and having-clause, too) conditions when writing SQL code.
Meaning of Filtering Conditions
Given the syntax of the SQL language, it is quite natural to consider that all filtering conditions, as expressed in the where clause, are similar in nature. This is absolutely not the case. Some filtering conditions apply directly to the select operator of relational theory, checking that a column in a row (purists would say an attribute in a relation variable) matches, or doesn't match, a given condition. However, historically the where clause also contains conditions that implement another operator: the join operator. Since the advent of the SQL92 join syntax, there has been an attempt to differentiate join filtering conditions, located between the (main) from clause and the where clause, from the select filtering conditions listed in the where clause. Joining two (or more) tables logically creates a new relation.
Consider this general example of a join:
select ..... from t1 inner join t2 on t1.join1 = t2.join2 where ...
Should a condition on column c2 belonging to t2 come as an additional condition on the inner join, expressing that in fact you join on a subset of t2? Or should a condition inside the where clause, along with conditions on columns of t1, express that the filtering applies to the result of joining t1 to t2? Where you choose to place your join condition ought not to make much of a difference; however, it has been known to lead to variations in performance with some optimizers.
We may also have conditions other than joins and the simple filtering of values. For instance, we may have conditions restricting the returned set of rows to some subtype; we may also have conditions that are just required to check the existence of something inside another table. All these conditions are not necessarily semantically identical, although the SQL syntax makes all of them look equivalent. In some cases, the order of evaluation of the conditions is of no consequence; in other cases, it is significant.
Here's an example that you can actually find in more than one commercial software package to illustrate the importance of the order of the evaluation of conditions. Suppose that we have a parameters table, which holds: parameter_name, parameter_type, and parameter_value, with parameter_value being the string representation of whatever type of parameter we have, as defined by the attribute parameter_type. (To the logical mind this is indeed a story of more woe than that of Juliet and her Romeo, since the domain type of attribute parameter_value is a variable feast and thus offends a primary rule of relational theory.) Say that we issue a query such as:
select * from parameters where parameter_name like '%size' and parameter_type = 'NUMBER'
With this query, it does not matter whether the first condition is evaluated before or after the second one. However, if we add the following condition, where int( ) is a function to convert from char to integer value, then the order of evaluation becomes very significant:
and int(parameter_value) > 1000
Now, the condition on parameter_type must be evaluated before the condition on the value, because otherwise we risk a run-time error consequent upon attempting to convert a character string (if, for example, parameter_type for that row is defined as char) to an integer. The optimizer may not be able to figure out that the poor design demands that one condition have higher priority, and you may have trouble specifying it to the database.
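The standard way to force the priority is a case expression, which is guaranteed to evaluate only the matching branch. Here is a runnable sketch using SQLite; the rows are invented, and to_int is a user-defined function standing in for a strict char-to-integer conversion (SQLite's own cast is too forgiving to show the failure):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# to_int mimics a strict conversion: it raises an error on non-numeric input
conn.create_function("to_int", 1, int)

conn.execute("create table parameters"
             " (parameter_name text, parameter_type text, parameter_value text)")
conn.executemany("insert into parameters values (?, ?, ?)",
                 [("cache_size", "NUMBER", "4096"),
                  ("log_size",   "CHAR",   "unlimited"),  # breaks a blind conversion
                  ("pool_size",  "NUMBER", "512")])

# case only attempts the conversion on NUMBER rows, whatever order the
# optimizer chooses for the other predicates
rows = conn.execute("""
    select parameter_name
    from parameters
    where parameter_name like '%size'
      and case when parameter_type = 'NUMBER'
               then to_int(parameter_value) > 1000
          end
""").fetchall()
print(rows)   # [('cache_size',)]

# an unguarded conversion, by contrast, fails on the 'unlimited' row
try:
    conn.execute("select parameter_name from parameters"
                 " where to_int(parameter_value) > 1000").fetchall()
except sqlite3.OperationalError as exc:
    print("unguarded query failed:", exc)
```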
All search criteria are not equal; some are more equal than others.
Evaluation of Filtering Conditions
Be aware, however, that some data (principally data used for joining tables) may be stored redundantly in several tables. A requirement to return values known to be held in the primary key of a given table doesn't necessarily mean that this table must appear in the from clause, since this primary key may well appear as the foreign key of another table from which we also need the data.
Even before writing a query, we should rank the filtering conditions. The really efficient ones (of which there may be several, and which may apply to different tables) will drive the query, and the inefficient ones will come as icing on the cake. What is the criterion that defines an efficient filter? Primarily, one that allows us to cut down the volume of the data we have to deal with as fast as possible. And here we must pay a lot of attention to the way we write; the following subsections work through a simple example to illustrate my point.
Buyers of Batmobiles
Assume that we have four tables, namely customers, orders, orderdetail, and a table of articles, as shown in the figure below. Please note that in the figure the sizes of the boxes representing each table are more or less proportional to the volume of data in each table, not simply to the number of columns. Primary key columns are underlined.
A classical order schema
Let's now suppose that our SQL problem is to find the names of all the customers living in the city named "Gotham" who have ordered the article called "Batmobile" during the last six months. We have, of course, several ways to formulate this query; the following is probably what an ANSI SQL fan would write:
select distinct c.custname
from customers c
     join orders o on o.custid = c.custid
     join orderdetail od on od.ordid = o.ordid
     join articles a on a.artid = od.artid
where c.city = 'GOTHAM'
  and a.artname = 'BATMOBILE'
  and o.ordered >= somefunc
somefunc is supposed to be a function that returns the date six months prior to the current date. Notice too, the presence of distinct, which may be required if one of our customers is an especially heavy consumer of Batmobiles and has recently ordered several of them.
Let's forget for a while that the optimizer may rewrite the query, and look at the execution plan such a statement suggests. First, we walk the customers table, keeping only rows for which the city happens to be Gotham. Then we search the orders table, which means that the custid column there had better be indexed, because otherwise the only hope the SQL engine has of executing the query reasonably fast is to perform some sorting and merging or to scan the orders table to build a hash table and then operate against that. We are going to apply another filter at this level, against the order date. A clever optimizer will not mind finding the filtering condition in the where clause and will understand that in order to minimize the amount of data to join it must filter on the date before performing the join. A not so clever optimizer might be tempted to join first, and then filter, and may therefore be grateful to you for specifying the filtering condition with the join condition, as follows:
join orders o on o.custid = c.custid and o.ordered >= somefunc
Even if the filtering condition really has nothing to do with the join, it is sometimes difficult for the optimizer to understand when that is the case. If the primary key of orderdetail is defined as (ordid, artid) then, because ordid is the first attribute of the index, we can make use of that index to find the rows associated with an order as in Chapter 3. But if the primary key happens to be (artid, ordid) (and note, either version is exactly the same as far as relational theory is concerned), then tough luck. Some products may be able to make some use of the index in that case, but it will not provide the efficient access that (ordid, artid) would have allowed. Other products will be totally unable to use the index. The only circumstance that may save us is the existence of a separate index on ordid.
Once we have linked orderdetail to orders, we can proceed to articles without any problem this time, since we found artid, the primary key, in orderdetail. Finally, we can check whether the value in articles is or is not a Batmobile. Is this the end of the story? Not quite. As instructed by distinct, we must now sort the resulting set of customer names that have passed across all the filtering layers so as to eliminate duplicates.
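To make the discussion concrete, here is a runnable sketch of the query against a toy instance of the schema, using SQLite; the rows are invented, and the literal date '2010-08-06' stands in for somefunc:

```python
import sqlite3

# Toy instance of the order schema; column names follow the text, rows are invented
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customers  (custid integer primary key, custname text, city text);
create table orders     (ordid integer primary key, custid integer, ordered text);
create table orderdetail(ordid integer, artid integer, primary key (ordid, artid));
create table articles   (artid integer primary key, artname text);

insert into customers   values (1, 'WAYNE', 'GOTHAM'), (2, 'KENT', 'METROPOLIS');
insert into articles    values (10, 'BATMOBILE'), (11, 'BATARANG');
insert into orders      values (100, 1, '2011-01-10'),   -- Wayne buys twice
                               (101, 1, '2011-01-20'),
                               (102, 2, '2011-01-15');
insert into orderdetail values (100, 10), (101, 10), (102, 11);
""")

query = """
select {d} c.custname
from customers c
     join orders o on o.custid = c.custid
     join orderdetail od on od.ordid = o.ordid
     join articles a on a.artid = od.artid
where c.city = 'GOTHAM'
  and a.artname = 'BATMOBILE'
  and o.ordered >= '2010-08-06'
"""
with_distinct    = conn.execute(query.format(d="distinct")).fetchall()
without_distinct = conn.execute(query.format(d="")).fetchall()
print(with_distinct)           # [('WAYNE',)]
print(len(without_distinct))   # 2: one row per Batmobile order
```

Without distinct, the heavy Batmobile consumer appears once per qualifying order; distinct collapses the duplicates, at the cost of a sort.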
The traditional way to write the same query, which predates the SQL92 join syntax, is the following:

select distinct c.custname
from customers c,
     orders o,
     orderdetail od,
     articles a
where c.city = 'GOTHAM'
  and c.custid = o.custid
  and o.ordid = od.ordid
  and od.artid = a.artid
  and a.artname = 'BATMOBILE'
  and o.ordered >= somefunc
It may just be old habits dying hard, but I prefer this older way, if only for one simple reason: it makes it slightly more obvious that from a logical point of view the order in which we process data is arbitrary, because the same data will be returned irrespective of the order in which we inspect tables. Certainly the customers table is particularly important, since that is the source from which we obtain the data that is ultimately required, while in this very specific context, all the other tables are used purely to support the remaining selection processes. One really has to understand that there is no one recipe that works for all cases. The pattern of table joins will vary for each situation you encounter. The deciding factor is the nature of the data you are dealing with.
A given approach in SQL may solve one problem, but make another situation worse. The way queries are written is a bit like a drug that may heal one patient but kill another.
More Batmobile purchases
Let's explore alternative ways to list our buyers of Batmobiles. In my view, as a general rule, distinct at the top level should be avoided whenever possible. The reason is that if we have overlooked a join condition, a distinct will hide the problem. Admittedly this is a greater risk when building queries with the older syntax, but nevertheless still a risk when using the ANSI/SQL92 syntax if tables are joined through several columns. It is usually much easier to spot duplicate rows than it is to identify incorrect data.
It's easy to give a proof of the assertion that incorrect results may be difficult to spot: the two previous queries that use distinct to return the names of the customers may actually return a wrong result. If we happen to have several customers named "Wayne," we won't get that information because distinct will not only remove duplicates resulting from multiple orders by the same customer, but also remove duplicates resulting from homonyms. In fact, we should return both the unique customer id and the customer name to be certain that we have the full list of Batmobile buyers. We can only guess at how long it might take to identify such an issue in production.
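The homonym problem is easy to reproduce. In this sketch (SQLite, invented rows), two distinct customers share the name WAYNE; selecting the distinct name alone silently loses one of them:

```python
import sqlite3

# Two distinct Gotham customers who happen to share the name WAYNE
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customers  (custid integer primary key, custname text, city text);
create table orders     (ordid integer primary key, custid integer, ordered text);
create table orderdetail(ordid integer, artid integer, primary key (ordid, artid));
create table articles   (artid integer primary key, artname text);

insert into customers   values (1, 'WAYNE', 'GOTHAM'), (2, 'WAYNE', 'GOTHAM');
insert into articles    values (10, 'BATMOBILE');
insert into orders      values (100, 1, '2011-01-10'), (101, 2, '2011-01-20');
insert into orderdetail values (100, 10), (101, 10);
""")

base = """
from customers c
     join orders o on o.custid = c.custid
     join orderdetail od on od.ordid = o.ordid
     join articles a on a.artid = od.artid
where c.city = 'GOTHAM'
  and a.artname = 'BATMOBILE'
  and o.ordered >= '2010-08-06'
"""
names  = conn.execute("select distinct c.custname " + base).fetchall()
buyers = conn.execute("select distinct c.custid, c.custname " + base).fetchall()
print(names)           # [('WAYNE',)]  -- one buyer silently lost
print(sorted(buyers))  # [(1, 'WAYNE'), (2, 'WAYNE')]
```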
How can we get rid of distinct then? By acknowledging that we are looking for customers in Gotham that satisfy an existence test , namely a purchase order for a Batmobile in the past six months. Note that most, but not all, SQL dialects support the following syntax:
select c.custname
from customers c
where c.city = 'GOTHAM'
  and exists (select null
              from orders o,
                   orderdetail od,
                   articles a
              where a.artname = 'BATMOBILE'
                and a.artid = od.artid
                and od.ordid = o.ordid
                and o.custid = c.custid
                and o.ordered >= somefunc)
If we use an existence test such as this query uses, a name may appear more than once if it is common to several customers, but each individual customer will appear only once, irrespective of the number of orders they placed. You might think that my criticism of the ANSI SQL syntax was a little harsh, since the customers table figures as prominently as before, if not more prominently. However, it now features as the source for the data we want the query to return, and another query, nested this time, appears as a major phase in the identification of the subset of customers.
The inner query in the preceding example is strongly linked to the outer select. The condition o.custid = c.custid shows that the inner query refers to the current row of the outer query. Thus, the inner query is what is called a correlated subquery. The snag with this type of subquery is that we cannot execute it before we know the current customer. Once again, we are assuming that the optimizer doesn't rewrite the query. Therefore we must first find each customer and then check for each one whether the existence test is satisfied. Our query as a whole may perform excellently if we have very few customers in Gotham. It may be dreadful if Gotham is the place where most of our customers are located (a case in which a sophisticated optimizer might well try to execute the query in a different way).
We have still another way to write our query, which is as follows:
select custname
from customers
where city = 'GOTHAM'
  and custid in (select o.custid
                 from orders o,
                      orderdetail od,
                      articles a
                 where a.artname = 'BATMOBILE'
                   and a.artid = od.artid
                   and od.ordid = o.ordid
                   and o.ordered >= somefunc)
In this case, the inner query no longer depends on the outer query: it has become an uncorrelated subquery. It needs to be executed only once. It should be obvious that we have now reversed the flow of execution. In the previous case, we had to search first for customers in the right location (e.g., where city is Gotham), and then check each order in turn. In this latest version of the query, the identifiers of customers who have ordered what we are looking for are obtained via a join that takes place in the inner query.
If you have a closer look, however, there are more subtle differences as well between the current and preceding examples. In the case of the correlated subquery, it is of paramount importance to have the orders table indexed on custid; in the second case, it no longer matters, since then the index (if any) that will be used is the index associated with the primary key of customers.
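Functionally, the two forms are interchangeable, as this sketch confirms (SQLite, invented rows, '2010-08-06' standing in for somefunc); each customer comes back exactly once despite placing two qualifying orders:

```python
import sqlite3

# Toy instance of the order schema, rebuilt so the sketch is self-contained
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customers  (custid integer primary key, custname text, city text);
create table orders     (ordid integer primary key, custid integer, ordered text);
create table orderdetail(ordid integer, artid integer, primary key (ordid, artid));
create table articles   (artid integer primary key, artname text);
insert into customers   values (1, 'WAYNE', 'GOTHAM'), (2, 'KENT', 'METROPOLIS');
insert into articles    values (10, 'BATMOBILE'), (11, 'BATARANG');
insert into orders      values (100, 1, '2011-01-10'), (101, 1, '2011-01-20'),
                               (102, 2, '2011-01-15');
insert into orderdetail values (100, 10), (101, 10), (102, 11);
""")

# correlated existence test
exists_rows = conn.execute("""
select custname
from customers c
where c.city = 'GOTHAM'
  and exists (select null
              from orders o, orderdetail od, articles a
              where a.artname = 'BATMOBILE'
                and a.artid = od.artid
                and od.ordid = o.ordid
                and o.custid = c.custid
                and o.ordered >= '2010-08-06')""").fetchall()

# uncorrelated in subquery
in_rows = conn.execute("""
select custname
from customers
where city = 'GOTHAM'
  and custid in (select o.custid
                 from orders o, orderdetail od, articles a
                 where a.artname = 'BATMOBILE'
                   and a.artid = od.artid
                   and od.ordid = o.ordid
                   and o.ordered >= '2010-08-06')""").fetchall()

print(exists_rows, in_rows)   # [('WAYNE',)] [('WAYNE',)]
```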
You might notice that the most recent version of the query performs an implicit distinct. Indeed, the subquery, because of its join, might return many rows for a single customer. That duplication doesn't matter, because the in condition checks only to see whether a value is in the list returned by the subquery, and in doesn't care whether a given value is in that list one time or a hundred times. Perhaps though, for the sake of consistency we should apply the same rules to the subquery that we have applied to the query as a whole, namely to acknowledge that we have an existence test within the subquery as well:
select custname
from customers
where city = 'GOTHAM'
  and custid in (select o.custid
                 from orders o
                 where o.ordered >= somefunc
                   and exists (select null
                               from orderdetail od,
                                    articles a
                               where a.artname = 'BATMOBILE'
                                 and a.artid = od.artid
                                 and od.ordid = o.ordid))
Alternatively, the existence test inside the subquery can itself be written with in:

select custname
from customers
where city = 'GOTHAM'
  and custid in (select custid
                 from orders
                 where ordered >= somefunc
                   and ordid in (select od.ordid
                                 from orderdetail od,
                                      articles a
                                 where a.artname = 'BATMOBILE'
                                   and a.artid = od.artid))
Irrespective of the fact that our nesting is getting deeper and becoming less legible, choosing which query is the best between the exists and the in follows the very same rule inside the subquery as before: the choice depends on the effectiveness of the condition on the date versus the condition on the article. Unless business has been very, very slow for the past six months, one might reasonably expect that the most efficient condition on which to filter the data will be the one on the article name. Therefore, in the particular case of the subquery, in is better than exists because it will be faster to find all the order lines that refer to a Batmobile and then to check whether the sale occurred in the last six months rather than the other way round. This approach will be faster assuming that the table orderdetail is indexed on artid; otherwise, our bright, tactical move will fail dismally.
Most SQL dialects allow you to rewrite uncorrelated subqueries as inline views in the from clause. However, you must always remember that an in performs an implicit removal of duplicate values, which must become explicit when the subquery is moved to become an inline view in the from clause. For example:
select custname
from customers
where city = 'GOTHAM'
  and custid in (select o.custid
                 from orders o,
                      (select distinct od.ordid
                       from orderdetail od,
                            articles a
                       where a.artname = 'BATMOBILE'
                         and a.artid = od.artid) x
                 where o.ordered >= somefunc
                   and x.ordid = o.ordid)
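The implicit deduplication performed by in, and the explicit distinct a join against an inline view requires, are easy to see on a stripped-down example (SQLite, invented rows):

```python
import sqlite3

# One customer with two orders: in deduplicates implicitly,
# a join against an inline view does not
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customers (custid integer primary key, custname text);
create table orders    (ordid integer primary key, custid integer);
insert into customers values (1, 'WAYNE');
insert into orders    values (100, 1), (101, 1);
""")

in_form = conn.execute(
    "select custname from customers"
    " where custid in (select custid from orders)").fetchall()
join_no_distinct = conn.execute(
    "select custname from customers c,"
    " (select custid from orders) o where o.custid = c.custid").fetchall()
join_distinct = conn.execute(
    "select custname from customers c,"
    " (select distinct custid from orders) o where o.custid = c.custid").fetchall()

print(in_form)            # [('WAYNE',)]
print(join_no_distinct)   # [('WAYNE',), ('WAYNE',)] -- duplicate row
print(join_distinct)      # [('WAYNE',)]
```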
The different ways you have to write functionally equivalent queries (and variants other than those given in this section are possible) are comparable to words that are synonyms. In written and spoken language, synonyms have roughly the same meaning, but each one introduces a subtle difference that makes one particular word more suitable to a particular situation or expression (and in some cases another synonym is totally inappropriate). In the same way, both data and implementation details may dictate the choice of one query variant over others.
Lessons to be learned from the Batmobile trade
The various examples of SQL that you saw in the preceding section may look like an idle exercise in programming dexterity, but they are more than that. The key point is that there are many different ways in which we can attack the data, and that we don't necessarily have to go first through customers, then orders, then orderdetail, and then articles as some of the ways of writing the query might suggest.
If we represent the strength of our search criteria with arrows (the more discriminant the criterion, the larger the arrow), we can assume that we have very few customers in Gotham, but that we sell quite a number of Batmobiles and business has been brisk for the past six months, in which case our battle map may look like the figure below. Although we have a condition on the article name, the medium arrow points to orderdetail because that is what truly matters. We may have very few articles for sale, which may represent similar percentages of our revenue, or we may have a huge number of articles, of which one of the best sellers is the Batmobile.
When query discrimination is based on location
Alternatively, we can assume that most of our customers are indeed based in Gotham, but that very few actually buy Batmobiles, in which case our battle plan will look more like the figure below. It is quite obvious then that we really have to cut to pieces the orderdetail table, which is the largest one. The faster we slash this table, the faster our query will run.
When query discrimination is based on purchase
Note also (and this is a very important point) that the criterion "during the last six months" is not a very precise one. But what if we change the criterion to specify the last two months and happen to have 10 years of sales history online? In that case, it may be more efficient to get to those recent orders first (which, thanks to some techniques described in Chapter 5, may be clustered together) and then start from there, selecting customers from Gotham, on the one hand, and orders for Batmobiles on the other. To put it another way, the best execution plan does not only depend on the data values; it may also evolve over time.
What then can we conclude from all this? First, that there is more than one way to skin a cat...and that an expression of a query is usually associated with implicit assumptions about the data. With each different expression of a query we will obtain the same result set, but it may be at significantly different speeds. The way we write the query may influence the execution path, especially when we have to apply criteria that cannot be expressed within the truly relational part of the environment. If the optimizer is to be allowed to function at its best, we must try to maximize the amount of true relational processing and ensure the non-relational component has minimum impact on the final result.
We have assumed all along in this chapter that statements will be run as suggested by the way they are written. Be aware, though, that an optimizer may rewrite queries, sometimes pretty aggressively. You could argue that rewrites by the optimizer don't matter, because SQL is supposed to be a declarative language in which you state what you want and let the DBMS provide it. However, you have seen that each time we have rewritten a query in a different way, we have had to change assumptions about the distribution of data and about existing indexes. It is highly important, therefore, to anticipate the work of the optimizer to be certain that it will find what it needs, whether in terms of indexes or in terms of detailed-enough statistical information about the data.
The correct result from an SQL statement is only the first step in building the best SQL.
Querying Large Quantities of Data
It may sound obvious, but the sooner we get rid of unwanted data, the less we have to process at later stages of a query, and the more efficiently the query will run. An excellent application of this principle can be found with set operators, of which union is probably the most widely used. It is quite common to find in a moderately complex union a number of tables appearing in several of the queries "glued" together with the union operator. One often sees the union of fairly complex joins, with most of the joined tables occurring in both select statements of the union, something like the following:
select ...
from A, B, C, D, E1
where (condition on E1)
  and (joins and other conditions)
union
select ...
from A, B, C, D, E2
where (condition on E2)
  and (joins and other conditions)
This type of query is typical of the cut-and-paste school of programming. In many cases it may be more efficient to use a union of those tables that are not common, complete with the screening conditions, and to then push that union into an inline view and join the result, writing something similar to:
select ...
from A, B, C, D,
     (select ...
      from E1
      where (condition on E1)
      union
      select ...
      from E2
      where (condition on E2)) E
where (joins and other conditions)
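The two formulations return the same result set, which can be checked on a small sketch (SQLite; the tables A, E1, and E2 and their rows are invented, and "val > 10" plays the role of the screening conditions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table A  (id integer primary key, name text);
create table E1 (aid integer, val integer);
create table E2 (aid integer, val integer);
insert into A  values (1, 'a1'), (2, 'a2');
insert into E1 values (1, 5), (1, 20), (2, 30);
insert into E2 values (2, 40), (1, 3);
""")

# cut-and-paste style: A is joined on both sides of the union
glued = conn.execute("""
select a.name, e.val from A a join E1 e on e.aid = a.id where e.val > 10
union
select a.name, e.val from A a join E2 e on e.aid = a.id where e.val > 10
""").fetchall()

# the union of E1 and E2, screened early, pushed into an inline view
pushed = conn.execute("""
select a.name, e.val
from A a join (select aid, val from E1 where val > 10
               union
               select aid, val from E2 where val > 10) e
          on e.aid = a.id
""").fetchall()

print(sorted(glued) == sorted(pushed))   # True
print(sorted(pushed))                    # [('a1', 20), ('a2', 30), ('a2', 40)]
```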
Another classic example of conditions applied in the wrong place is a danger associated with filtering when a statement contains a group by clause. You can filter on the columns that define the grouping, on the result of the aggregate (for instance, when you want to check whether the result of a count( ) is smaller than a threshold), or on both. SQL allows you to specify all such conditions inside the having clause, which filters after the group by (in practice, a sort followed by an aggregation) has been completed. Any condition bearing on the result of an aggregate function must be inside the having clause, since the result of such a function is unknown before the group by. Any condition that is independent of the aggregate should go into the where clause instead, where it contributes to decreasing the number of rows that we shall have to sort to perform the group by.
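A minimal sketch of the two placements (SQLite, invented rows): both forms return the same group, but the where form feeds three rows to the group by instead of five:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table customers (custid integer primary key, city text)")
conn.executemany("insert into customers values (?, ?)",
                 [(1, 'GOTHAM'), (2, 'GOTHAM'), (3, 'GOTHAM'),
                  (4, 'METROPOLIS'), (5, 'METROPOLIS')])

# having: all rows are sorted and aggregated, then unwanted groups discarded
having_form = conn.execute(
    "select city, count(*) from customers"
    " group by city having city = 'GOTHAM'").fetchall()

# where: the condition on the grouping column filters before the sort
where_form = conn.execute(
    "select city, count(*) from customers"
    " where city = 'GOTHAM' group by city").fetchall()

print(having_form, where_form)   # [('GOTHAM', 3)] [('GOTHAM', 3)]
```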
Let's return to our customers and orders example, admitting that the way we process orders is rather complicated. Before an order is considered complete, we have to go through several phases that are recorded in the table orderstatus, of which the main columns are ordid, the identifier of the order; status; and statusdate, which is a timestamp. The primary key is compound, consisting of (ordid, statusdate). Our requirement is to list, for all orders for which the status is not flagged as complete (assumed to be final), the identifier of the order, the customer name, the last known order status, and when this status was set. To that end, we might build the following query, filtering out completed orders and identifying the current status as the latest status assigned:
select c.custname, o.ordid, os.status, os.statusdate
from customers c,
     orders o,
     orderstatus os
where o.ordid = os.ordid
  and not exists (select null
                  from orderstatus os2
                  where os2.status = 'COMPLETE'
                    and os2.ordid = o.ordid)
  and os.statusdate = (select max(statusdate)
                       from orderstatus os3
                       where os3.ordid = o.ordid)
  and o.custid = c.custid
At first sight this query looks reasonable, but in fact it contains a number of deeply disturbing features. First, notice that we have two subqueries, and notice too that they are not nested as in the previous examples, but are only indirectly related to each other. Most worrying of all, both subqueries hit the very same table, already referenced at the outer level. What kind of filtering condition are we providing? Not a very precise one, as it only checks for the fact that orders are not yet complete.
How can such a query be executed? An obvious approach is to scan the orders table, checking for each row whether the order is or is not complete. (Note that we might have been happy to find this information in the orders table itself, but this is not the case.) Then, and only then, can we check the date of the most recent status, executing the subqueries in the order in which they are written.
The unpleasant fact is that both subqueries are correlated. Since we have to scan the orders table, it means that for every row from orders we shall have to check whether we encounter the status set to COMPLETE for that order. The subquery to check for that status will be fast to execute, but not so fast when repeated a large number of times. When there is no COMPLETE status to be found, then a second subquery must be executed. What about trying to un-correlate queries?
The easiest query to un-correlate happens to be the second one. In fact, we can write, at least with some SQL dialects:
and (o.ordid, os.statusdate) = (select ordid, max(statusdate) from orderstatus group by ordid)
The subquery that we have now will require a full scan of orderstatus; but that's not necessarily bad, and we'll discuss our reasoning in a moment.
There is something quite awkward in the condition on the pair of columns on the left-hand side of the rewritten subquery condition. These columns come from different tables, and they need not do so. In fact, we want the order identifier to be the same in orders and orderstatus; will the optimizer understand the subtlety of this situation? That is rather uncertain. If the optimizer doesn't understand, then it will be able to execute the subquery first, but will have to join the two other tables together before being able to exploit the result of the subquery. If the query were written slightly differently, the optimizer would have greater freedom to decide whether it actually wants to do what I've just described or exploit the result of the subquery and then join orders to orderstatus:
and (os.ordid, os.statusdate) = (select ordid, max(statusdate) from orderstatus group by ordid)
The reference on the left side to two columns from the same table removes the dependence of identification of the most recent status for the order on a preliminary join between orderstatus and orders. A very clever optimizer might have performed the modification for us, but it is wiser to take no risk and specify both columns from the same table to begin with. It is always much better to leave the optimizer with as much freedom as we can.
You have seen previously that an uncorrelated subquery can become a join in an inline view without much effort. We can indeed rewrite the entire query to list pending orders as follows:
select c.custname, o.ordid, os.status, os.statusdate
from customers c,
     orders o,
     orderstatus os,
     (select ordid, max(statusdate) laststatusdate
      from orderstatus
      group by ordid) x
where o.ordid = os.ordid
  and not exists (select null
                  from orderstatus os2
                  where os2.status = 'COMPLETE'
                    and os2.ordid = o.ordid)
  and os.statusdate = x.laststatusdate
  and os.ordid = x.ordid
  and o.custid = c.custid
But then, if COMPLETE is indeed the final status, do we need the subquery to check the nonexistence of the last stage? The inline view helps us to identify which is the last status, whether it is COMPLETE or anything else. We can apply a perfectly satisfactory filter by checking the latest known status:
select c.custname, o.ordid, os.status, os.statusdate
from customers c,
     orders o,
     orderstatus os,
     (select ordid, max(statusdate) laststatusdate
      from orderstatus
      group by ordid) x
where o.ordid = os.ordid
  and os.statusdate = x.laststatusdate
  and os.ordid = x.ordid
  and os.status != 'COMPLETE'
  and o.custid = c.custid
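Here is the final version run against a toy data set (SQLite, invented rows): order 100 ended COMPLETE and is filtered out, while order 101, still at SHIPPED, is reported with its latest status:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customers  (custid integer primary key, custname text);
create table orders     (ordid integer primary key, custid integer);
create table orderstatus(ordid integer, status text, statusdate text,
                         primary key (ordid, statusdate));
insert into customers   values (1, 'WAYNE'), (2, 'KENT');
insert into orders      values (100, 1), (101, 2);
insert into orderstatus values (100, 'NEW',      '2011-01-01'),
                               (100, 'COMPLETE', '2011-01-05'),
                               (101, 'NEW',      '2011-01-02'),
                               (101, 'SHIPPED',  '2011-01-06');
""")

pending = conn.execute("""
select c.custname, o.ordid, os.status, os.statusdate
from customers c,
     orders o,
     orderstatus os,
     (select ordid, max(statusdate) laststatusdate
      from orderstatus
      group by ordid) x
where o.ordid = os.ordid
  and os.statusdate = x.laststatusdate
  and os.ordid = x.ordid
  and os.status != 'COMPLETE'
  and o.custid = c.custid
""").fetchall()
print(pending)   # [('KENT', 101, 'SHIPPED', '2011-01-06')]
```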
The duplicate reference to orderstatus can be further avoided by using OLAP or analytical functions available with some SQL engines. But let's pause here and consider how we have modified the query and, more importantly, the execution path. Basically, our natural path was initially to scan the orders table, and then access through what may reasonably be expected to be an efficient index on the table orderstatus. In the last version of our query, we will attack through a full scan of orderstatus, to perform a group by. In terms of the number of rows, orderstatus will necessarily be several times bigger than orders. However, in terms of mere volume of data to scan, we can expect it to be smaller, possibly significantly smaller, depending on how much information is stored for each order.
We cannot say with certainty which approach will be better; it depends on the data. Let me add that seeing a full scan on a table that is expected to grow is not a good idea (restricting the search to the last month's, or last few months' worth of data can help). But there is a significant chance that this last version of our query will perform better than the first attempt with the subquery in the where clause.
We cannot leave the subject of large data volumes without mentioning a slightly special case. When a query returns a very large amount of data, you have reasonable grounds for suspecting that it's not an individual sitting at a terminal that executed the query. The likelihood is that such a query is part of a batch process. Even if there is a longish "preparatory phase," nobody will complain so long as the whole process performs to a satisfactory standard. Do not, of course, forget that a phase, preparatory or not, requires resources: CPU, memory, and possibly temporary disk space. It helps to understand that the optimizer, when returning a lot of data, may choose a path that has nothing in common with the path it would adopt when returning few rows, even if the fundamental query is identical.
Filter out unneeded data as early as possible.
The Proportions of Retrieved Data
A typical and frequently quoted saying is the famous "don't use indexes when your query returns more than 10% of the rows of a table." This states implicitly that (regular) indexes are efficient when an index key points to 10% or less of the rows in a table. As I have already pointed out in Chapter 3, this rule of thumb dates back to a time when relational databases were still regarded with suspicion in many companies. In those days, their use was mostly confined to that of departmental databases. This was a time when a 100,000-row table was considered a really big one. Compared to 10% of a 500 million-row table, 10% of 100,000 rows is a trifle. Can we seriously hope that the best execution plan in one case will still be the best execution plan in the other case? Such is wishful thinking.
Independently of the evolution of table sizes since the time when the "10% of rows" rule of thumb was first coined, be aware that the number of rows returned means nothing in itself, except in terms of response time expectations by end users. If you compute an average value over 1 billion rows, you return a single row, and yet the DBMS performs a lot of work. Even without any aggregation, what matters is the number of data pages that the DBMS is going to hit when performing the query. Data page hits don't only depend on the existence of indexes: as you saw in Chapter 3, the relation of indexes to the physical order of rows in the table can make a significant difference in the number of pages to visit. Other implementation issues that I am going to discuss in Chapter 5 play an important part, too: depending on how data is physically stored, the same number of rows returned may mean that you have to visit massively different numbers of data pages. Furthermore, operations that would execute sequentially with one access path may be massively parallelized with another one. Don't fall into the row percentage trap.
When we want a lot of data, we don't necessarily want an index.