April 5, 2011, 6:23 p.m.
posted by blackhat
Three-Way Joins and Beyond
When more than two tables are joined, each join is scheduled and processed separately—there is no such thing as a nested-loop within a nested-loop, even though that might seem to be the obvious strategy. Partly that's because some joins can be processed in parallel (for example, Oracle does in-parallel joining), and partly it's because a nested-loop within a nested-loop would exacerbate the problem that simple joins have (that is, that the innermost table wouldn't stay in cache). Consider this SQL statement:
SELECT * FROM Table1, Table2, Table3 WHERE Table1.column1 = Table2.column1 AND Table2.column1 = Table3.column1
For a three-way join of Table1, Table2, and Table3, the strategy is to merge Table1 and Table2 to a single (Table1_2) table, and then merge (Table1_2) with Table3. The question—Which expression is evaluated first?—is answered in the usual way: The expression with the most points goes first (see Chapter 2, "Simple Searches"). After that, though, some shuffling occurs that depends on the closeness of the second expression to be evaluated to the first expression—so the expression with the Chapter 2, "Simple Searches"—but keep in mind that, when joins are involved, those rules are no more than a weak hint.
You can expect the DBMS optimizer to start going wrong if five or more joins are in the query (until recently Microsoft's admitted limit was four). You can get around such restrictions by making temporary tables from five-table joins and then joining the temporary tables, but the fact is that your guess will be no better than that of your DBMS vendor. For a multiway join, finding the shortest path is reminiscent of a traveling salesman problem and becomes more and more complex as you add tables.
For Microsoft and Sybase, there's a common recommendation: Add a redundant WHERE clause so that the optimizer will have more choices about paths. Here's an example:
SELECT * FROM Table1, Table2, Table3 WHERE Table1.column1 = Table2.column1 AND Table2.column1 = Table3.column1 AND Table1.column1 = Table3.column1 /* the final AND is redundant but put it in anyway */
The thinking here is: If the optimizer is good, it will throw away the redundant clause, but if the optimizer is bad, it will benefit from the redundant clause because it now has a choice of all six possible join orders—including Table1 to Table3, and Table3 to Table1—which it wouldn't otherwise notice.
When we added the redundant WHERE clause to our example, we had to choose whether to add Expression #1 or Expression #2:
Expression #1: ... WHERE Table1.column1 = Table3.column1 Expression #2: ... WHERE Table3.column1 = Table1.column1
The best way to decide how to phrase a redundant WHERE clause is to figure out which table will be the outer table in your first join and base your decision on that. (You lose this choice if you use a USING or an ON clause.) Don't try to change clauses if you have a working query that contains an outer join.