Convert Subqueries to JOINs






Convert Subqueries to JOINs

Sometimes you want to query one table, use that result to query another table, and then use that result to query yet another table. It's tempting to do this as three separate queries, but the right solution is to chain them yourself into one SQL statement.

Consider a database where employees have job titles, and job titles have ranks, and a rank has a salary, as shown in Figure, Figure, and Figure.

The jobs table
EmployeeTitle
Gordon RussellLecturer
Andrew CummingTeaching fellow
Jim SmithTechnician


The ranks table
TitleRank
LecturerLECT1
Teaching fellowLECT2
TechnicianTECH1


The salary table
RankPayment
LECT12000.00
LECT23000.00
TECH15000.00
TECH26000.00


Determining how much to pay Andrew Cumming would require three steps. First, you'd need to determine Andrew's title:

mysql> SELECT title FROM jobs WHERE employee = 'Andrew Cumming';
+-----------------+
| title           |
+-----------------+
| Teaching Fellow |
+-----------------+

Next, you'd need to determine the pay rank for a teaching fellow:

mysql> SELECT rank FROM ranks WHERE title = 'Teaching Fellow';
+-------+
| rank  |
+-------+
| LECT2 |
+-------+

Finally, you'd need to look up the salary for someone at the LECT2 pay grade:

mysql> SELECT payment FROM salary WHERE rank = 'LECT2';
+---------+
| payment |
+---------+
| 3000.00 |
+---------+

That's not efficient, because you'd need to pass three different queries to the database and process the results in between. If a table is updated during this process the answer might be wrong, or the query might even return an error. Combining queries can make people nervous. Nervous programmers often use subqueries:

mysql> SELECT payment FROM salary WHERE rank =
    ->   (SELECT rank FROM ranks WHERE title =
    ->     (SELECT title FROM jobs WHERE employee = 'Andrew Cumming'));
+---------+
| payment |
+---------+
| 3000.00 |
+---------+

The preceding code is good in that you have reduced the problem to a single query and thus removed many of the overhead problems, but subquery statements can be slow. When your subquery statements contain no aggregate functions (such as MAX( )), chances are you don't need a subqueryyou need a JOIN. If you have a working subquery arrangement, follow these steps to make it a JOIN:

  1. Mark all columns with the table name they come from.

  2. If you use the same table in two different FROM clauses, use aliases (not needed in this example).

  3. Move all FROM statements together to form a single FROM.

  4. Delete the occurrences of (Select.

  5. Substitute WHERE for AND after the first occurrence of WHERE.

Here's an intermediate stage:

SELECT payment FROM salary,ranks,jobs WHERE salary.rank = 
  (Select ranks.rank  from grades AND ranks.title = 
    (Select jobs.title from jobs AND jobs.employee = 'Andrew Cumming'))

Ultimately, you get this:

SELECT payment FROM salary,ranks,jobs
WHERE  salary.rank = ranks.rank
AND    ranks.title = jobs.title
AND    jobs.employee = 'Andrew Cumming'

Another approach is to take the conditions inside the subquery statements and make them JOIN ON conditions:

SELECT payment
FROM salary JOIN ranks ON (salary.rank = ranks.rank)
     JOIN jobs ON (ranks.title = jobs.title)
WHERE jobs.employee = 'Andrew Cumming'

Looking for What's Not There

Often programmers can handle this join-instead-of-subquery approach for inclusive matches, but it starts to get a little shakier with exclusive matches (looking for things that do not exist). For instance, how do you determine whether there are any ranks not currently allocated to a title? The brute force approach is to query the database for all ranks from the salary table, and then query each one in the ranks table. Needless to say, such an approach would result in bad performance. The next general approach is to use a subquery with NOT IN, but this too may not perform well:

mysql> SELECT salary.rank FROM salary
    -> WHERE rank NOT IN (SELECT rank FROM ranks);
+-------+
| rank  |
+-------+
| TECH2 |
+-------+

These may be performance losers because the subquery will likely be executed first, creating an intermediate temporary table in the database. This temporary table is then used to solve the outer query. However, in creating the temporary table, all indexes that may have existed on salary won't be used, and the database will have to perform a full scan on the temporary table.

The query is the opposite of the nested subquery problem considered earlier, because now you want to find nonmatching rows between tables. Strangely, trying the earlier technique but using != rather than = results in a huge mess of results which don't mean anything. Instead, you need to rely on OUTER JOIN. Put all tables required into a single FROM clause using an OUTER JOIN to link the tables. You are looking for things in salary which are not in ranks, and with OUTER JOIN the rows that don't match will have NULL values for ranks.rank:

mysql> SELECT salary.rank
    -> FROM salary LEFT OUTER JOIN ranks ON (salary.rank = ranks.rank)
    -> WHERE ranks.rank IS NULL;
+-------+
| rank  |
+-------+
| TECH2 |
+-------+

You also can use this technique to eliminate EXISTS and NOT EXISTS. Without the subquery, the optimizer finds it much easier to use your indexes.



 Python   SQL   Java   php   Perl 
 game development   web development   internet   *nix   graphics   hardware 
 telecommunications   C++ 
 Flash   Active Directory   Windows