Jan. 22, 2011, 9:20 p.m.
posted by telumel
When our spies (whether they are users or monitoring facilities) have directed our attention to a number of SQL statements, we need to inspect these statements more closely. Scrutinizing execution plans is one of the favorite activities of many SQL tuners, if we are to believe the number of forum and mailing-list posts of the form "I have a SQL query that is particularly slow; here is the execution plan...."
Execution plans are usually displayed either as an indented list of the various steps involved in the processing of a (usually complex) SQL statement, or in graphical form, as in the figure below, which displays the execution plan for one of the queries from Chapter 7. Text execution plans are far less sexy, but they are easier to post on forums, which must account for their enduring popularity. Knowing how to correctly read and interpret an execution plan, whether it is represented graphically or as text, is in itself a valued skill.
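As a brief refresher on reading text plans, here is a made-up, Oracle-style example (not one of the plans discussed in this chapter): each operation is indented under the operation that consumes its output, siblings execute top to bottom, and a parent receives rows from its children:

```
0  SELECT STATEMENT
1    NESTED LOOPS
2      TABLE ACCESS (FULL) OF 'DEPT'             <- outer input, scanned first
3      TABLE ACCESS (BY INDEX ROWID) OF 'EMP'    <- inner input, probed once per DEPT row
4        INDEX (RANGE SCAN) OF 'EMP_DEPT_IDX'    <- supplies the rowids for step 3
```

The table names and index name here are invented for illustration; the point is simply the inside-out, top-down reading order.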
A DB2 execution plan
So far in this book, I have had very little to say on the topic of execution plans, except for a couple of examples presented here and there without any particular comment. Execution plans are tools, and different individuals have different preferences among tools; you are perfectly entitled to a different opinion, but I usually attach only secondary importance to execution plans. Some developers consider them the ultimate key to understanding performance issues. Two real-life examples will show why one may have reasons to be less than sanguine about using execution plans as the tool of choice for improving a query.
Identifying the Fastest Execution Plan
In this section, I am going to test your skills as an interpreter of execution plans. I'm going to show three execution plans and ask you to choose which is the fastest. Ready? Go, and good luck!
The following execution plans show how three variants of the same query are executed:
Our battlefield
The result set of the query consists of 860 rows, and the following four tables are involved:
All tables are heavily indexed; no index was created, dropped, or rebuilt, and no change was applied to the data structures. Only the text of the query changed between plans, and optimizer directives were sometimes applied.
Consider the three execution plans, try to rank them in order of likely speed, and, if you feel like it, venture an opinion about the improvement factor.
And the winner is...
The answer is that Plan 1 took 27 seconds, Plan 2 one second, and Plan 3 (the initial execution plan of the query) one minute and 12 seconds. You will be forgiven for choosing the wrong plan; in fact, with the information I provided, correctly picking the fastest plan would have been sheer luck (or the result of a well-founded suspicion that there must be a catch somewhere). Note that the slowest execution plan is by far the shortest, and that it contains no reference to anything other than indexed accesses. By contrast, Plan 1 demonstrates that you can have two full scans of the same table and yet execute the query almost three times faster than a shorter, index-only plan such as Plan 3.
The point of this exercise was to demonstrate that the length of an execution plan is not very meaningful, and that exclusive access to tables through indexes doesn't guarantee that performance is the best you can achieve. True, if you have a 300-line plan for a query that returns 19 rows, then you might have a problem, but you mustn't assume that shorter is better.
Forcing the Right Execution Plan
The second example is the weird behavior exhibited by one query issued by a commercial off-the-shelf software package. When run against one database, the query takes 4 minutes to return 40,000 rows. Against another database, running the same version of the same DBMS on comparable hardware, the very same query takes 11 minutes, although all the tables involved are much smaller. A comparison of execution plans shows that they are wildly different. Statistics are up to date on both databases, and the optimizer is instructed to use them everywhere. The question immediately becomes one of how to force the query to take the right execution path on the smaller database: DBAs are asked to do whatever is in their power to get the same execution plan on both databases, and the vendor's technical team works closely with the customer's team to try to solve the problem.
A stubborn query
Following is the text of the query,[*] followed by the plan associated with the fastest execution. Note that the good plan accesses only indexes, not tables:
select o.id_outstanding, ap.cde_portfolio, ap.cde_expense, ap.branch_code,
       to_char(sum(ap.amt_book_round + ap.amt_book_acr_ad - ap.amt_acr_nt_pst)),
       to_char(sum(ap.amt_mnl_bk_adj)),
       o.cde_outstd_typ
from accrual_port ap,
     accrual_cycle ac,
     outstanding o,
     deal d,
     facility f,
     branch b
where ac.id_owner = o.id_outstanding
and ac.id_acr_cycle = ap.id_owner
and o.cde_outstd_typ in ('LOAN', 'DCTLN', 'ITRLN', 'DEPOS', 'SLOAN', 'REPOL')
and d.id_deal = o.id_deal
and d.acct_enabl_ind = 'Y'
and (o.cde_ob_st_ctg = 'ACTUA'
     or o.id_outstanding in (select id_owner
                             from subledger))
and o.id_facility = f.id_facility
and f.branch_code = b.branch_code
and b.cde_tme_region = 'ZONE2'
group by o.id_outstanding, ap.cde_portfolio, ap.cde_expense,
         ap.branch_code, o.cde_outstd_typ
having sum(ap.amt_book_round + ap.amt_book_acr_ad - ap.amt_acr_nt_pst) <> 0
or (sum(ap.amt_mnl_bk_adj) is not null
    and sum(ap.amt_mnl_bk_adj) <> 0)
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE
   1    0   FILTER
   2    1     SORT (GROUP BY)
   3    2       FILTER
   4    3         HASH JOIN
   5    4           HASH JOIN
   6    5             HASH JOIN
   7    6               INDEX (FAST FULL SCAN) OF 'XDEAUN08' (UNIQUE)
   8    6               HASH JOIN
   9    8                 NESTED LOOPS
  10    9                   INDEX (FAST FULL SCAN) OF 'XBRNNN02' (NON-UNIQUE)
  11    9                   INDEX (RANGE SCAN) OF 'XFACNN05' (NON-UNIQUE)
  12    8                 INDEX (FAST FULL SCAN) OF 'XOSTNN06' (NON-UNIQUE)
  13    5             INDEX (FAST FULL SCAN) OF 'XACCNN05' (NON-UNIQUE)
  14    4           INDEX (FAST FULL SCAN) OF 'XAPONN05' (NON-UNIQUE)
  15    3       INDEX (SKIP SCAN) OF 'XBSGNN03' (NON-UNIQUE)
The addition of indexes to the smaller database leads nowhere. Existing indexes were initially identical on both databases, and creating different indexes on the smaller database brings no change to the execution plan. Three weeks after the problem was first spotted, attention is now turning to disk striping, without much hope. Constraining optimizer directives are beginning to look unpleasantly like the only escape route.
Before using directives, it is wise to have a fair idea of the right angle of attack. Finding the proper angle, as you have seen in Chapters 4 and 6, requires an assessment of the relative precision of the various input criteria, even though in this case the reasonably large result set (of some 40,000 rows on the larger database and a little over 3,000 on the smaller database) gives us little hope of seeing one criterion coming forward as the key criterion.
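On the DBMS used here (Oracle, judging by the plan output), such a directive takes the form of a hint embedded in a comment. As a purely illustrative sketch, a hint pinning the access to outstanding to the xostnn06 index seen in the plan above might look like the following; the query itself is a toy example, not the vendor's statement:

```sql
-- Illustration only: an Oracle-style optimizer directive (a "hint").
-- The index hint asks the optimizer to access table outstanding
-- (aliased o) through the named index; the index name is taken from
-- the execution plan above, but this count query is made up.
select /*+ index(o xostnn06) */
       count(*)
from   outstanding o
where  o.cde_ob_st_ctg = 'ACTUA';
```

The trouble with such directives is that they freeze one access path regardless of how volumes evolve, which is precisely why they look like an escape route of last resort here.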
Study of search criteria
When we use as the only criterion the condition on what looks like a time zone, the query returns 17% more rows than all the filtering conditions put together, but it does so blazingly fast:
SQL> select count(*) "FAC"
  2  from outstanding
  3  where id_facility in (select f.id_facility
  4                        from facility f,
  5                             branch b
  6                        where f.branch_code = b.branch_code
  7                        and b.cde_tme_region = 'ZONE2');

       FAC
----------
     55797

Elapsed: 00:00:00.66
The flag condition alone lets through about three times our final number of rows, but it too runs very fast:
SQL> select count(*) "DEA"
  2  from outstanding
  3  where id_deal in (select id_deal
  4                    from deal
  5                    where acct_enabl_ind = 'Y');

       DEA
----------
    123970

Elapsed: 00:00:00.63
What about our or condition on the outstanding table? Following are the results from that condition:
SQL> select count(*) "ACTUA/SUBLEDGER"
  2  from outstanding
  3  where (cde_ob_st_ctg = 'ACTUA'
  4         or id_outstanding in (select id_owner
  5                               from subledger));

ACTUA/SUBLEDGER
---------------
          32757

Elapsed: 00:15:00.64
Looking at these results, it is clear that we have pinpointed the problem. This or condition causes a huge increase in the query's execution time.
The execution plan for the preceding query shows only index accesses:
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE
   1    0   SORT (AGGREGATE)
   2    1     FILTER
   3    2       INDEX (FAST FULL SCAN) OF 'XOSTNN06' (NON-UNIQUE)
   4    2       INDEX (SKIP SCAN) OF 'XBSGNN03' (NON-UNIQUE)
Notice that neither index access is exactly the usual type of index descent; there is no need to get into arcane details here, but a FAST FULL SCAN is in fact the choice of scanning the smaller index rather than the larger associated table, and the choice of a SKIP SCAN proceeds from a similar evaluation by the optimizer. In other words, the choice of the access method is not driven by the evidence of an excellent path, but proceeds from a kind of "by and large, it should be better" assessment by the optimizer. If the execution time is to be believed, a SKIP SCAN is not the best of choices here.
Let's have a look at the indexes on outstanding (the numbers of distinct index keys and of distinct column values are estimates, which accounts for the slightly inconsistent figures). The index from this table that appears in the execution plan is XOSTNN06:
INDEX_NAME              DIST KEYS COLUMN_NAME        DIST VAL
--------------------- ----------- ------------------ --------
XOSTNC03                    25378 ID_DEAL                1253
                                  ID_FACILITY            1507
XOSTNN05                   134875 ID_OUTSTANDING       126657
                                  ID_DEAL                1253
                                  IND_AUTO_EXTND            2
                                  CDE_OUTSTD_TYP            5
                                  ID_FACILITY            1507
                                  UID_REC_CREATE          161
                                  NME_ALIAS            126657
XOSTNN06                          ID_OUTSTANDING       126657
                                  CDE_OUTSTD_TYP            5
                                  ID_DEAL                1253
                                  CDE_OB_ST_CTG             3
                                  ID_FACILITY            1507
XOSTUN01 (U)               121939 ID_OUTSTANDING       126657
XOSTUN02 (U)               111055 NME_ALIAS            126657
The other index (xbsgnn03) is associated with subledger:
INDEX_NAME              DIST KEYS COLUMN_NAME        DIST VAL
--------------------- ----------- ------------------ --------
XBSGNN03                   101298 BRANCH_CODE               8
                                  CDE_PORTFOLIO             5
                                  CDE_EXPENSE              56
                                  ID_OWNER              52664
                                  CID_CUSTOMER            171
XBSGNN04                    59542 ID_DEAL                4205
                                  ID_FACILITY            4608
                                  ID_OWNER              52664
XBSGNN05                    49694 BRANCH_CODE               8
                                  ID_FACILITY            4608
                                  ID_OWNER              52664
XBSGUC02 (U)               147034 CDE_GL_ACCOUNT            9
                                  CDE_GL_SHTNAME            9
                                  BRANCH_CODE               8
                                  CDE_PORTFOLIO             5
                                  CDE_EXPENSE              56
                                  ID_OWNER              52664
                                  CID_CUSTOMER            171
XBSGUN01 (U)               134581 ID_SUBLEDGER         154362
As is too often the case with COTS packages, we have here an excellent example of carpet-indexing.
The indexes on outstanding raise a couple of questions.
Modifying indexes is a delicate business that requires a careful study of all possible side effects. We have a number of questionable indexes here, but we also have an urgent problem to solve. Let's therefore refrain from making any changes to the existing indexes and concentrate on the SQL code.
As the numbers of distinct keys of our unique indexes show, we are not dealing with large tables here; in fact, the two other criteria we tried to apply to outstanding both gave excellent response times, in spite of being rather weak criteria. The pathetic result we get with the or construct comes from an attempt to merge data painfully extracted from the two indexes. Let's try something else:
SQL> select count(*) "ACTUA/SUBLEDGER"
  2  from (select id_outstanding
  3        from outstanding
  4        where cde_ob_st_ctg = 'ACTUA'
  5        union
  6        select o.id_outstanding
  7        from outstanding o,
  8             subledger sl
  9        where o.id_outstanding = sl.id_owner)
 10  /

ACTUA/SUBLEDGER
---------------
          32757

Elapsed: 00:00:01.82
No change to the indexes, and yet the optimizer suddenly sees the light, even though we hit the table outstanding twice. Execution is much, much faster now.
Replacing the "problem condition" and slightly reshuffling some of the other conditions makes the query run in 13 seconds where it used to take 4 minutes (reputedly the "good case"), and in only 3.4 seconds on the other database, where it used to take 11 minutes to return 3,200 rows.
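The final rewritten statement is not reproduced in the original text. A sketch of what it might have looked like, simply substituting for the or condition the union that was just shown to run in under two seconds (the reshuffling of the other conditions is not shown, and the actual rewrite may have differed):

```sql
select o.id_outstanding, ap.cde_portfolio, ap.cde_expense, ap.branch_code,
       to_char(sum(ap.amt_book_round + ap.amt_book_acr_ad - ap.amt_acr_nt_pst)),
       to_char(sum(ap.amt_mnl_bk_adj)),
       o.cde_outstd_typ
from accrual_port ap,
     accrual_cycle ac,
     outstanding o,
     deal d,
     facility f,
     branch b
where ac.id_owner = o.id_outstanding
and ac.id_acr_cycle = ap.id_owner
and o.cde_outstd_typ in ('LOAN', 'DCTLN', 'ITRLN', 'DEPOS', 'SLOAN', 'REPOL')
and d.id_deal = o.id_deal
and d.acct_enabl_ind = 'Y'
-- the or condition is replaced by the union tested above,
-- which is logically equivalent to (cde_ob_st_ctg = 'ACTUA'
-- or id_outstanding in (select id_owner from subledger))
and o.id_outstanding in (select id_outstanding
                         from outstanding
                         where cde_ob_st_ctg = 'ACTUA'
                         union
                         select id_owner
                         from subledger)
and o.id_facility = f.id_facility
and f.branch_code = b.branch_code
and b.cde_tme_region = 'ZONE2'
group by o.id_outstanding, ap.cde_portfolio, ap.cde_expense,
         ap.branch_code, o.cde_outstd_typ
having sum(ap.amt_book_round + ap.amt_book_acr_ad - ap.amt_acr_nt_pst) <> 0
or (sum(ap.amt_mnl_bk_adj) is not null
    and sum(ap.amt_mnl_bk_adj) <> 0)
```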
A moral to the story
It is likely that a more careful study and some work at the index level would allow the query to run considerably faster than 13 seconds. On the other hand, since everybody appeared to be quite happy with 4 minutes, 13 seconds is probably a good enough improvement.
What is fascinating in this true story (and many examples in this book are taken from real life) is how the people involved focused for several weeks on the wrong issue. There was indeed a glaring problem on the smaller database. The comparison of the two different execution plans led to the immediate conclusion that the execution plan corresponding to the slower execution was wrong (true) and therefore, implicitly, that the execution plan corresponding to the faster execution was right (false). This was a major logical mistake, and it misled several people into concentrating on trying to reproduce a bad execution plan instead of improving the query.
I must add a final note as a conclusion to the story. Once the query has been rewritten, the execution plan is still different on the two databases, a situation that, given the discrepancy of volumes, only proves that the optimizer is doing its job.
The only yardstick of query performance is how long a query takes to run, not whether its execution plan conforms to prejudices.