Turning Data Around

Turning Data Around

The most common difficulty that you may encounter when trying to solve SQL problems is when you have to program against what might charitably be called an "unconventional" design. Writing a query that performs well is often the most visible challenge. However, I must underline that the complex SQL queries that are forced upon developers by a poor design only mirror the complication of programs (including triggers and stored procedures) that the same poor design requires in order to perform basic operations such as integrity checking. By contrast, a sound design allows you to declare constraints and let the DBMS check them for you, removing much of the risk associated with complexity. After all, ensuring data integrity is exactly what a DBMS, a rather fine piece of software, has been engineered to achieve. Unfortunately, haphazard designs will force you to spend days coding application controls. As a bonus, you get very high odds of letting software bugs creep in. Unlike popular software systems that are in daily use by millions of users, where bugs are rapidly exposed and fixed, your home-grown software can hide bugs for weeks or months before they are discovered.

Rows That Should Have Been Columns

Rows that should have been originally specified as columns are most often encountered with that appalling "design" having the magical four attributes--entity_id, attribute_name, attribute_type, attribute_value--that are supposed to solve all schema evolution issues. Frighteningly, many supporters of this model seem to genuinely believe that it represents the ultimate sophistication in terms of normalization. You will find it under various, usually flattering, namessuch as meta-design , or fact dimension with data warehouse designers.

Proponents of the magical four attributes praise the "flexibility" of this model. There is an obvious confusion of flexibility with flabbiness. Being able to add "attributes" on the fly is not flexibility; those attributes need to be retrieved and processed meaningfully. The dubious benefit of inserting rows instead of painstakingly designing the database in the first place is absolutely negligible compared to the major coding effort that is required, first, to process those new rows, and second, to insure some minimal degree of integrity and data consistency. The proper way to deal with varying numbers of attributes is to define subtypes, as explained in Chapter 1. Subtypes let you define clean referential integrity constraintschecks that you will not need to code and maintain. A database should not be a mere repository where data is dumped without any thought to its semantic integrity.

The predominant characteristic of queries against meta-design tables, as tables designed around our magical four attributes are sometimes called, is that you find the same table invoked a very high number of times in the from clause. Typically, queries will resemble something like:

select emp_last_name.entity_id        employee_id,
       emp_last_name.attribute_value  last_name,
       emp_first_name.attribute_value first_name,
       emp_job.attribute_value        job_description,
       emp_dept.attribute_value       department,
       emp_sal.attribute_value        salary
from employee_attributes emp_last_name,
     employee_attributes emp_first_name,
     employee_attributes emp_job,
     employee_attributes emp_dept,
     employee_attributes emp_sal
where emp_last_name.entity_id = emp_first_name.entity_id
  and emp_last_name.entity_id = emp_job.entity_id
  and emp_last_name.entity_id = emp_dept.entity_id
  and emp_last_name.entity_id = emp_sal.entity_id
  and emp_last_name.attribute_name = 'LASTNAME'
  and emp_first_name.attribute_name = 'FIRSTNAME'
  and emp_job.attribute_name = 'JOB'
  and emp_dept.attribute_name = 'DEPARTMENT'
  and emp_sal.attribute_sal = 'SALARY'
order by emp_last_name.attribute_value

Note how the same table is referenced five times in the from clause. The number of self-joins is usually much higher than in this simple example. Furthermore, such queries are frequently spiced up with outer joins as well.

A query with a high number of self-joins performs extremely badly on large volumes; it is clear that the only reason for the numerous conditions in the where clause is to patch all the various "attributes" together. Had the table been defined as the more logical employees(employee_id, last_name, first_name, job_description, department, salary), our query would have been as simple as:

select *
from employees
order by last_name

And the best course for executing this query is obviously a plain table scan. The multiple joins and associated index accesses of the query against employee_attributes are performance killers.

We can never succeed in making a query run as fast against a rotten design as it will run against a clean design. Any clever rewriting of a SQL query against badly designed tables will be nothing more than a wooden leg, returning only some degree of agility to a crippled query. However, we can often obtain spectacular results in comparison to the multiple joins approach by trying to achieve a single pass on the attribute table.

We basically want one row with several attributes (reflecting what the table design should have been in the first place) instead of multiple rows, each with only one attribute of interest per row. Consolidating a multi-row result into a single row is a feat we know how to perform: aggregate functions do precisely this. The idea is therefore to proceed in two steps, as shown in Figure:

  1. Complete each row that contains only one value of interest, with as many dummy values as required to obtain the total number of attributes that we ultimately want.

  2. Aggregate the different rows so as to keep only the single value of interest from each (the single value in each column). A function such as max( ), that has the advantage of being applicable to most data types, is perfect for this kind of operation.

To be certain that max( ) will only retain meaningful values, we must use dummy values that will necessarily be smaller than any legitimate value we may have in a given column. It is probably better to use an explicit value rather than null as a dummy value, even though max( ) ignores null values according to the standard.

If we apply the "recipe" illustrated in Figure to our previous example, we can get rid of the numerous joins by writing:

Transmogrification of several rows into one row

select employee_id,
       max(last_name)        last_name,
       max(first_name)       first_name,
       max(job_description)  job_desription,
       max(department)       department,
       max(salary)           salary
from  -- select all the rows of interest, returning
      -- as many columns as we have rows, one column
      -- of interest per row and values smaller
      -- than any value of interest everywhere else
     (select entity_id                   employee_id,
             case attribute_name
               when 'LASTNAME' then attribute_value
               else ''
             end                         last_name,
             case attribute_name
               when 'FIRSTNAME' then attribute_value
               else ''
             end                         first_name,
             case attribute_name
               when 'JOB' then attribute_value
               else ''
             end                         job_description,
             case attribute_name
               when 'DEPARTMENT' then attribute_value
               else -1
             end                         department,
             case attribute_name
               when 'SALARY' then attribute_value
               else -1
             end                         salary
      from employee_attributes
      where attribute_name in ('LASTNAME',
                               'SALARY')) as inner
group by inner.employee_id
order by 2

The inner query is not strictly requiredwe could have used a series of max(case when ... end)--but the query as written makes the two steps appear more clearly.

An aggregate is not, as you might expect, the best option in terms of performance. But in the kingdom of the blind, the one-eyed man is king, and this type of query just shown usually has no trouble outperforming one having a monstrous number of self-joins. A word of caution, though: in order to accommodate any unexpectedly lengthy attribute, the attribute_value column is usually a fairly large variable-length string. As a result, the aggregation process may require a significant amount of memory, and in some extreme cases you may run into difficulties if the number of attributes exceeds a few dozen.

Multiple self-joins can often be avoided by retrieving all rows in a single pass, spreading the values across separate columns, and using an aggregate function to collapse the many rows into one.

Columns That Should Have Been Rows

In contrast to the previous design in which rows have been defined for each attribute, another example of poor design occurs where columns are created instead of individual rows. The classic design mistake made by many beginners is to predefine a fixed number of columns for a number of variables, with some of the columns set to null when values are missing. A typical example is illustrated in Figure, with a very poorly designed movie database (compare this design to the correct design of Figure in Chapter 8).

A badly designed movie database

Instead of using a movie_credits table as we did in Chapter 8 to link the movies table to the people table and record the nature of each individual's involvement, the poor design shown in Figure assumes that we will never need to record more than a fixed number of lead actors and one director. The first assumption is blatantly wrong and so is the second one since many sketch comedies have had multiple directors. As a representation of reality, this model is plainly flawed, which should already be sufficient reason to discard it. To make matters worse, a poor design, in which data is stored as columns and yet reporting output obviously requires data to be presented as rows, often results in rather confusing queries. Unfortunately, writing queries against poor database designs seems to be as unavoidable as taxes and death in the world of SQL development.

When you want different columns to be displayed as rows, you need a pivot table. Pivot tables are used to pivot, or turn sideways, tables where we want to see columns as rows. A pivot table is, in the context of SQL databases, a utility table that contains only one column, filled with incrementing values from 1 to whatever is needed. It can be a true table or a viewor even a query embedded in the from clause of a query. Using such a utility table is a favorite old trick of experienced SQL developers, and the next few subsections show how to create and use them.

Creating a pivot table

The constructs you have seen in Chapter 7 for walking trees are usually quite convenient for generating pivot table values; for instance, we can use a recursive withaction with those database systems that support it. Here is a DB2 example to generate numbers from 1 to 50:

with pivot(row_num)
   -- Generate 50 values
   -- 1 to 50, one value per row
   as (select 1 row_num
      from sysibm.sysdummy1
      union all
      select row_num + 1
      from pivot
      where row_num < 50)
select row_num
from pivot;

Similar tricks are of course possible with Oracle's connect by; for instance:[*]

[*] Beware that such a construct may not work with some older versions of Oracle.

select level
from dual
connect by level <= 50

Using one of these constructs inside the from clause of a query can make that query particularly illegible, and it is therefore often advisable to use a regular table as pivot. But a recursive query can be useful to fill the pivot table (an alternate solution to fill a pivot table is to use Cartesian joins between existing tables). Typically, a pivot table would hold something like 1,000 rows.

Multiplying rows with a pivot table

Now that we have a pivot table, what can we do with it? One way to look at a pivot table is to view it as a row-multiplying device. By combining a pivot to a table we want to see pivoted, we repeat each of the rows of the table to be transformed as many times as we wish. Specifying the number of times we want to see one row repeated is simply a matter of adding to the join a limiting condition on the pivot table, for instance:

where pivot.row_num <= multiplying value

We can thus multiply the three rows in a test employees table in a very simple way. First, here are the three rows:

SQL> select name, job
  2  from employees;
NAME       JOB
---------- ------------------------------
Tom        Manager
Dick       Software engineer
Harry      Software engineer

And now, here is the multiplication, by three, of those rows:

SQL> select e.name, e.job, p.row_num
  2  from employees e,
  3       pivot     p
  4  where p.row_num <= 3;

NAME       JOB                               ROW_NUM
---------- ------------------------------ ----------
Tom        Manager                                 1
Dick       Software engineer                       1
Harry      Software engineer                       1
Tom        Manager                                 2
Dick       Software engineer                       2
Harry      Software engineer                       2
Tom        Manager                                 3
Dick       Software engineer                       3
Harry      Software engineer                       3

9 rows selected.

It's best to index the only column in the pivot table so as not to fully scan this table when you need to use very few rows from it (as in the preceding example).

Using pivot table values

Besides the mere multiplying effect, the Cartesian join also allows us to associate a unique number in the range 1 to multiplying value for every copy of a row of the table we want to pivot. This value is simply the row_num column contributed by the pivot table, and it will enable us in turn to pick from each copy of a row only partial data. The full process of multiplication of the source rows and selection is illustrated, with a single row, in Figure. If we want the initial row to finally appear as a single column (which by the way implicitly requires the data types of col1 ... coln to be consistent), we must pick just one column into each of the rows generated by the Cartesian product. By checking the number coming from the pivot table, we can specify with a case for each resulting row which column is to be displayed to the exclusion of all the others. For instance, we can decide to display col1 if the value coming from the pivot table is 1, col2 if it is 2, and so on.

Pivoting a row

Needless to say, multiplying rows and discarding most of the columns we are dealing with is not the most efficient way of processing data; keep in mind that we are rowing upstream. An ideal database design would avoid the need for such multiplication and discarding.

Interestingly, and still in the hypothetical situation of a poor (to put it mildly) database design, a pivot table can in some circumstances bring direct performance benefits. Let's suppose that, in our badly designed movie database, we want to count how many different actors are recorded (note that none of the actor_... columns are indexed, and that we therefore have to fetch the values from the table). One way to write this query is to use a union:

select count(*)
from (select actor_1
      from movies
      select actor_2
      from movies
      select actor_3
      from movies) as m

But we can also pivot the table to obtain something that looks more like a select on the movie_credits table of the properly designed database:

select count(distinct actor_id)
from --  Use a 3-row pivot to multiply
     --  the number of rows by 3
     --  and return actor_1 the first row in each
     --  set of 3, actor_2 for the second one
     --  and actor_3 for the third one
     (select case pv.row_num
                when 1 then actor_1
                when 2 then actor_2
                else actor_3
             end actor_id
      from movies as m,
           pivot as pv
      where pv.row_num <= 3) as m

The second version runs about twice as fast as the first onesignificantly faster.

The pivot and unpivot operators

As a possibly sad acknowledgment of the generally poor quality of database designs, SQL Server 2005 has introduced two operators called pivot and unpivot to perform the toppling of rows into columns and vice-versa, respectively. The previous employee_attributes example can be written as follows using the pivot operator:

select entity_id as employee_id,
from employee_attributes as employees
     pivot (max(attribute_value)
            for attribute_name in ([lastname],
                 as pivoted_employees
order by 2

The specific values in the attribute_name column that we want to appear as columns are listed in the for ... in clause, using a particular syntax that transforms the character data into column identifiers. There is an implicit group by applied to all the columns from employee_attributes that are not referenced in the pivot clause; we must be careful if we have other columns (for instance, an attribute_type column) than entity_id, as they may require an additional aggregation layer.

The unpivot operator performs the reverse operation, and allows us to see the link between movie and actor as a more logical collection of (movie_id, actor_id) pairs by writing:

select movie_id, actor_type, actor_id
from movies
     unpivot (actor_id for actor_type in ([actor_1],
                                          [actor_3])) as movie_actors

Note that this query doesn't exactly produce the result we want, since it introduces the name of the original column as a virtual actor_type column. There is no need to qualify actors as actor_1, actor_2, or actor_3, and once again the query may need to be wrapped into another query that only returns movie_id and actor_id.

The use of a pivot table, or of the pivot and unpivot operators, is a very interesting technique that can help extricate us from more than one quagmire. The support for pivoting operators by major database systems is not, of course, to be interpreted as an endorsement of bad design, but as an example of realpolitik.

Pivot tables and operators can be a useful technique in their own right, but they should never be used as a means of glossing over the inadequacies of a bad design.

Single Columns That Should Have Been Something Else

Some designers of our movie database may well have been sensitive to the limitation on the number of actors we may associate with one movie. Trying to solve design issues with a creative use of irrelevant techniques, someone may have come up with a "bright idea": what about storing the actor identifiers as a comma-separated string in one wide actors column? For instance:

first actor id, second actor id, ...

And so much for the first normal form.... The big design mistake here is to store several pieces of data that we need to handle one by one into one column. There would be no issue if a complex stringfor instance a lengthy XML messagewere considered as an opaque object by the DBMS and handled as if it were an atomic item. But that's not the case here. Here we have several values in one column, and we do want to treat and manipulate each value individually. We are in trouble.

There are only two workable solutions with a creative design of this sort:

  • Scrapping it and rewriting everything. This is, of course, by far the best solution.

  • When delays, costs, and politics require a fast solution, the only way out may be to apply a creative SQL solution; once again, let me state that "solution" is probably not the best choice of words in this case, "fix" would be a better description.

I'll also point out that a more elaborate version of the same mistake could use an "XML type" column; I am going to use simple character-string manipulation functions in my example, but they could as well be XML-extracting functions.

Be warned: "creative SQL" is often a euphemism for ugly SQL!

First normal form on the fly

Our problem is to extract various individual components from a string of characters and return them one by one on separate rows. This is easier with some database systems (for instance, Oracle has a very rich set of string functions that noticeably eases the work) than with others. Conventions such as systematically starting or ending the string with a comma may further help us. We are not wimps, but real SQL developers, and we are therefore going to take the north face route and assume the worst:

  1. First, let's assume that our lists of identifiers are in the following form:

    id1, id2, id3, ..., idn

  2. Second, we shall also assume that the only sets of functions at our disposal are those common to the major database systems. We shall use Transact-SQL for our example and only use built-in functions. As you will see, a well-designed user function might ease both the writing and the performance of the resulting query.

Let's start with a (very small) movies table in which a list of actor identifiers is (wrongly) stored as an attribute of the movie:

1> select movie_id, actors
2> from movies
3> go
 movie_id              actors
 --------------------- ----------------------------------------
                     1 123,456,78,96
                     2 23,67,97
                     3 67,456

(3 rows affected)

The first step is to use as many rows from our pivot table as we may have characters in the actors stringarbitrarily set to a maximum length of 50 characters. We are going to multiply the number of rows in the movies table by this number, 50. We would naturally be rather reluctant to do something similar on millions of rows (as an aside, a function allowing us to return the position of the nth separator or the nth item in the string would make it necessary to multiply only by the maximum number of identifiers we can encounter, instead of by the maximum string length).

Our next move is to use the substring( ) function to successively get subsets (that can be null) of actors, starting at the first character, then moving to the second, and so forth, up to the last character (at most, the 50th character). We just have to use the row_num value from the pivot table to find the starting character of each substring. If we take for instance the string from the actors column that is associated to the movie identified by the value 1 for movie_id, we shall get something like:

123,456,78,96  associated to the row_num value 1
23,456,78,96   associated to the row_num value 2
3,456,78,96    associated to the row_num value 3
,456,78,96      ....

We'll compute these subsets in a column that we'll call substring1. Having these successive substrings, we can now check the position of the first comma in them. Our next move is to return as a column called substring2 the content of substring1 shifted by one position. We also locate the position of the first comma in substring2. These operations are illustrated in Figure. Among the various resulting rows, the only ones to be of interest are those marking the beginning of a new identifier in the string: the first row in the series that is associated with the row_num value of 1, and all the rows for which we find a comma in first position of substring1. For all these rows, the position of the comma in substring2 tells us the length of the identifier that we are trying to isolate.

Splitting-up a comma separated list

Translated into SQL code, here is what we get:

1> select row_num,
2>        movie_id,
3>        actors,
4>        first_sep,
5>        next_sep
6> from (select row_num,
7>              movie_id,
8>              actors,
9>              charindex(',', substring(actors, row_num,
10>                                      char_length(actors))) first_sep,
11>             charindex(',', substring(actors, row_num + 1,
12>                                  char_length(actors))) + 1 next_sep
13>      from movies,
14>           pivot
15>     where row_num <= 50) as q
16> where row_num = 1
17>    or first_sep = 1
18> go
 row_num     movie_id       actors             first_sep   next_sep
 ----------- -------------- ------------------ ----------- -----------
           1              1 123,456,78,96                4           4
           4              1 123,456,78,96                1           5
           8              1 123,456,78,96                1           4
          11              1 123,456,78,96                1           1
           1              2 23,67,97                     3           3
           3              2 23,67,97                     1           4
           6              2 23,67,97                     1           1
           1              3 67,456                       3           3
           3              3 67,456                       1           1

(9 rows affected)

If we accept that we must take some care to remove commas, and the particular cases of both the first and last identifiers in a list, getting the various identifiers is then reasonably straightforward, even if the resulting code is not for the faint-hearted:

1> select movie_id,
2>        actors,
3>        substring(actors,
4>                  case row_num
5>                    when 1 then 1
6>                    else row_num + 1
7>                  end,
8>                  case next_sep
9>                    when 1 then char_length(actors)
10>                   else
11>                     case row_num
12>                       when 1 then next_sep - 1
13>                       else next_sep - 2
14>                     end
15>                 end) as id
16> from (select row_num,
17>              movie_id,
18>              actors,
19>              first_sep,
20>              next_sep
21>       from (select row_num,
22>                    movie_id,
23>                    actors,
24>                    charindex(',', substring(actors, row_num,
25>                              char_length(actors))) first_sep,
26>                    charindex(',', substring(actors, row_num + 1,
27>                              char_length(actors))) + 1 next_sep
28>             from movies,
29>                  pivot
30>             where row_num <= 50) as q
31>       where row_num = 1
32>          or first_sep = 1) as q2
33> go
 movie_id         actors                         id
 ---------------- ------------------------------ -----------------
                1 123,456,78,96                  123
                1 123,456,78,96                  456
                1 123,456,78,96                  78
                1 123,456,78,96                  96
                2 23,67,97                       23
                2 23,67,97                       67
                2 23,67,97                       97
                3 67,456                         67
                3 67,456                         456

(9 rows affected)

We could have made the code slightly simpler by prepending and appending a comma to the actors column. I leave doing that as an exercise for the undaunted reader. Note that as the left alignment shows, the resulting id column is a string and should be explicitly converted to numeric before joining to the table that stores the actors' names.

The preceding case, besides being an interesting example of solving a SQL problem by successively wrapping queries, also comes as a healthy warning of what awaits us on the SQL side of things when tables are poorly designed.

Lifting the veil on the Chapter 7 mystery path explosion

You may remember that in Chapter 7 I described the materialized path model for tree representations. In that chapter I noted that it would be extremely convenient if we could "explode" a materialized path into the different materialized paths of all its ancestors. The advantage of this method is that when we want to walk a hierarchy from the bottom up, we can make efficient use of the index that should hopefully exist on the materialized path. If we don't "explode" the materialized path, the only way we have to find the ancestors of a given row is to specify a condition such as:

and offspring.materialized_path
    like concat(ancestor.materialized_path, '%')

Sadly, this is a construct that cannot use the index (for reasons that are quite similar to those in the credit card prefix problem of Chapter 8).

How can we "explode" the materialized path? The time has come to explain how we can pull that rabbit out of the hat. Since our node will have, in the general case, several ancestors, the very first thing we have to do is to multiply the rows by the number of preceding generations. In this way we'll be able to extract from the materialized path of our initial row (for example the row that represents the Hussar regiment under the command of Colonel de Marbot) the paths of the various ancestors. As always, the solution for multiplying rows is to use a pivot table. If we do it this time with MySQL, there is a function called substring_index( ) that very conveniently returns the substring of its first argument from the beginning up to the third argument occurrence of the second argument (hopefully, the example is easier to understand). To know how many rows we need from the pivot table, we just compute how many elements we have in the path in exactly the same way that we computed the depth in Chapter 7, namely by comparing the length of the path to the length of the same when separators have been stripped off. Here is the query, and the results:

mysql>  select mp.materialized_path,
    ->         substring_index( mp.materialized_path, '.', p.row_num )
    ->                                     as ancestor_path
    -> from materialized_path_model as mp,
    ->      pivot as p
    -> where mp.commander = 'Colonel de Marbot'
    ->   and p.row_num <= 1 + length( mp.materialized_path )
    ->                      - length(replace(mp.materialized_path, '.', ''));
| materialized_path | ancestor_path |
| F.         | F             |
| F.         | F.1           |
| F.         | F.1.5         |
| F.         | F.1.5.1       |
| F.         | F.     |
5 rows in set (0.00 sec)

 Python   SQL   Java   php   Perl 
 game development   web development   internet   *nix   graphics   hardware 
 telecommunications   C++ 
 Flash   Active Directory   Windows