Adding Redundant Data






Adding Redundant Data

A design that satisfies the guidelines given in the previous section simplifies the formulation of SELECT and UPDATE statements. Processing update statements is fast because each "fact" is registered only once. Processing SELECT statements is another story. Precisely because each fact is recorded only once, many joins must be executed. Processing joins and other SELECT statements can be very time-consuming. One way to tackle this problem is to include redundant data in a table. Here is an example of a join and another SELECT statement, both of which can be executed faster when redundant data has been added.

4. Get the name of each player who incurred at least one penalty.

SELECT   NAME, AMOUNT
FROM     PENALTIES AS PEN, PLAYERS AS P
WHERE    PEN.PLAYERNO = P.PLAYERNO

SQL must perform a join to process this statement. The join can be avoided by storing the NAME column as redundant data in the PENALTIES table.

The new PENALTIES table is:

PAYMENTNO  PLAYERNO  NAME       PAYMENT_DATE  AMOUNT
---------  --------  ---------  ------------  ------
        1         6  Parmenter  1980-12-08    100.00
        2        44  Baker      1981-05-05     75.00
        3        27  Collins    1983-09-10    100.00
        4       104  Moorman    1984-12-08     50.00
        5        44  Baker      1980-12-08     25.00
        6         8  Newcastle  1980-12-08     25.00
        7        44  Baker      1980-12-30     30.00
        8        27  Collins    1984-11-12     75.00

The statement then becomes:

SELECT   NAME, AMOUNT
FROM     PENALTIES

This SELECT statement will definitely be executed faster than the previous one.

This method of adding redundant data is sometimes called denormalization. A disadvantage of denormalization is the need to store some facts more than once. The names of players, for example, are now recorded in the PLAYERS table and in the PENALTIES table. Updating the name of a player requires two separate update statements. Another disadvantage of denormalization is that recording the same fact in more than one place uses twice as much storage space. With denormalization, you must weigh the relative importance of faster execution time of SELECT statements against slower execution time of updates and the storage space needed. In practice, this means that denormalization is used more often for data warehouses than for transaction databases.

5. Get, for each player, the total amount of penalties incurred by him or her.

SELECT   PLAYERNO, SUM(AMOUNT) AS TOTAL
FROM     PLAYERNO LEFT OUTER JOIN PENALTIES
         ON PLAYERS.PLAYERNO = PENALTIES.PLAYERNO
GROUP BY PLAYERNO
ORDER BY 1

The result is:

PLAYERNO   TOTAL
--------  ------
       2    0.00
       6  100.00
       7    0.00
       8   25.00
      27  175.00
      28    0.00
      39    0.00
      44  130.00
      57    0.00
      83    0.00
      95    0.00
     100    0.00
     104   50.00
     112    0.00

This statement could be greatly simplified if the total amount of penalties were registered in the PLAYERS table. The statement then would become:

SELECT   PLAYERNO, TOT_AMOUNT
FROM     PLAYERS

In this way, the processing time would be greatly reduced, but here also the same disadvantages as for denormalization are found: more updates and duplication of data. We have to conclude, however, that the performance of OLAP and other business intelligence tools improves considerably as a result of denormalization.

In both examples, redundancy took the form of adding one extra column. Creating an entirely new redundant table is sometimes an attractive alternative. (Invent an example for yourself.) Guideline 5 can be derived from this story now.

Guideline 5: Add Redundant Data When the Processing Time of SELECT Statements Is Not Acceptable

Exercise 19.3:

Create a design for the MATCHES table so that the following query would no longer need a join. Additionally, give the changed formulation of the SELECT statement.

SELECT   M.MATCHNO, M.TEAMNO, T.DIVISION
FROM     MATCHES AS M INNER JOIN TEAMS AS T
         ON M.TEAMNO = T.TEAMNO

Exercise 19.4:

Give an alternative design for the PLAYERS table so that the following query can be answered without joins and subqueries: Find the number and name of each player who has, for a given team, won more sets than the average number of sets won for any team.



 Python   SQL   Java   php   Perl 
 game development   web development   internet   *nix   graphics   hardware 
 telecommunications   C++ 
 Flash   Active Directory   Windows