Adding Redundant Data
A design that satisfies the guidelines given in the previous section simplifies the formulation of SELECT and UPDATE statements. Processing update statements is fast because each "fact" is registered only once. Processing SELECT statements is another story. Precisely because each fact is recorded only once, many joins must be executed. Processing joins and other SELECT statements can be very time-consuming. One way to tackle this problem is to include redundant data in a table. Here is an example of a join and another SELECT statement, both of which can be executed faster when redundant data has been added.
4. Get the name of each player who incurred at least one penalty.
SELECT NAME, AMOUNT
FROM PENALTIES AS PEN, PLAYERS AS P
WHERE PEN.PLAYERNO = P.PLAYERNO
SQL must perform a join to process this statement. The join can be avoided by storing the NAME column as redundant data in the PENALTIES table.
The new PENALTIES table is:
PAYMENTNO PLAYERNO NAME PAYMENT_DATE AMOUNT
--------- -------- --------- ------------ ------
1 6 Parmenter 1980-12-08 100.00
2 44 Baker 1981-05-05 75.00
3 27 Collins 1983-09-10 100.00
4 104 Moorman 1984-12-08 50.00
5 44 Baker 1980-12-08 25.00
6 8 Newcastle 1980-12-08 25.00
7 44 Baker 1980-12-30 30.00
8 27 Collins 1984-11-12 75.00
The statement then becomes:
SELECT NAME, AMOUNT
This SELECT statement will definitely be executed faster than the previous one.
This method of adding redundant data is sometimes called denormalization. A disadvantage of denormalization is the need to store some facts more than once. The names of players, for example, are now recorded in the PLAYERS table and in the PENALTIES table. Updating the name of a player requires two separate update statements. Another disadvantage of denormalization is that recording the same fact in more than one place uses twice as much storage space. With denormalization, you must weigh the relative importance of faster execution time of SELECT statements against slower execution time of updates and the storage space needed. In practice, this means that denormalization is used more often for data warehouses than for transaction databases.
5. Get, for each player, the total amount of penalties incurred by him or her.
SELECT PLAYERNO, SUM(AMOUNT) AS TOTAL
FROM PLAYERNO LEFT OUTER JOIN PENALTIES
ON PLAYERS.PLAYERNO = PENALTIES.PLAYERNO
GROUP BY PLAYERNO
ORDER BY 1
The result is:
This statement could be greatly simplified if the total amount of penalties were registered in the PLAYERS table. The statement then would become:
SELECT PLAYERNO, TOT_AMOUNT
In this way, the processing time would be greatly reduced, but here also the same disadvantages as for denormalization are found: more updates and duplication of data. We have to conclude, however, that the performance of OLAP and other business intelligence tools improves considerably as a result of denormalization.
In both examples, redundancy took the form of adding one extra column. Creating an entirely new redundant table is sometimes an attractive alternative. (Invent an example for yourself.) Guideline 5 can be derived from this story now.
Guideline 5: Add Redundant Data When the Processing Time of SELECT Statements Is Not Acceptable
Create a design for the MATCHES table so that the following query would no longer need a join. Additionally, give the changed formulation of the SELECT statement.
SELECT M.MATCHNO, M.TEAMNO, T.DIVISION
FROM MATCHES AS M INNER JOIN TEAMS AS T
ON M.TEAMNO = T.TEAMNO
Give an alternative design for the PLAYERS table so that the following query can be answered without joins and subqueries: Find the number and name of each player who has, for a given team, won more sets than the average number of sets won for any team.