April 8, 2011, 8:11 p.m.

posted by sunny

## Balanced Trees

The BST algorithms in the previous chapter work well for a wide variety of applications, but they have the problem of bad worst-case performance. What is more, it is embarrassingly true that the bad worst case for the standard BST algorithm, like that for quicksort, is one that is likely to occur in practice if the user of the algorithm is not watching for it. Files already in order, files with large numbers of duplicate keys, files in reverse order, files with alternating large and small keys, or files with any large segment having a simple structure can all lead to quadratic BST construction times and linear search times.

In the ideal case, we could keep our trees perfectly balanced, like the tree depicted in Figure 1. This structure corresponds to binary search and therefore allows us to guarantee that all searches can be completed in fewer than lg N + 1 comparisons, but it is expensive to maintain for dynamic insertions and deletions. The search performance guarantee holds for any BST in which all the external nodes are on the bottom one or two levels; there are many such BSTs, so we have some flexibility in arranging for our tree to be balanced. If we are satisfied with near-optimal trees, then we have even more flexibility. For example, there are a great many BSTs of height less than 2 lg N. If we relax our standard of balance but can still guarantee that our algorithms build only such BSTs, then we can provide the protection against bad worst-case performance that we would like to have in practical applications of a dynamic data structure. As a side benefit, we get better average-case performance as well.

##### 1. A large BST that is perfectly balanced

The external nodes in this BST all fall on one of two levels, and the number of comparisons for any search is the same as the number of comparisons that would be used by binary search for the same key (if the items were in an ordered array). The goal of a balanced-tree algorithm is to keep a BST as close as possible to being as well balanced as this one, while still supporting efficient dynamic insertion, deletion, and other dictionary ADT operations.
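The correspondence between a perfectly balanced BST and binary search can be made concrete. The following sketch (illustrative names — `build`, `height`, and the `Node` class here are not taken from the book's programs) roots each subtree at the median of a sorted array, exactly mirroring binary search's probe sequence, so the height of the resulting tree bounds the number of comparisons for any search:

```java
// Hypothetical sketch: building a perfectly balanced BST from a
// sorted array, so that search in the tree mirrors binary search.
class BalancedBuild {
    static class Node {
        int key;
        Node l, r;
        Node(int key) { this.key = key; }
    }

    // Root the subtree at the median of a[lo..hi], then recurse on
    // the two halves -- the same splitting that binary search does.
    static Node build(int[] a, int lo, int hi) {
        if (lo > hi) return null;
        int mid = (lo + hi) / 2;
        Node h = new Node(a[mid]);
        h.l = build(a, lo, mid - 1);
        h.r = build(a, mid + 1, hi);
        return h;
    }

    // Height counted in nodes: the worst-case number of key
    // comparisons for an unsuccessful search.
    static int height(Node h) {
        if (h == null) return 0;
        return 1 + Math.max(height(h.l), height(h.r));
    }

    public static void main(String[] args) {
        int[] a = new int[15];
        for (int i = 0; i < a.length; i++) a[i] = i;
        Node root = build(a, 0, a.length - 1);
        // For N = 15 every external node is on one level, and the
        // height is 4, below the lg N + 1 (about 4.9) bound.
        System.out.println(height(root)); // prints 4
    }
}
```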

One approach to producing better balance in BSTs is to rebalance them periodically and explicitly. Indeed, we can balance most BSTs completely in linear time, using the recursive method shown in Program 13.1 (see Exercise 13.4). Such rebalancing is likely to improve performance for random keys but does not provide guarantees against quadratic worst-case performance in a dynamic symbol table. On the one hand, the insertion time for a sequence of keys between rebalancing operations can grow quadratically in the length of the sequence; on the other hand, we do not want to rebalance huge trees frequently, because each rebalancing operation takes time at least linear in the size of the tree. This tradeoff makes it difficult to use global rebalancing to guarantee fast performance in dynamic BSTs. All the algorithms that we will consider, as they walk through the tree, do incremental, local operations that collectively improve the balance of the whole tree, yet they never have to walk through all the nodes in the way that Program 13.1 does.

The problem of providing guaranteed performance for symbol-table implementations based on BSTs gives us an excellent forum for examining precisely what we mean when we ask for performance guarantees. We shall see solutions to this problem that are prime examples of each of the three general approaches to providing performance guarantees in algorithm design: we can randomize, amortize, or optimize. We now consider each of these approaches briefly, in turn.

A randomized algorithm introduces random decision making into the algorithm itself in order to reduce dramatically the chance of a worst-case scenario (no matter what the input). We have already seen a prime example of this arrangement when we used a random element as the partitioning element in quicksort. In Sections 13.1 and 13.5, we shall examine randomized BSTs and skip lists—two simple ways to use randomization in symbol-table implementations to give efficient implementations of all the symbol-table ADT operations. These algorithms are simple and broadly applicable but went undiscovered for decades (see reference section). The analysis that proves these algorithms to be effective is not elementary, but the algorithms are simple to understand, implement, and put to practical use.
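To give a flavor of the randomized approach before Section 13.1, here is a minimal sketch of the core idea behind randomized BSTs: when inserting into a subtree of N nodes, make the new key the root of that subtree with probability 1/(N+1), using root insertion (rotations), and otherwise recurse as in a standard BST. All names here (`insert`, `insertT`, `rotR`, `rotL`, `fixN`) are illustrative, not the book's code:

```java
import java.util.Random;

// Sketch of randomized BST insertion: each key ends up at the root
// of its subtree with the probability it would have in a tree built
// from a random permutation, no matter what order keys arrive in.
class RandomizedBST {
    static class Node {
        int key, N = 1;       // N counts the nodes in this subtree
        Node l, r;
        Node(int k) { key = k; }
    }

    static final Random rnd = new Random(42); // fixed seed for the demo

    static int size(Node h) { return h == null ? 0 : h.N; }
    static void fixN(Node h) { h.N = 1 + size(h.l) + size(h.r); }

    static Node rotR(Node h) { Node x = h.l; h.l = x.r; x.r = h; fixN(h); fixN(x); return x; }
    static Node rotL(Node h) { Node x = h.r; h.r = x.l; x.l = h; fixN(h); fixN(x); return x; }

    // Root insertion: insert at a leaf, then rotate the new node up.
    static Node insertT(Node h, int key) {
        if (h == null) return new Node(key);
        if (key < h.key) { h.l = insertT(h.l, key); h = rotR(h); }
        else             { h.r = insertT(h.r, key); h = rotL(h); }
        return h;
    }

    static Node insert(Node h, int key) {
        if (h == null) return new Node(key);
        if (rnd.nextInt(size(h) + 1) == 0)   // with probability 1/(N+1),
            return insertT(h, key);          // put the new key at the root
        if (key < h.key) h.l = insert(h.l, key);
        else             h.r = insert(h.r, key);
        fixN(h);
        return h;
    }

    static int height(Node h) {
        return h == null ? 0 : 1 + Math.max(height(h.l), height(h.r));
    }

    public static void main(String[] args) {
        Node root = null;
        for (int i = 0; i < 1000; i++)       // sorted order: the worst
            root = insert(root, i);          // case for a plain BST
        System.out.println(size(root) + " keys, height " + height(root));
    }
}
```

Inserting keys in sorted order would give a plain BST a height of 1000; here the height stays logarithmic with high probability, because the coin flips make the tree's shape independent of the input order.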

### Program 13.1 Balancing a BST

This recursive method puts a BST into perfect balance in linear time, using the partitioning method `partR` from Program 12.21. We partition to put the median node at the root, then (recursively) do the same for the subtrees.

```java
private Node balanceR(Node h)
{
    if ((h == null) || (h.N == 1)) return h;
    h = partR(h, h.N/2);           // put the median node at the root
    h.l = balanceR(h.l);           // balance the left subtree
    h.r = balanceR(h.r);           // balance the right subtree
    fixN(h.l); fixN(h.r); fixN(h); // restore the subtree counts
    return h;
}
```
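Program 12.21 is not reproduced in this excerpt, so here is a self-contained sketch of the helpers that `balanceR` depends on, reconstructed from their descriptions: `partR` uses rotations to bring the node with the kth smallest key to the root, and `fixN` restores subtree counts. The `Node` fields and the simple `insert` and `height` helpers in the demo are illustrative:

```java
// Sketch of the partitioning and count-fixing helpers assumed by
// balanceR, plus a demo that balances a degenerate tree.
class BSTBalance {
    static class Node {
        int key, N = 1;       // N counts the nodes in this subtree
        Node l, r;
        Node(int k) { key = k; }
    }

    static int size(Node h) { return h == null ? 0 : h.N; }
    static void fixN(Node h) { if (h != null) h.N = 1 + size(h.l) + size(h.r); }

    static Node rotR(Node h) { Node x = h.l; h.l = x.r; x.r = h; fixN(h); fixN(x); return x; }
    static Node rotL(Node h) { Node x = h.r; h.r = x.l; x.l = h; fixN(h); fixN(x); return x; }

    // Bring the node with the kth smallest key (k is 0-based) to the
    // root of the subtree, by partitioning recursively and rotating.
    static Node partR(Node h, int k) {
        int t = size(h.l);
        if (t > k) { h.l = partR(h.l, k);         h = rotR(h); }
        if (t < k) { h.r = partR(h.r, k - t - 1); h = rotL(h); }
        return h;
    }

    // Program 13.1: perfect balance in linear time.
    static Node balanceR(Node h) {
        if ((h == null) || (h.N == 1)) return h;
        h = partR(h, h.N/2);
        h.l = balanceR(h.l);
        h.r = balanceR(h.r);
        fixN(h.l); fixN(h.r); fixN(h);
        return h;
    }

    // Standard unbalanced BST insertion, for the demo only.
    static Node insert(Node h, int key) {
        if (h == null) return new Node(key);
        if (key < h.key) h.l = insert(h.l, key);
        else             h.r = insert(h.r, key);
        fixN(h);
        return h;
    }

    static int height(Node h) {
        return h == null ? 0 : 1 + Math.max(height(h.l), height(h.r));
    }

    public static void main(String[] args) {
        Node root = null;
        for (int i = 0; i < 15; i++)   // sorted insertions build a
            root = insert(root, i);    // right spine of height 15
        root = balanceR(root);
        System.out.println(height(root)); // prints 4: perfectly balanced
    }
}
```

The demo shows the tradeoff discussed above: one call costs linear time, which is why the balanced-tree algorithms in this chapter prefer incremental, local restructuring instead.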

An amortization approach does extra work at one time to avoid more work later, so that it can provide guaranteed upper bounds on the average per-operation cost (the total cost of all operations divided by the number of operations). In Section 13.2, we shall examine splay BSTs, a variant of BSTs that we can use to provide such guarantees for symbol-table implementations. The development of this method was one impetus for the development of the concept of amortization (see reference section). The algorithm is a straightforward extension of the root insertion method that we discussed in Chapter 12, but the analysis that proves the performance bounds is sophisticated.

An optimization approach takes the trouble to provide performance guarantees for every operation. Various methods have been developed that take this approach, some dating back to the 1960s. These methods require that we maintain some structural information in the trees; programmers typically find the algorithms cumbersome to implement. In this chapter, we shall examine two simple abstractions that not only make the implementation straightforward but also lead to near-optimal upper bounds on the costs.

After examining implementations of symbol-table ADTs with guaranteed fast performance using each of these three approaches, we conclude the chapter with a comparison of performance characteristics. Beyond the differences suggested by the differing natures of the performance guarantees that each of the algorithms provides, the methods each carry a (relatively slight) cost in time or space to provide those guarantees; the development of a truly optimal balanced-tree ADT is still a research goal. Still, the algorithms that we consider in this chapter are all important ones that can provide fast implementations of search and insert (and several other symbol-table ADT operations) in dynamic symbol tables for a variety of applications.
