Implementing Set






Implementing Set

When we used the methods of Collection in the examples of Chapter 12, we emphasized that they would work with any implementation of Collection. What if we had decided that we would use one of the Set implementations from the Collections Framework? We would have had to choose between the various concrete implementations that the Framework provides, which differ both in how fast they perform the basic operations of add, contains, and iteration, and in the order in which their iterators return their elements. In this section and the next we will look at these differences, then at the end of the Chapter we will summarize the comparative performance of the different implementations.

There are six concrete implementations of Set in the Collections Framework. Figure shows their relationship to Set and its subinterfaces SortedSet and NavigableSet. In this section, we will look at HashSet, LinkedHashSet, CopyOnWriteArraySet and EnumSet.

Implementations of the Set interface


We will discuss SortedSet and NavigableSet, together with their implementations, TReeSet and ConcurrentSkipListSet, in Section 13.2.

HashSet

This class is the most commonly used implementation of Set. As the name implies, it is implemented by a hash table, an array in which elements are stored at a position derived from their contents. Since hash tables store and retrieve elements by their content, they are well suited to implementing the operations of Set (the Collections Framework also uses them for various implementations of Map). For example, to implement contains(Object o) you would look for the element o and return true if it were found.

An element's position in a hash table is calculated by a hash function of its contents. Hash functions are designed to give, as far as possible, an even spread of results (hash codes) over the element values that might be stored. For example, here is code like that used in the String class to calculate a hash code:

int hash = 0;
for (char ch : str.toCharArray()) {
  hash = hash * 31 + ch;
}

Traditionally, hash tables obtain an index from the hash code by taking the remainder after division by the table length. The Collections Framework classes actually use bit masking rather than division. Since that means it is the pattern of bits at the low end of the hash code that is significant, prime numbers (such as 31, here) are used in calculating the hash code because multiplying by primes will not tend to shift information away from the low endas would multiplying by a power of 2, for example.

A moment's thought will show that, unless your table has more locations than there are values that might be stored in it, sometimes two distinct values must hash to the same location in the hash table. For instance, no int-indexed table can be large enough to store all string values without collisions. We can minimize the problem with a good hash function one which spreads the elements out equally in the tablebut, when collisions do occur, we have to have a way of keeping the colliding elements at the same table location or bucket. This is often done by storing them in a linked list, as shown in Figure. We will look at linked lists in more detail as part of the implementations of ConcurrentSkipListSet (see Section 13.2.2) but, for now, it's enough to see that elements stored at the same bucket can still be accessed, at the cost of following a chain of cell references. Figure shows the situation resulting from running this code on Sun's implementation of Java 5:

A hash table with chained overflow


Set<Character> s1 = new HashSet<Character>(8);
s1.add('a');
s1.add('b');
s1.add('j');

The index values of the table elements have been calculated by using the bottom three bits (for a table of length 8) of the hash code of each element. In this implementation, a Character's hash code is just the Unicode value of the character it contains. (In practice, of course, a hash table would be much bigger than this. Also, this diagram is simplified from the real situation; because HashSet is actually implemented by a specialized HashMap, each of the cells in the chain contains not one but two references, to a key and a value (see Chapter 16). Only the key is shown in this diagram because, when a hash table is being used to represent a set, all values are the sameonly the presence of the key is significant.)

As long as there are no collisions, the cost of inserting or retrieving an element is constant. As the hash table fills, collisions become more likely; assuming a good hash function, the probability of a collision in a lightly loaded table is proportional to its load, defined as the number of elements in the table divided by its capacity (the number of buckets). If a collision does take place, a linked list has to be created and subsequently traversed, adding an extra cost to insertion proportional to the number of elements in the list. If the size of the hash table is fixed, performance will worsen as more elements are added and the load increases. To prevent this from happening, the table size is increased by rehashingcopying to a new and larger tablewhen the load reaches a specified threshold (its load factor).

Iterating over a hash table requires each bucket to be examined to see whether it is occupied and therefore costs a time proportional to the capacity of the hash table plus the number of elements it contains. Since the iterator examines each bucket in turn, the order in which elements are returned depends on their hash codes, so there is no guarantee as to the order in which the elements will be returned. The hash table shown in Figure yields its elements in order of descending table index and forward traversal of the linked lists. Printing it produces the following output:

[j, b, a]

Later in this section we will look at LinkedHashSet, a variant of this implementation with an iterator that does return elements in their insertion order.

The chief attraction of a hash table implementation for sets is the (ideally) constant-time performance for the basic operations of add, remove, contains, and size. Its main disadvantage is its iteration performance; since iterating through the table involves examining every bucket, its cost is proportional to the table size regardless of the size of the set it contains.

HashSet has the standard constructors that we introduced in Section 12.3, together with two additional constructors:

HashSet(int initialCapacity)
HashSet(int initialCapacity, float loadFactor)

Both of these constructors create an empty set but allow some control over the size of the underlying table, creating one with a length of the next-largest power of 2 after the supplied capacity. Most of the hash tablebased implementations in the Collections Framework have similar constructors, although Joshua Bloch, the original designer of the Framework, has told us that new classes will no longer usually have configuration parameters like the load factor; they are not generally useful, and they limit the possibilities of improving implementations at a later date.

HashSet is unsychronized and not thread-safe; its iterators are fail-fast.

LinkedHashSet

This class inherits from HashSet, still implementing Set and refining the contract of its superclass in only one respect: it guarantees that its iterators will return their elements in the order in which they were first added. It does this by maintaining a linked list of the set elements, as shown by the curved arrows in Figure. The situation in the figure would result from this code:

A linked hash table


Set<Character> s2 = new LinkedHashSet<Character>(8);
Collections.addAll(s2, 'a', 'b', 'j');
// iterators of a LinkedHashSet return their elements in proper order:
assert s2.toString().equals("[a, b, j]");

The linked structure also has a useful consequence in terms of improved performance for iteration: next performs in constant time, as the linked list can be used to visit each element in turn. This is in contrast to HashSet, for which every bucket in the hash table must be visited whether it is occupied or not, but the overhead involved in maintaining the linked list means that you would choose LinkedHashSet in preference to HashSet only if the order or the efficiency of iteration were important for your application.

The constructors for LinkedHashSet provide the same facilities as those of HashSet for configuring the underlying hash table. Like HashSet, it is unsychronized and not thread-safe; its iterators are fail-fast.

CopyOnWriteArraySet

In functional terms, CopyOnWriteArraySet is another straightforward implementation of the Set contract, but with quite different performance characteristics from HashSet. This class is implemented as a thin wrapper around an instance of CopyOnWriteArrayList, which in turn is backed by an array. This array is treated as immutable; a change to the contents of the set results in an entirely new array being created. So add has complexity O(n), as does contains, which has to be implemented by a linear search. Clearly you wouldn't use CopyOnWriteArraySet in a context where you were expecting many searches or insertions. But the array implementation means that iteration costs O(1) per elementfaster than HashSetand it has one advantage which is really compelling in some applications: it provides thread safety (see Section 11.5) without adding to the cost of read operations. This is in contrast to those collections which use locking to achieve thread safety for all operations (for example, the synchronized collections of Section 17.3.1). Such collections are a bottleneck in multi-threaded use because a thread must get exclusive access to the collection object before it can use it in any way. By contrast, read operations on copy-on-write collections are implemented on the backing array, which is never modified after its creation, so they can be used by any thread without danger of interference from a concurrent write operation.

When would you want to use a set with these characteristics? In fairly specialized cases; one that is quite common is in the implementation of the Subject-Observer design pattern (see Section 9.5), which requires events to be notified to a set of observers. This set must not be modified during the process of notification; with locking set implementations, read and write operations share the overhead necessary to ensure this, whereas with CopyOnWriteArraySet the overhead is carried entirely by write operations. This makes sense for Subject-Observer; in typical uses of this pattern, event notifications occur much more frequently than changes to the listener set.

Iterators for CopyOnWriteArraySet can be used only to read the set. When they are created, they are attached to the instance of the backing array being used by the set at that moment. Since no instance of this array should ever be modified, the iterators' remove method is not implemented. These are snapshot iterators (see Section 11.5); they reflect the state of the set at the time it was created, and can subsequently be traversed without any danger of interference from threads modifying the set from which they were derived.

Since there are no configuration parameters for CopyOnWriteArraySet, the constructors are just the standard ones discussed in Section 12.3.

EnumSet

This class exists to take advantage of the efficient implementations that are possible when the number of possible elements is fixed and a unique index can be assigned to each. These two conditions hold for a set of elements of the same Enum; the number of keys is fixed by the constants of the enumerated type, and the ordinal method returns values that are guaranteed to be unique to each constant. In addition, the values that ordinal returns form a compact range, starting from zeroideal, in fact, for use as array indices or, in the standard implementation, indices of a bit vector. So add, remove, and contains are implemented as bit manipulations, with constant-time performance. Bit manipulation on a single word is extremely fast, and a long value can be used to represent EnumSets over enum types with up to 64 values. Larger enums can be treated in a similar way, with some overhead, using more than one word for the representation.

EnumSet is an abstract class that implements these different representations by means of different package-private subclasses. It hides the concrete implementation from the programmer, instead exposing factory methods that call the constructor for the appropriate subclass. The following group of factory methods provide ways of creating EnumSets with different initial contents: empty, specified elements only, or all elements of the enum.

<E extends Enum<E>> EnumSet<E> of(E first, E... rest)
        // create a set initially containing the specified elements
<E extends Enum<E>> EnumSet<E> range(E from, E to)
        // create a set initially containing all of the elements in
        // the range defined by the two specified endpoints
<E extends Enum<E>> EnumSet<E> allOf(Class<E> elementType)
        // create a set initially containing all elements in elementType
<E extends Enum<E>> EnumSet<E> noneOf(Class<E> elementType)
        // create a set of elementType, initially empty

An EnumSet contains the reified type of its elements, which is used at run time for checking the validity of new entries. This type is supplied by the above factory methods in two different ways. The methods of and range receive at least one enum argument, which can be queried for its declaring class (that is, the Enum that it belongs to). For allOf and noneOf, which have no enum arguments, a class token is supplied instead.

Common cases for EnumSet creation are optimized by the second group of methods, which allow you to efficiently create sets with one, two, three, four, or five elements of an enumerated type.

<E extends Enum<E>> EnumSet<E> of(E e)
<E extends Enum<E>> EnumSet<E> of(E e1, E e2)
<E extends Enum<E>> EnumSet<E> of(E e1, E e2, E e3)
<E extends Enum<E>> EnumSet<E> of(E e1, E e2, E e3, E e4)
<E extends Enum<E>> EnumSet<E> of(E e1, E e2, E e3, E e4, E e5)

The third set of methods allows the creation of an EnumSet from an existing collection:

<E extends Enum<E>> EnumSet<E> copyOf(EnumSet<E> s)
      // create an EnumSet with the same element type as s, and
      // with the same elements
<E extends Enum<E>> EnumSet<E> copyOf(Collection<E> c)
      // create an EnumSet from the elements of c, which must contain
      // at least one element
<E extends Enum<E>> EnumSet<E> complementOf(EnumSet<E> s)
      // create an EnumSet with the same element type as s,
      // containing the elements not in s

The collection supplied as the argument to the second version of copyOf must be nonempty so that the element type can be determined.

In use, EnumSet obeys the contract for Set, with the added specification that its iterators will return their elements in their natural order (the order in which their enum constants are declared). It is not thread-safe, but unlike the unsynchronized general-purpose collections, its iterators are not fail-fast. They may be either snapshot or weakly consistent; to be conservative, the contract guarantees only that they will be weakly consistent (see Section 11.5).



 Python   SQL   Java   php   Perl 
 game development   web development   internet   *nix   graphics   hardware 
 telecommunications   C++ 
 Flash   Active Directory   Windows