Feb. 26, 2011, 9:29 a.m.
posted by datamaker
Reducing Lock Contention
We've seen that serialization hurts scalability and that context switches hurt performance. Contended locking causes both, so reducing lock contention can improve both performance and scalability.
Access to resources guarded by an exclusive lock is serialized: only one thread at a time may access it. Of course, we use locks for good reasons, such as preventing data corruption, but this safety comes at a price. Persistent contention for a lock limits scalability.
Two factors influence the likelihood of contention for a lock: how often that lock is requested and how long it is held once acquired. If the product of these factors is sufficiently small, then most attempts to acquire the lock will be uncontended, and lock contention will not pose a significant scalability impediment. If, however, the lock is in sufficiently high demand, threads will block waiting for it; in the extreme case, processors will sit idle even though there is plenty of work to do.
Narrowing Lock Scope ("Get in, Get Out")
An effective way to reduce the likelihood of contention is to hold locks as briefly as possible. This can be done by moving code that doesn't require the lock out of synchronized blocks, especially for expensive operations and potentially blocking operations such as I/O.
It is easy to see how holding a "hot" lock for too long can limit scalability; we saw an example of this in SynchronizedFactorizer in Chapter 2. If an operation holds a lock for 2 milliseconds and every operation requires that lock, throughput can be no greater than 500 operations per second, no matter how many processors are available. Reducing the time the lock is held to 1 millisecond improves the lock-induced throughput limit to 1000 operations per second.
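The arithmetic behind that limit can be written down directly. The following is only an illustrative sketch; the class and method names are hypothetical and do not come from the original text. It assumes every operation must hold the same exclusive lock for a fixed time.

```java
// Hypothetical helper: upper bound on throughput when every operation
// must hold one shared exclusive lock for holdTimeMillis milliseconds.
final class LockThroughputBound {
    private LockThroughputBound() {}

    // Only one thread can hold the lock at a time, so at most
    // 1000 / holdTimeMillis operations can complete per second,
    // regardless of the number of processors.
    static double maxOpsPerSecond(double holdTimeMillis) {
        return 1000.0 / holdTimeMillis;
    }
}
```

With a hold time of 2 milliseconds the bound is 500 operations per second; halving the hold time doubles the bound.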
AttributeStore in Listing 11.4 shows an example of holding a lock longer than necessary. The userLocationMatches method looks up the user's location in a Map and uses regular expression matching to see if the resulting value matches the supplied pattern. The entire userLocationMatches method is synchronized, but the only portion of the code that actually needs the lock is the call to Map.get.
Listing 11.4. Holding a Lock Longer than Necessary.
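The listing itself is not reproduced in this post. The sketch below is a reconstruction from the description above, not the original code: the key format users.name.location and the method name userLocationMatches come from the text, while the setUserLocation helper is a hypothetical addition so the class can be exercised.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class AttributeStore {
    // Guarded by "this".
    private final Map<String, String> attributes = new HashMap<>();

    // Hypothetical helper, not described in the text.
    public synchronized void setUserLocation(String name, String location) {
        attributes.put("users." + name + ".location", location);
    }

    // The whole method holds the lock, although only attributes.get needs it.
    public synchronized boolean userLocationMatches(String name, String regexp) {
        String key = "users." + name + ".location";
        String location = attributes.get(key);
        if (location == null) {
            return false;
        }
        return Pattern.matches(regexp, location);
    }
}
```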
BetterAttributeStore in Listing 11.5 rewrites AttributeStore to significantly reduce the lock duration. The first step is to construct the Map key associated with the user's location, a string of the form users.name.location. This entails instantiating a StringBuilder object, appending several strings to it, and instantiating the result as a String. After the location has been retrieved, the regular expression is matched against the resulting location string. Because constructing the key string and processing the regular expression do not access shared state, they need not be executed with the lock held. BetterAttributeStore factors these steps out of the synchronized block, thus reducing the time the lock is held.
Listing 11.5. Reducing Lock Duration.
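As with the previous listing, the code is missing from this post, so what follows is a sketch based on the description rather than the original. Only the Map lookup runs inside the synchronized block; key construction and regular expression matching run outside it. The setUserLocation helper is again a hypothetical addition.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class BetterAttributeStore {
    // Guarded by "this".
    private final Map<String, String> attributes = new HashMap<>();

    // Hypothetical helper, not described in the text.
    public void setUserLocation(String name, String location) {
        String key = "users." + name + ".location";
        synchronized (this) {
            attributes.put(key, location);
        }
    }

    public boolean userLocationMatches(String name, String regexp) {
        // Building the key touches no shared state, so no lock is needed.
        String key = "users." + name + ".location";
        String location;
        synchronized (this) {
            // The only step that reads shared state.
            location = attributes.get(key);
        }
        if (location == null) {
            return false;
        }
        // Regex matching works on a local String, outside the lock.
        return Pattern.matches(regexp, location);
    }
}
```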
Reducing the scope of the lock in userLocationMatches substantially reduces the number of instructions that are executed with the lock held. By Amdahl's law, this removes an impediment to scalability because the amount of serialized code is reduced.
Because AttributeStore has only one state variable, attributes, we can improve it further by the technique of delegating thread safety (Section 4.3). By replacing attributes with a thread-safe Map (a Hashtable, synchronizedMap, or ConcurrentHashMap), AttributeStore can delegate all its thread safety obligations to the underlying thread-safe collection. This eliminates the need for explicit synchronization in AttributeStore, reduces the lock scope to the duration of the Map access, and removes the risk that a future maintainer will undermine thread safety by forgetting to acquire the appropriate lock before accessing attributes.
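A minimal sketch of the delegating variant, assuming ConcurrentHashMap is chosen as the thread-safe Map. The class name DelegatingAttributeStore and the setUserLocation helper are hypothetical; the text does not name them.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

class DelegatingAttributeStore {
    // Thread safety is delegated entirely to the concurrent map;
    // there is no explicit synchronization in this class.
    private final Map<String, String> attributes = new ConcurrentHashMap<>();

    // Hypothetical helper, not described in the text.
    public void setUserLocation(String name, String location) {
        attributes.put("users." + name + ".location", location);
    }

    public boolean userLocationMatches(String name, String regexp) {
        String key = "users." + name + ".location";
        String location = attributes.get(key);
        if (location == null) {
            return false;
        }
        return Pattern.matches(regexp, location);
    }
}
```

This works here because the class has a single state variable and no invariant that spans more than one Map operation.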
While shrinking synchronized blocks can improve scalability, a synchronized block can be too small: operations that need to be atomic (such as updating multiple variables that participate in an invariant) must be contained in a single synchronized block. And because the cost of synchronization is nonzero, breaking one synchronized block into multiple synchronized blocks (correctness permitting) at some point becomes counterproductive in terms of performance. The ideal balance is of course platform-dependent, but in practice it makes sense to worry about the size of a synchronized block only when you can move "substantial" computation or blocking operations out of it.
Reducing Lock Granularity
The other way to reduce the fraction of time that a lock is held (and therefore the likelihood that it will be contended) is to have threads ask for it less often. This can be accomplished by lock splitting and lock striping, which involve using separate locks to guard multiple independent state variables previously guarded by a single lock. These techniques reduce the granularity at which locking occurs, potentially allowing greater scalability, but using more locks also increases the risk of deadlock.
As a thought experiment, imagine what would happen if there were only one lock for the entire application instead of a separate lock for each object. Then execution of all synchronized blocks, regardless of their lock, would be serialized. With many threads competing for the global lock, the chance that two threads want the lock at the same time increases, resulting in more contention. So if lock requests were instead distributed over a larger set of locks, there would be less contention. Fewer threads would be blocked waiting for locks, thus increasing scalability.
If a lock guards more than one independent state variable, you may be able to improve scalability by splitting it into multiple locks that each guard different variables. This results in each lock being requested less often.
ServerStatus in Listing 11.6 shows a portion of the monitoring interface for a database server that maintains the set of currently logged-on users and the set of currently executing queries. As a user logs on or off or query execution begins or ends, the ServerStatus object is updated by calling the appropriate add or remove method. The two types of information are completely independent; ServerStatus could even be split into two separate classes with no loss of functionality.
Instead of guarding both users and queries with the ServerStatus lock, we can instead guard each with a separate lock, as shown in Listing 11.7. After splitting the lock, each new finer-grained lock will see less locking traffic than the original coarser lock would have. (Delegating to a thread-safe Set implementation for users and queries instead of using explicit synchronization would implicitly provide lock splitting, as each Set would use a different lock to guard its state.)
Splitting a lock into two offers the greatest possibility for improvement when the lock is experiencing moderate but not heavy contention. Splitting locks that are experiencing little contention yields little net improvement in performance or throughput, although it might increase the load threshold at which performance starts to degrade due to contention. Splitting locks experiencing moderate contention might actually turn them into mostly uncontended locks, which is the most desirable outcome for both performance and scalability.
Listing 11.6. Candidate for Lock Splitting.
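The listing body is not included in this post. A sketch of what such a class might look like, based on the description of ServerStatus above, follows. The addUser and addQuery names appear in the text; the remove methods follow the same pattern, and the count accessors are hypothetical additions for inspection.

```java
import java.util.HashSet;
import java.util.Set;

class ServerStatus {
    // Both sets are guarded by "this", even though they are independent.
    private final Set<String> users = new HashSet<>();
    private final Set<String> queries = new HashSet<>();

    public synchronized void addUser(String u) { users.add(u); }
    public synchronized void addQuery(String q) { queries.add(q); }
    public synchronized void removeUser(String u) { users.remove(u); }
    public synchronized void removeQuery(String q) { queries.remove(q); }

    // Hypothetical accessors, not described in the text.
    public synchronized int userCount() { return users.size(); }
    public synchronized int queryCount() { return queries.size(); }
}
```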
Listing 11.7. ServerStatus Refactored to Use Split Locks.
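The refactored listing is also missing here. One way the split might look is sketched below, assuming each set serves as its own lock. The class is named SplitServerStatus only to keep it distinct from the earlier sketch, and the count accessors are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class SplitServerStatus {
    // Guarded by "users".
    private final Set<String> users = new HashSet<>();
    // Guarded by "queries".
    private final Set<String> queries = new HashSet<>();

    public void addUser(String u) {
        synchronized (users) { users.add(u); }
    }

    public void addQuery(String q) {
        synchronized (queries) { queries.add(q); }
    }

    public void removeUser(String u) {
        synchronized (users) { users.remove(u); }
    }

    public void removeQuery(String q) {
        synchronized (queries) { queries.remove(q); }
    }

    // Hypothetical accessors, not described in the text.
    public int userCount() {
        synchronized (users) { return users.size(); }
    }

    public int queryCount() {
        synchronized (queries) { return queries.size(); }
    }
}
```

A thread calling addUser no longer competes with a thread calling addQuery, because the two methods acquire different locks.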
Splitting a heavily contended lock into two is likely to result in two heavily contended locks. While this will produce a small scalability improvement by enabling two threads to execute concurrently instead of one, it still does not dramatically improve prospects for concurrency on a system with many processors. The lock splitting example in the ServerStatus classes does not offer any obvious opportunity for splitting the locks further.
Lock splitting can sometimes be extended to partition locking on a variable-sized set of independent objects, in which case it is called lock striping. For example, the implementation of ConcurrentHashMap uses an array of 16 locks, each of which guards 1/16 of the hash buckets; bucket N is guarded by lock N mod 16. Assuming the hash function provides reasonable spreading characteristics and keys are accessed uniformly, this should reduce the demand for any given lock by approximately a factor of 16. It is this technique that enables ConcurrentHashMap to support up to 16 concurrent writers. (The number of locks could be increased to provide even better concurrency under heavy access on high-processor-count systems, but the number of stripes should be increased beyond the default of 16 only when you have evidence that concurrent writers are generating enough contention to warrant raising the limit.)
One of the downsides of lock striping is that locking the collection for exclusive access is more difficult and costly than with a single lock. Usually an operation can be performed by acquiring at most one lock, but occasionally you need to lock the entire collection, as when ConcurrentHashMap needs to expand the map and rehash the values into a larger set of buckets. This is typically done by acquiring all of the locks in the stripe set.
StripedMap in Listing 11.8 illustrates implementing a hash-based map using lock striping. There are N_LOCKS locks, each guarding a subset of the buckets. Most methods, like get, need acquire only a single bucket lock. Some methods may need to acquire all the locks but, as in the implementation for clear, may not need to acquire them all simultaneously.
Avoiding Hot Fields
Lock splitting and lock striping can improve scalability because they enable different threads to operate on different data (or different portions of the same data structure) without interfering with each other. A program that would benefit from lock splitting necessarily exhibits contention for a lock more often than for the data guarded by that lock. If a lock guards two independent variables X and Y, and thread A wants to access X while B wants to access Y (as would be the case if one thread called addUser while another called addQuery in ServerStatus), then the two threads are not contending for any data, even though they are contending for a lock.
Listing 11.8. Hash-based Map Using Lock Striping.
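The StripedMap code is not reproduced in this post. The sketch below follows the earlier description of it: N_LOCKS locks, each guarding a subset of the buckets, a get that takes a single lock, and a clear that takes the locks one at a time. The Node class, the hash helper, and the put method are assumptions added so the example is complete.

```java
class StripedMap {
    // Synchronization policy: buckets[n] is guarded by locks[n % N_LOCKS].
    private static final int N_LOCKS = 16;
    private final Node[] buckets;
    private final Object[] locks;

    private static final class Node {
        final Object key;
        Object value;
        final Node next;
        Node(Object key, Object value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    public StripedMap(int numBuckets) {
        buckets = new Node[numBuckets];
        locks = new Object[N_LOCKS];
        for (int i = 0; i < N_LOCKS; i++) {
            locks[i] = new Object();
        }
    }

    // Maps a key to a bucket index in [0, buckets.length).
    private int hash(Object key) {
        return Math.abs(key.hashCode() % buckets.length);
    }

    public Object get(Object key) {
        int hash = hash(key);
        synchronized (locks[hash % N_LOCKS]) {   // one stripe lock only
            for (Node m = buckets[hash]; m != null; m = m.next) {
                if (m.key.equals(key)) {
                    return m.value;
                }
            }
        }
        return null;
    }

    // Assumed for illustration; returns the previous value or null.
    public Object put(Object key, Object value) {
        int hash = hash(key);
        synchronized (locks[hash % N_LOCKS]) {
            for (Node m = buckets[hash]; m != null; m = m.next) {
                if (m.key.equals(key)) {
                    Object old = m.value;
                    m.value = value;
                    return old;
                }
            }
            buckets[hash] = new Node(key, value, buckets[hash]);
            return null;
        }
    }

    // Touches every bucket, but never holds more than one lock at a time.
    public void clear() {
        for (int i = 0; i < buckets.length; i++) {
            synchronized (locks[i % N_LOCKS]) {
                buckets[i] = null;
            }
        }
    }
}
```

Because clear releases each stripe lock before taking the next, it is not atomic with respect to concurrent writers: another thread may insert into an already-cleared bucket before clear returns.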
Lock granularity cannot be reduced when there are variables that are required for every operation. This is yet another area where raw performance and scalability are often at odds with each other; common optimizations such as caching frequently computed values can introduce "hot fields" that limit scalability.
If you were implementing HashMap, you would have a choice of how size computes the number of entries in the Map. The simplest approach is to count the number of entries every time it is called. A common optimization is to update a separate counter as entries are added or removed; this slightly increases the cost of a put or remove operation to keep the counter up-to-date, but reduces the cost of the size operation from O(n) to O(1).
Keeping a separate count to speed up operations like size and isEmpty works fine for a single-threaded or fully synchronized implementation, but makes it much harder to improve the scalability of the implementation because every operation that modifies the map must now update the shared counter. Even if you use lock striping for the hash chains, synchronizing access to the counter reintroduces the scalability problems of exclusive locking. What looked like a performance optimization (caching the results of the size operation) has turned into a scalability liability. In this case, the counter is called a hot field because every mutative operation needs to access it.
ConcurrentHashMap avoids this problem by having size enumerate the stripes and add up the number of elements in each stripe, instead of maintaining a global count. To avoid enumerating every element, ConcurrentHashMap maintains a separate count field for each stripe, also guarded by the stripe lock.
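The per-stripe counting idea can be sketched in isolation. This is an illustration of the technique only, not the actual ConcurrentHashMap source; the class and method names are hypothetical.

```java
class StripedSizeCounter {
    private static final int N_LOCKS = 16;
    private final Object[] locks = new Object[N_LOCKS];
    // counts[i] is guarded by locks[i]; there is no global counter.
    private final int[] counts = new int[N_LOCKS];

    StripedSizeCounter() {
        for (int i = 0; i < N_LOCKS; i++) {
            locks[i] = new Object();
        }
    }

    private int stripeFor(Object key) {
        return Math.abs(key.hashCode() % N_LOCKS);
    }

    // Called when an entry for key is added; touches one stripe only.
    void entryAdded(Object key) {
        int s = stripeFor(key);
        synchronized (locks[s]) {
            counts[s]++;
        }
    }

    // Called when an entry for key is removed; touches one stripe only.
    void entryRemoved(Object key) {
        int s = stripeFor(key);
        synchronized (locks[s]) {
            counts[s]--;
        }
    }

    // Sums the per-stripe counts, taking each stripe lock in turn.
    int size() {
        int total = 0;
        for (int i = 0; i < N_LOCKS; i++) {
            synchronized (locks[i]) {
                total += counts[i];
            }
        }
        return total;
    }
}
```

Writers on different stripes never contend on a shared counter. The price is that size costs one lock acquisition per stripe, and since the locks are not all held at once, the total may already be out of date when it is returned if other threads are modifying the map.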
Alternatives to Exclusive Locks
A third technique for mitigating the effect of lock contention is to forgo the use of exclusive locks in favor of a more concurrency-friendly means of managing shared state. These include using the concurrent collections, read-write locks, immutable objects, and atomic variables.
ReadWriteLock (see Chapter 13) enforces a multiple-reader, single-writer locking discipline: more than one reader can access the shared resource concurrently so long as none of them wants to modify it, but writers must acquire the lock exclusively. For read-mostly data structures, ReadWriteLock can offer greater concurrency than exclusive locking; for read-only data structures, immutability can eliminate the need for locking entirely.
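A minimal sketch of the read-write discipline, wrapping a plain HashMap with a ReentrantReadWriteLock from java.util.concurrent.locks. The class name ReadMostlyMap is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ReadMostlyMap<K, V> {
    private final Map<K, V> map = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Lock readLock = lock.readLock();
    private final Lock writeLock = lock.writeLock();

    // Many readers may hold the read lock at the same time.
    public V get(K key) {
        readLock.lock();
        try {
            return map.get(key);
        } finally {
            readLock.unlock();
        }
    }

    // A writer holds the write lock exclusively.
    public V put(K key, V value) {
        writeLock.lock();
        try {
            return map.put(key, value);
        } finally {
            writeLock.unlock();
        }
    }
}
```

Whether this beats a plain exclusive lock depends on the read-to-write ratio and the platform, so it is worth measuring before adopting it.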
Atomic variables (see Chapter 15) offer a means of reducing the cost of updating "hot fields" such as statistics counters, sequence generators, or the reference to the first node in a linked data structure. (We used AtomicLong to maintain the hit counter in the servlet examples in Chapter 2.) The atomic variable classes provide very fine-grained (and therefore more scalable) atomic operations on integers or object references, and are implemented using low-level concurrency primitives (such as compare-and-swap) provided by most modern processors. If your class has a small number of hot fields that do not participate in invariants with other variables, replacing them with atomic variables may improve scalability. (Changing your algorithm to have fewer hot fields might improve scalability even more: atomic variables reduce the cost of updating hot fields, but they don't eliminate it.)
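As a small illustration of the statistics-counter case, here is a hit counter backed by AtomicLong. The class name HitCounter and its methods are hypothetical; only the use of AtomicLong is taken from the text.

```java
import java.util.concurrent.atomic.AtomicLong;

class HitCounter {
    // A hot field updated without any lock.
    private final AtomicLong hits = new AtomicLong();

    // Atomically increments the counter and returns the new value.
    public long recordHit() {
        return hits.incrementAndGet();
    }

    public long getHits() {
        return hits.get();
    }
}
```

This is safe only because the counter does not participate in an invariant with any other variable.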
Monitoring CPU Utilization
When testing for scalability, the goal is usually to keep the processors fully utilized. Tools like vmstat and mpstat on Unix systems or perfmon on Windows systems can tell you just how "hot" the processors are running.
If the CPUs are asymmetrically utilized (some CPUs are running hot but others are not) your first goal should be to find increased parallelism in your program. Asymmetric utilization indicates that most of the computation is going on in a small set of threads, and your application will not be able to take advantage of additional processors.
If the CPUs are not fully utilized, you need to figure out why. There are several likely causes:
If your application is keeping the CPUs sufficiently hot, you can use monitoring tools to infer whether it would benefit from additional CPUs. A program with only four threads may be able to keep a 4-way system fully utilized, but is unlikely to see a performance boost if moved to an 8-way system since there would need to be waiting runnable threads to take advantage of the additional processors. (You may also be able to reconfigure the program to divide its workload over more threads, such as adjusting a thread pool size.) One of the columns reported by vmstat is the number of threads that are runnable but not currently running because a CPU is not available; if CPU utilization is high and there are always runnable threads waiting for a CPU, your application would probably benefit from more processors.
Just Say No to Object Pooling
In early JVM versions, object allocation and garbage collection were slow, but their performance has improved substantially since then. In fact, allocation in Java is now faster than malloc is in C: the common code path for new Object in HotSpot 1.4.x and 5.0 is approximately ten machine instructions.
To work around "slow" object lifecycles, many developers turned to object pooling, where objects are recycled instead of being garbage collected and allocated anew when needed. Even taking into account its reduced garbage collection overhead, object pooling has been shown to be a performance loss for all but the most expensive objects (and a serious loss for light- and medium-weight objects) in single-threaded programs (Click, 2005).
In concurrent applications, pooling fares even worse. When threads allocate new objects, very little inter-thread coordination is required, as allocators typically use thread-local allocation blocks to eliminate most synchronization on heap data structures. But if those threads instead request an object from a pool, some synchronization is necessary to coordinate access to the pool data structure, creating the possibility that a thread will block. Because blocking a thread due to lock contention is hundreds of times more expensive than an allocation, even a small amount of pool-induced contention would be a scalability bottleneck. (Even an uncontended synchronization is usually more expensive than allocating an object.) This is yet another technique intended as a performance optimization but that turned into a scalability hazard. Pooling has its uses, but is of limited utility as a performance optimization.