April 5, 2011, 11:26 a.m.
posted by datamaker
Avoiding Performance Testing Pitfalls
In theory, developing performance tests is easyfind a typical usage scenario, write a program that executes that scenario many times, and time it. In practice, you have to watch out for a number of coding pitfalls that prevent performance tests from yielding meaningful results.
The timing of garbage collection is unpredictable, so there is always the possibility that the garbage collector will run during a measured test run. If a test program performs N iterations and triggers no garbage collection but iteration N + 1would trigger a garbage collection, a small variation in the size of the run could have a big (but spurious) effect on the measured time per iteration.
There are two strategies for preventing garbage collection from biasing your results. One is to ensure that garbage collection does not run at all during your test (you can invoke the JVM with -verbose:gc to find out); alternatively, you can make sure that the garbage collector runs a number of times during your run so that the test program adequately reflects the cost of ongoing allocation and garbage collection. The latter strategy is often betterit requires a longer test and is more likely to reflect real-world performance.
Most producer-consumer applications involve a fair amount of allocation and garbage collectionproducers allocate new objects that are used and discarded by consumers. Running the bounded buffer test for long enough to incur multiple garbage collections yields more accurate results.
Writing and interpreting performance benchmarks for dynamically compiled languages like Java is far more difficult than for statically compiled languages like C or C++. The HotSpot JVM (and other modern JVMs) uses a combination of bytecode interpretation and dynamic compilation. When a class is first loaded, the JVM executes it by interpreting the bytecode. At some point, if a method is run often enough, the dynamic compiler kicks in and converts it to machine code; when compilation completes, it switches from interpretation to direct execution.
The timing of compilation is unpredictable. Your timing tests should run only after all code has been compiled; there is no value in measuring the speed of the interpreted code since most programs run long enough that all frequently executed code paths are compiled. Allowing the compiler to run during a measured test run can bias test results in two ways: compilation consumes CPU resources, and measuring the run time of a combination of interpreted and compiled code is not a meaningful performance metric. Figure shows how this can bias your results. The three timelines represent the execution of the same number of iterations: timeline A represents all interpreted execution, B represents compilation halfway through the run, and C represents compilation early in the run. The point at which compilation runs seriously affects the measured per-operation runtime.
5. Results Biased by Dynamic Compilation.
Code may also be decompiled (reverting to interpreted execution) and recompiled for various reasons, such as loading a class that invalidates assumptions made by prior compilations, or gathering sufficient profiling data to decide that a code path should be recompiled with different optimizations.
One way to to prevent compilation from biasing your results is to run your program for a long time (at least several minutes) so that compilation and interpreted execution represent a small fraction of the total run time. Another approach is to use an unmeasured "warm-up" run, in which your code is executed enough to be fully compiled when you actually start timing. On HotSpot, running your program with -XX:+PrintCompilation prints out a message when dynamic compilation runs, so you can verify that this is prior to, rather than during, measured test runs.
Running the same test several times in the same JVM instance can be used to validate the testing methodology. The first group of results should be discarded as warm-up; seeing inconsistent results in the remaining groups suggests that the test should be examined further to determine why the timing results are not repeatable.
The JVM uses various background threads for housekeeping tasks. When measuring multiple unrelated computationally intensive activities in a single run, it is a good idea to place explicit pauses between the measured trials to give the JVM a chance to catch up with background tasks with minimal interference from measured tasks. (When measuring multiple related activities, however, such as multiple runs of the same test, excluding JVM background tasks in this way may give unrealistically optimistic results.)
Unrealistic Sampling of Code Paths
Runtime compilers use profiling information to help optimize the code being compiled. The JVM is permitted to use information specific to the execution in order to produce better code, which means that compiling method M in one program may generate different code than compiling M in another. In some cases, the JVM may make optimizations based on assumptions that may only be true temporarily, and later back them out by invalidating the compiled code if they become untrue.
As a result, it is important that your test programs not only adequately approximate the usage patterns of a typical application, but also approximate the set of code paths used by such an application. Otherwise, a dynamic compiler could make special optimizations to a purely single-threaded test program that could not be applied in real applications containing at least occasional parallelism. Therefore, tests of multithreaded performance should normally be mixed with tests of single-threaded performance, even if you want to measure only singlethreaded performance. (This issue does not arise in TimedPutTakeTest because even the smallest test case uses two threads.)
Unrealistic Degrees of Contention
Concurrent applications tend to interleave two very different sorts of work: accessing shared data, such as fetching the next task from a shared work queue, and thread-local computation (executing the task, assuming the task itself does not access shared data). Depending on the relative proportions of the two types of work, the application will experience different levels of contention and exhibit different performance and scaling behaviors.
If N threads are fetching tasks from a shared work queue and executing them, and the tasks are compute-intensive and long-running (and do not access shared data very much), there will be almost no contention; throughput is dominated by the availability of CPU resources. On the other hand, if the tasks are very short-lived, there will be a lot of contention for the work queue and throughput is dominated by the cost of synchronization.
To obtain realistic results, concurrent performance tests should try to approximate the thread-local computation done by a typical application in addition to the concurrent coordination under study. If the the work done for each task in an application is significantly different in nature or scope from the test program, it is easy to arrive at unwarranted conclusions about where the performance bottlenecks lie. We saw in Section 11.5 that, for lock-based classes such as the synchronized Map implementations, whether access to the lock is mostly contended or mostly uncontended can have a dramatic effect on throughput. The tests in that section do nothing but pound on the Map; even with two threads, all attempts to access the Map are contended. However, if an application did a significant amount of thread-local computation for each time it accesses the shared data structure, the contention level might be low enough to offer good performance.
In this regard, TimedPutTakeTest may be a poor model for some applications. Since the worker threads do not do very much, throughput is dominated by coordination overhead, and this is not necessarily the case in all applications that exchange data between producers and consumers via bounded buffers.
Dead Code Elimination
One of the challenges of writing good benchmarks (in any language) is that optimizing compilers are adept at spotting and eliminating dead codecode that has no effect on the outcome. Since benchmarks often don't compute anything, they are an easy target for the optimizer. Most of the time, it is a good thing when the optimizer prunes dead code from a program, but for a benchmark this is a big problem because then you are measuring less execution than you think. If you're lucky, the optimizer will prune away your entire program, and then it will be obvious that your data is bogus. If you're unlucky, dead-code elimination will just speed up your program by some factor that could be explained by other means.
Dead-code elimination is a problem in benchmarking statically compiled languages too, but detecting that the compiler has eliminated a good chunk of your benchmark is a lot easier because you can look at the machine code and see that a part of your program is missing. With dynamically compiled languages, that information is not easily accessible.
Many microbenchmarks perform much "better" when run with HotSpot's -server compiler than with -client, not just because the server compiler can produce more efficient code, but also because it is more adept at optimizing dead code. Unfortunately, the dead-code elimination that made such short work of your benchmark won't do quite as well with code that actually does something. But you should still prefer -server to -client for both production and testing on multiprocessor systemsyou just have to write your tests so that they are not susceptible to dead-code elimination.
In PutTakeTest, we compute the checksum of elements added to and removed from the queue and combine these checksums across all the threads, but this could still be optimized away if we do not actually use the checksum value. We happen to need it to verify the correctness of the algorithm, but you can ensure that a value is used by printing it out. However, you should avoid doing I/O while the test is actually running, so as not to distort the run time measurement.
A cheap trick for preventing a calculation from being optimized away without introducing too much overhead is to compute the hashCode of the field of some derived object, compare it to an arbitrary value such as the current value of System. nanoTime, and print a useless and ignorable message if they happen to match:
if (foo.x.hashCode() == System.nanoTime()) System.out.print(" ");
The comparison will rarely succeed, and if it does, its only effect will be to insert a harmless space character into the output. (The print method buffers output until println is called, so in the rare case that hashCode and System.nanoTime are equal no I/O is actually performed.)
Not only should every computed result be used, but results should also be unguessable. Otherwise, a smart dynamic optimizing compiler is allowed to replace actions with precomputed results. We addressed this in the construction of PutTakeTest, but any test program whose input is static data is vulnerable to this optimization.