April 13, 2011, 1:17 p.m.
posted by technodev
Blitz++ and Expression Templates

Blitz++ [Veld95a], a library for high-performance array math, pioneered so many of the techniques and ideas used in this book that it would be hard to overestimate its influence on the world of C++ metaprogramming. It was the first C++ library to use explicit metaprogramming,^{[5]} and the first to implement a domain-specific embedded language. We can't possibly touch on all aspects of Blitz++, so we're going to look at the central innovation: expression templates [Veld95b].
The Problem

If we had to boil the problem solved by Blitz++ down to a single sentence, we'd say, "A naive implementation of array math is horribly inefficient for any interesting computation." To see what we mean, take the boring statement

    x = a + b + c;

where x, a, b, and c are all two-dimensional Arrays. The canonical implementation of Array's addition operator is:

```cpp
Array operator+(Array const& a, Array const& b)
{
    std::size_t const n = a.size();
    Array result;

    for (std::size_t row = 0; row != n; ++row)
        for (std::size_t col = 0; col != n; ++col)
            result[row][col] = a[row][col] + b[row][col];

    return result;
}
```

To evaluate the expression a + b + c using that operator, we first compute a + b, resulting in a temporary Array (call it t), and then we evaluate t + c to produce the final result. The problem is that temporary, t. The efficient way to perform this computation is to step through each position in all three input arrays at once, adding the three elements at that position and placing their sum in the result:

```cpp
for (std::size_t row = 0; row != n; ++row)
    for (std::size_t col = 0; col != n; ++col)
        result[row][col] = a[row][col] + b[row][col] + c[row][col];
```
The temporary not only costs an extra dynamic memory allocation for its element storage, but also causes the CPU to make two complete traversals of that storage: one to write the result of a + b, and another to read the input for t + c. As anyone who has done high-performance numerics knows, these two traversals are the real killer, because they destroy cache locality: if all four of the named arrays nearly fill the cache, introducing t effectively pushes one of them out. The problem here is that the operator+ signature above is just too greedy: it tries to evaluate a + b as soon as it can, rather than waiting until the whole expression, including the addition of c, is available.

Expression Templates

In the expression's parse tree, evaluation starts at the leaves and proceeds upward to the root. What's needed here is some way of delaying evaluation until the library has all of the expression's parts: that is, until the assignment operator is executed. The stratagem taken by Blitz++ is to build a replica of the compiler's parse tree for the whole expression, allowing it to manage evaluation from the top down (see figure).

Figure. Parse tree for x = a + b + c
This can't be any ordinary parse tree, though. Since array expressions may involve other operations, like multiplication, that require their own evaluation strategies, and since expressions can be arbitrarily large and nested, a parse tree built with nodes and pointers would have to be traversed at runtime by the Blitz++ evaluation engine to discover its structure, limiting performance. Furthermore, Blitz++ would have to use some kind of runtime dispatching to handle the different combinations of operation types, again limiting performance. Instead, Blitz++ builds a compile-time parse tree out of expression templates. Here's how it works in a nutshell: instead of returning a newly computed Array, operators just package up references to their arguments in an Expression instance, labeled with the operation:

```cpp
// operation tags
struct plus;
struct minus;

// expression tree node
template <class L, class OpTag, class R>
struct Expression
{
    Expression(L const& l, R const& r)
      : l(l), r(r)
    {}

    float operator[](unsigned index) const;

    L const& l;
    R const& r;
};

// addition operator
template <class L, class R>
Expression<L,plus,R> operator+(L const& l, R const& r)
{
    return Expression<L,plus,R>(l, r);
}
```

Notice that when we write a + b, we still have all the information needed to do the computation: it's encoded in the type Expression<Array,plus,Array>, and the data is accessible through the expression's stored references. When we write a + b + c, we get a result of type

    Expression<Expression<Array,plus,Array>,plus,Array>

and the data is still accessible through the nested references. The interesting thing about the Expression class template is that, just like an Array, it supports indexing via operator[]. But wait! Didn't we just tell you that operator+ computes nothing, and that Expression just stores references to its arguments? Yes, we did. Since the result of the operation isn't stored in the Expression, it must be computed lazily by operator[].
To see how it works, check out this simplified implementation for one-dimensional Arrays of floats. First, to associate the elementwise arithmetic logic with the operation tags, we'll nest some static member functions:

```cpp
// operation tags implement elementwise arithmetic
struct plus
{
    static float apply(float a, float b) { return a + b; }
};

struct minus
{
    static float apply(float a, float b) { return a - b; }
};
```

Next, we'll give the Expression an indexing operator that calls its tag's apply function to compute the appropriate element value:

```cpp
// expression tree node
template <class L, class OpTag, class R>
struct Expression
{
    Expression(L const& l, R const& r)
      : l(l), r(r)
    {}

    float operator[](unsigned index) const
    {
        return OpTag::apply(l[index], r[index]);
    }

    L const& l;
    R const& r;
};
```

This seems almost too simple, right? Amazingly, we now have fully lazy expression evaluation. To see it at work, let's walk through the evaluation of (a + b)[1]. Since the type of a + b is Expression<Array,plus,Array>, we have:

    (a + b)[1]
 == plus::apply(a[1], b[1])
 == a[1] + b[1]

Now consider what we'd go through to evaluate the same expression with a greedy strategy. That's right, we'd have to compute a temporary array, (a + b), only to throw out all but one element! The contrast in efficiency couldn't be more striking. Naturally, (a + b + c)[1] is also computed without any temporary Arrays:

    (a + b + c)[1]
 == ((a + b) + c)[1]
 == plus::apply((a + b)[1], c[1])
 == plus::apply(plus::apply(a[1], b[1]), c[1])
 == plus::apply(a[1] + b[1], c[1])
 == (a[1] + b[1]) + c[1]

All that remains now is to implement Array's assignment operator. Since we can access any single element of the result Expression without ever creating a temporary Array, we can compute the whole result by accessing every element of the expression:

```cpp
template <class Expr>
Array& Array::operator=(Expr const& x)
{
    for (unsigned i = 0; i < this->size(); ++i)
        (*this)[i] = x[i];
    return *this;
}
```

That's it!
Naturally, there's a lot more to array math than addition and subtraction, and Blitz++ has to consider all kinds of things that are not handled by our simple example, from operations like multiplication to "tiling" large array operations so that they stay within the cache. The basic technique of delaying expression evaluation, however, is the tool that allows the library to do all these things with near-optimal efficiency.^{[6]}
As a DSL, this part of Blitz++ is deceptive in its smoothness: The syntax looks exactly as you'd expect it to in a naive implementation, but you can see that behind the syntax lives a highly specialized evaluation engine, tuned for the Blitz++ domain.
More Blitz++ Magic

Just in case you find it hard to see the domain-specific language in what we've already covered, here are a couple more of Blitz++'s syntactic innovations that we think you'll find more striking.

3.1 Array Initialization

Because Blitz++ Arrays are not what the C++ standard calls an "aggregate" (see section 8.5.1 of the Standard), we can't use the convenient syntax of listing initializers within braces, as we can with ordinary built-in arrays. Instead, Blitz++ overloads the comma operator to make a similar syntax possible:

```cpp
Array<float,2> A(3,3);

A = 1, 2, 3,
    4, 5, 6,
    7, 8, 9;
```

3.2 SubArray Syntax

Blitz++ has a Range class that encapsulates a sequence of indices. When an Array is indexed using a Range, a lazy SubArray view is produced without copying any elements:^{[7]}
```cpp
// add the first two rows and columns of A to B
B += A(Range(0,2), Range(0,2));
```

The exciting thing about Blitz++'s Range objects is that you can also perform arithmetic on them directly, resulting in expressions that look remarkably like the body of a multiply-nested loop, using a technique pioneered by the Math.h++ library [KV89]. This example is taken from the online Blitz++ manual:

```cpp
// a three-dimensional stencil (used in solving PDEs)
Range I(1,6), J(1,6), K(1,6);

B = (A(I,J,K) + A(I+1,J,K) + A(I-1,J,K)
   + A(I,J+1,K) + A(I,J-1,K)
   + A(I,J,K+1) + A(I,J,K-1)) / 7.0;
```

This sort of notational simplification has proven itself to be more than mere syntactic sugar. Similar techniques have been used to reduce the evaluation of complex tensor equations from unreadable and error-prone code resembling FORTRAN (called "C++tran" by Scott Haney) to one-liners resembling the equations in the original theory [Land01]. Projects that would be practically impossible to complete correctly using C++tran suddenly become tractable with a DSEL.