June 22, 2011, 10:01 p.m.

posted by jackexe

### Using Metaprograms to Unroll Loops

One of the first practical applications of metaprogramming was the unrolling of loops for numeric computations, which is shown here as a complete example.

Numeric applications often have to process n-dimensional arrays or mathematical vectors. One typical operation is the computation of the so-called dot product. The dot product of two mathematical vectors `a` and `b` is the sum of all products of corresponding elements in both vectors. For example, if each vectors has three elements, the result is

a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

A mathematical library typically provides a function to compute such a dot product. Consider the following straightforward implementation:

// meta/loop1.hpp #ifndef LOOP1_HPP #define LOOP1_HPP template <typename T> inline T dot_product (int dim, T* a, T* b) { T result = 0; for (int i=0; i<dim; ++i) { result += a[i]*b[i]; } return result; } #endif // LOOP1_HPP

When we call this function as follows

```
// meta/loop1.cpp
#include <iostream>
#include "loop1.hpp"
int main()
{
int a[3] = { 1, 2, 3 };
int b[3] = { 5, 6, 7 };
std::cout << "dot_product(3,a,b) = " << dot_product(3,a,b)
<< '\n';
std::cout << "dot_product(3,a,a) = " << dot_product(3,a,a)
<< '\n';
}
```

we get the following result:

dot_product(3,a,b) = 38 dot_product(3,a,a) = 14

This is correct, but it takes too long for serious high-performance applications. Even declaring the function inline is often not sufficient to attain optimal performance.

The problem is that compilers usually optimize loops for many iterations, which is counterproductive in this case. Simply expanding the loop to

a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

would be a lot better.

Of course, this performance doesn't matter if we compute only some dot products from time to time. But, if we use this library component to perform millions of dot product computations, the differences become significant.

Of course, we could write the computation directly instead of calling `dot_product()`, or we could provide special functions for dot product computations with only a few dimensions, but this is tedious. Template metaprogramming solves this issue for us: We "program" to unroll the loops. Here is the metaprogram:

// meta/loop2.hpp #ifndef LOOP2_HPP #define LOOP2_HPP // primary template template <int DIM, typename T> class DotProduct { public: static T result (T* a, T* b) { return *a * *b + DotProduct<DIM-1,T>::result(a+1,b+1); } }; // partial specialization as end criteria template <typename T> class DotProduct<1,T> { public: static T result (T* a, T* b) { return *a * *b; } }; // convenience function template <int DIM, typename T> inline T dot_product (T* a, T* b) { return DotProduct<DIM,T>::result(a,b); } #endif // LOOP2_HPP

Now, by changing your application program only slightly, you can get the same result:

```
// meta/loop2.cpp
#include <iostream>
#include "loop2.hpp"
```

int main() { int a[3] = { 1, 2, 3}; int b[3] = { 5, 6, 7}; std::cout << "dot_product<3>(a,b) = " << dot_product<3>(a,b) << '\n'; std::cout << "dot_product<3>(a,a) = " << dot_product<3>(a,a) << '\n'; }

Instead of writing

dot_product(3,a,b)

we write

dot_product<3>(a,b)

This expression instantiates a convenience function template that translates the call into

DotProduct<3,int>::result(a,b)

And this is the start of the metaprogram.

Inside the metaprogram the `result` is the product of the first elements of `a` and `b` plus the `result` of the dot product of the remaining dimensions of the vectors starting with their next elements:

template <int DIM, typename T> class DotProduct { public: static T result (T* a, T* b) { return *a * *b + DotProduct<DIM-1,T>::result(a+1,b+1); } };

The end criterion is the case of a one-dimensional vector:

template <typename T> class DotProduct<1,T> { public: static T result (T* a, T* b) { return *a * *b; } };

Thus, for

dot_product<3>(a,b)

the instantiation process computes the following:

DotProduct<3,int>::result(a,b) = *a * *b + DotProduct<2,int>::result(a+1,b+1) = *a * *b + *(a+1) * *(b+1) + DotProduct<1,int>::result(a+2,b+2) = *a * *b + *(a+1) * *(b+1) + *(a+2) * *(b+2)

Note that this way of programming requires that the number of dimensions is known at compile time, which is often (but not always) the case.

Libraries, such as Blitz++ (see [Blitz++]), the MTL library (see [MTL]), and POOMA (see [POOMA]), use these kinds of metaprograms to provide fast routines for numeric linear algebra. Such metaprograms often do a better job than optimizers because they can integrate higher-level knowledge into the computations.
^{[2]} The industrial-strength implementation of such libraries involves many more details than the template-related issues we present here. Indeed, reckless unrolling does not always lead to optimal running times. However, these additional engineering considerations fall outside the scope of our text.

^{[2]}In some situations metaprograms significantly outperform their Fortran counterparts, even though Fortran optimizers are usually highly tuned for these sorts of applications.

- Comment