
Boost : 
From: Joerg Walter (jhr.walter_at_[hidden])
Date: 2002-06-28 00:50:01
----- Original Message -----
From: <lums_at_[hidden]>
To: <boost_at_[hidden]>
Sent: Friday, June 28, 2002 5:36 AM
Subject: Re: [boost] uBLAS formal review
>
>
> Hi:
>
> To follow up on this a little bit.
>
> There are basically four optimizations that need to be incorporated
> into a matrix-matrix product to get max performance (these are
> basically what the vendors do).
>
> The principles behind these optimizations: help the compiler and the
> hardware do their stuff, and code yourself what they can't do. In
> particular, you want to minimize memory accesses (and maximize reuse)
> and use the pipeline(s) in the processor.
>
> 1) Order the loops properly. If you write a matrix-matrix product
> using integer indices (i, j, k, say), there are six possible
> orderings for the loops: ijk, ikj, kij, kji, jki, jik (here we
> assume the inner loop is c(i,j) += a(i,k) * b(k,j)). The best
> ordering will depend on the orientation (row, column) of the
> matrices involved. Having k as the innermost variable could mean
> fewer loads and stores in the inner loop since the access to c(i,j)
> can be hoisted out.
uBLAS only supports two of these.
> 2) Unroll and jam the inner loop. I.e., unroll at the inner loop with
> respect to the two outer indices. This allows the compiler to
> enregister a number of quantities. The load/store count versus the
> flop count is very favorable now. An example of an unrolled and
> jammed matrix-matrix product follows:
>
> for (int i = 0; i < m; i += 4) {
>   for (int j = 0; j < n; j += 2) {
>     double w00 = C[(i  )*ldc + j  ];
>     double w01 = C[(i  )*ldc + j+1];
>     double w10 = C[(i+1)*ldc + j  ];
>     double w11 = C[(i+1)*ldc + j+1];
>     double w20 = C[(i+2)*ldc + j  ];
>     double w21 = C[(i+2)*ldc + j+1];
>     double w30 = C[(i+3)*ldc + j  ];
>     double w31 = C[(i+3)*ldc + j+1];
>     for (int k = 0; k < p; ++k) {
>       w00 += A[(i  )*lda + k] * B[k*ldb + j  ];
>       w01 += A[(i  )*lda + k] * B[k*ldb + j+1];
>       w10 += A[(i+1)*lda + k] * B[k*ldb + j  ];
>       w11 += A[(i+1)*lda + k] * B[k*ldb + j+1];
>       w20 += A[(i+2)*lda + k] * B[k*ldb + j  ];
>       w21 += A[(i+2)*lda + k] * B[k*ldb + j+1];
>       w30 += A[(i+3)*lda + k] * B[k*ldb + j  ];
>       w31 += A[(i+3)*lda + k] * B[k*ldb + j+1];
>     }
>     C[(i  )*ldc + j  ] = w00;
>     C[(i  )*ldc + j+1] = w01;
>     C[(i+1)*ldc + j  ] = w10;
>     C[(i+1)*ldc + j+1] = w11;
>     C[(i+2)*ldc + j  ] = w20;
>     C[(i+2)*ldc + j+1] = w21;
>     C[(i+3)*ldc + j  ] = w30;
>     C[(i+3)*ldc + j+1] = w31;
>   }
> }
uBLAS optionally uses Duff's Device to unroll inner loops.
> 3) Block for cache. Use three more sets of indices to move block by
> block through the matrix and use the core algorithm above to do
> each block. Again, the ordering of the block indices will also
> affect performance.
uBLAS doesn't do any blocking. Blocking is somewhat difficult, because it
would reintroduce temporaries previously eliminated through expression
templates ;(
> 4) Do a block copy when doing a cache-blocked algorithm. This will
> prevent cache conflicts for larger matrix sizes.
>
> These optimizations have effects for different size ranges of
> matrices. Optimizations 1 and 2 above are important for all matrix
> sizes. Optimization 3 becomes important when the matrices won't fit
> in cache. Optimization 4 becomes important when the matrix sizes are
> large enough to cause cache conflicts. If you look at the graph
> posted on http://www.zib.de/weiser/ublas_review.gif, you will see
> drop-offs in performance occurring as the matrix size increases. The
> first drop-off (at about N=50) is due to not having optimization 3, and
> the second (at about N=300) is due to not having optimization 4.
>
> Now, what makes this a touchy process is that different processors
> will require different values for the unrolling and jamming and
> different values for the cache blocking. Different matrix
> orientations can similarly require slightly different algorithms. In
> addition, different compilers respond to different expressions of
> optimizations in different ways.
>
> The ATLAS effort, for instance, is a code generator that explores the
> parameter space of optimizations and picks the best set of
> parameters. In MTL, we incorporated these optimizations (via BLAIS
> and FAST libraries) as template metaprograms so that different
> parameter sets could easily be realized.
>
> If ublas (or any linear algebra library) wants to match vendor-tuned
> (or ATLAS-tuned) performance, it will have to include the
> optimizations described here and it will have to be tunable so that
> different combinations of optimization parameters can be used on
> different architectures.
Maybe it's possible to reuse some BLAS procedural kernels.
Regards
Joerg
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk