WG211/M23Kiselyov
The Mysteries of AXPY (Oleg Kiselyov)
AXPY is one of the Basic Linear Algebra Subprograms (BLAS) vector operations: the scaled vector addition aX+Y, where a is a scalar and X and Y are vectors. It is a perfect target for classical optimizations like partial loop unrolling and scalar promotion. (AXPY is also embarrassingly parallel; however, this talk focuses on single-thread performance.) These optimizations are indeed carried out, by hand, in OpenBLAS, regarded as one of the two fastest BLAS implementations. One can make a case for automatic code generation to reduce the tedium of applying such optimizations, given that there are many platforms and several AXPY varieties to optimize: SAXPY, DAXPY, CAXPY. This is the traditional elevator pitch about metaprogramming in HPC.
How does it correspond to real life, in this day and age?
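For readers unfamiliar with the optimizations named in the abstract, the following is a minimal sketch in C of a double-precision AXPY with unit stride, first as a plain loop and then partially unrolled by four with the intermediate results held in local scalars. The function names `daxpy_ref` and `daxpy_unrolled4`, the unrolling factor, and the use of `restrict` are assumptions made for illustration; this is not the OpenBLAS kernel, nor code from the talk.

```c
#include <stddef.h>

/* Reference DAXPY: y[i] = a*x[i] + y[i] for unit-stride vectors. */
void daxpy_ref(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Hypothetical hand-optimized variant: the loop body is unrolled four
 * times, and each result is computed into a local scalar before being
 * stored back. `restrict` asserts that x and y do not overlap, which
 * is an assumption of this sketch. */
void daxpy_unrolled4(size_t n, double a,
                     const double *restrict x, double *restrict y)
{
    size_t i = 0;
    const size_t n4 = n - n % 4;   /* largest multiple of 4 not exceeding n */

    for (; i < n4; i += 4) {
        const double y0 = y[i]     + a * x[i];
        const double y1 = y[i + 1] + a * x[i + 1];
        const double y2 = y[i + 2] + a * x[i + 2];
        const double y3 = y[i + 3] + a * x[i + 3];
        y[i]     = y0;
        y[i + 1] = y1;
        y[i + 2] = y2;
        y[i + 3] = y3;
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        y[i] += a * x[i];
}
```

Both functions compute the same result for non-overlapping inputs. Writing the unrolled variant by hand for every element type and platform is the tedium that the case for code generation refers to.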