WG211/M13Kamin
I'll be giving an update on our work on optimizing SpMV (sparse matrix/vector) multiplication by specialization w.r.t the matrix. I'll review some of the methods we've used for generating code. We have new timings on both run time and compile time on multi-core. Parallel speed-ups are given relative to Intel's MKL library. Parallel times are quite good; on every machine and matrix, at least one of our generative methods is faster than MKL. As previously reported in the sequential case, the best method varies according to both machine and matrix, and our effort to quickly determine the best method in a particular situation is on-going work. However, the parallel case gives us different results from the sequential; even though the method of achieving parallelism is just to split the matrix into sets of rows, methods that work well sequentially don't necessarily work well in parallel. For example, simple unfolding of the multiplication loop is much more often a good method in the parallel case; similarly, the method we call "genOSKI" - a generative version of the register-blocking method of the OSKI project - is almost never a good performer on sequential code, but often is the best in parallel.
This is work I was doing with Maria Garzaran (Illinois) and Baris Aktemur (Ozyegin Univ., Istanbul) when I was at Illinois. I am now at Google. Much of this talk - in particular, the timings - is about results obtained by them since my departure last summer.