Conventional wisdom has it that dense matrix-matrix multiply is generally faster if you use some cache-aware blocking scheme instead of the textbook three nested for loops.
Is there ever a similar story for sparse matrices?
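For reference, a minimal C++ sketch of the two dense variants the question contrasts; the block size of 64 is an illustrative choice, not a tuned value:

```cpp
#include <algorithm>

// Textbook triple loop: C += A * B for n x n row-major matrices.
void matmul_naive(int n, const double* A, const double* B, double* C) {
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      for (int k = 0; k < n; k++)
        C[i*n + j] += A[i*n + k] * B[k*n + j];
}

// Cache-aware variant: walk BS x BS tiles so each tile of A, B, and C
// stays resident in cache while it is reused.
void matmul_tiled(int n, const double* A, const double* B, double* C) {
  const int BS = 64;  // illustrative block size
  for (int ii = 0; ii < n; ii += BS)
    for (int jj = 0; jj < n; jj += BS)
      for (int kk = 0; kk < n; kk += BS)
        for (int i = ii; i < std::min(ii + BS, n); i++)
          for (int j = jj; j < std::min(jj + BS, n); j++)
            for (int k = kk; k < std::min(kk + BS, n); k++)
              C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```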
The story for sparse matrices is more complicated. Because of the dependencies imposed by sparse data structures (you don't have random access and often have to traverse them), you cannot do loop tiling without adding if statements (I think you can probably get rid of these by precomputing, e.g., row halfway points). It may still make sense, but there's a bigger cost. However, you can lay out the data in a tiled way to get better cache blocking; taco lets you do this by storing a blocked matrix as a 4-tensor, as in the sketch below.
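As a rough illustration, here is what that blocked layout might look like with taco's documented C++ API. The format ordering (dense block rows, sparse block columns, dense inner blocks, i.e., a BCSR-like layout) and all sizes are illustrative choices:

```cpp
#include "taco.h"
using namespace taco;

int main() {
  int nb = 256, b = 4;  // a 1024 x 1024 matrix stored in 4 x 4 blocks

  // 4-tensor layout: (block row, block col, row in block, col in block).
  // Sparse block columns skip empty blocks; the inner blocks are dense.
  Format bcsr({Dense, Sparse, Dense, Dense});
  Format dv({Dense, Dense});

  Tensor<double> A("A", {nb, nb, b, b}, bcsr);
  Tensor<double> x("x", {nb, b}, dv);
  Tensor<double> y("y", {nb, b}, dv);

  A.insert({0, 0, 1, 2}, 3.0);  // one nonzero inside block (0,0)
  A.pack();
  x.insert({0, 2}, 1.0);
  x.pack();

  // Blocked SpMV: I,K range over blocks; i,k range within a block.
  IndexVar I("I"), K("K"), i("i"), k("k");
  y(I, i) = A(I, K, i, k) * x(K, k);

  y.compile();
  y.assemble();
  y.compute();
}
```

The two inner dense modes make each stored block a small dense tile, so the innermost generated loops get dense-code cache behavior.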
Because of the lack of random access, algorithms like linear-combination-of-rows matrix-matrix multiplication benefit from adding a dense workspace that gives you a view into one row of the result. You can scatter into it and, when the row is done, copy the nonzeros over to the sparse result matrix. This is sometimes called Gustavson's algorithm (after the person who first published it). We have worked out this optimization within the taco framework (it applies to other kernels too) and are now writing it up for publication.
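A minimal standalone sketch of the workspace idea over CSR; this is a hand-written illustration, not the code taco generates, and the `CSR` struct and names are hypothetical:

```cpp
#include <vector>

// Minimal CSR matrix: row pointers, column indices, values.
struct CSR {
  int rows, cols;
  std::vector<int> pos;      // size rows+1; row i spans [pos[i], pos[i+1])
  std::vector<int> crd;      // column index per nonzero
  std::vector<double> vals;  // value per nonzero
};

// Gustavson-style SpGEMM: for each row i of A, scatter scaled rows of B
// into a dense workspace, then gather the nonzeros into row i of C.
CSR spgemm(const CSR& A, const CSR& B) {
  CSR C{A.rows, B.cols, {0}, {}, {}};
  std::vector<double> workspace(B.cols, 0.0);  // dense view of one row of C
  std::vector<bool> occupied(B.cols, false);   // which entries are set
  std::vector<int> nzlist;                     // their column indices

  for (int i = 0; i < A.rows; i++) {
    nzlist.clear();
    for (int pA = A.pos[i]; pA < A.pos[i + 1]; pA++) {
      int k = A.crd[pA];
      double a = A.vals[pA];
      // Linear combination: add a * B(k,:) into the workspace.
      for (int pB = B.pos[k]; pB < B.pos[k + 1]; pB++) {
        int j = B.crd[pB];
        if (!occupied[j]) { occupied[j] = true; nzlist.push_back(j); }
        workspace[j] += a * B.vals[pB];
      }
    }
    // Gather: copy the nonzeros out and reset the workspace.
    // (Columns within a row come out unsorted, as is usual here.)
    for (int j : nzlist) {
      C.crd.push_back(j);
      C.vals.push_back(workspace[j]);
      workspace[j] = 0.0;
      occupied[j] = false;
    }
    C.pos.push_back(static_cast<int>(C.crd.size()));
  }
  return C;
}
```

The dense workspace restores the random access the sparse output format lacks: scatter and accumulate are O(1) per update, and only the final gather touches the sparse result.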