This is pretty awesome work! Just a quick read so far, but their approach shows great results.
I'll be very curious if the scientific computing community will find uses of 'tensor cores'[1] for more than GEMMs. As this paper indicates, there may be more cases in HPC that can efficiently leverage small matrix multiplication instructions. This might be particularly so as NVIDIA Ampere claims to have extended support to single precision, instead of just the precisions of interest to the machine learning community.
[1] 'Tensor cores' is a poor name, likely from marketing, as the units really only compute matrix products of small, fixed sizes.
JAX is pretty popular in the research community. It also supports automatic gradient calculation and vectorized map, and it targets the GPU via the XLA backend.
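For a rough feel for those three features together, here's a tiny sketch (the toy loss function, names, and shapes are made up for illustration):

    # Tiny JAX sketch: automatic gradients plus a vectorized map, compiled by XLA.
    import jax
    import jax.numpy as jnp

    def loss(w, x):
        # toy scalar loss for a single input x and weights w
        return jnp.sum((x @ w) ** 2)

    grad_loss = jax.grad(loss)                              # gradient with respect to w
    batched_grad = jax.vmap(grad_loss, in_axes=(None, 0))   # vectorized over a batch of x
    fast = jax.jit(batched_grad)                            # compiled via XLA (GPU if available)

    w = jnp.ones(3)
    xs = jnp.arange(12.0).reshape(4, 3)                     # batch of four inputs
    print(fast(w, xs).shape)                                # (4, 3): one gradient per input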
NumPy (and CuPy) provide dense matrices. They're super awesome and certainly very useful for many kinds of problems, but they are not useful for storing adjacency matrices for sparse graphs. That is the point of the paper and the purpose of SuiteSparse and the GraphBLAS.
Dense matrices are great, and their implementation is straightforward: a contiguous chunk of memory holds every element of the matrix, so for an N-by-N square matrix the storage requirement is N^2, and finding an element is a simple matter of indexing math. For large adjacency matrices this is horribly inefficient, and the bigger the graph gets, the worse the cache and memory locality, since most elements end up being zero.
Hypersparse graphs, like, say, a large social network, may have only a few hundred billion edges, but fitting that into a dense adjacency matrix would require quadrillions of mostly empty elements. This is clearly impossible, so sparse matrices are required to store a large graph.
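To put rough numbers on that, here's a back-of-envelope sketch; the vertex and edge counts and the per-element sizes are illustrative assumptions, not figures from the paper:

    # Back-of-envelope storage comparison for an assumed graph with
    # 100 million vertices and 300 billion edges (illustrative numbers only).
    n_vertices = 100_000_000
    n_edges = 300_000_000_000

    # Dense adjacency matrix: one 8-byte value per cell, N^2 cells.
    dense_bytes = n_vertices ** 2 * 8

    # CSR sparse matrix: an 8-byte value and a 4-byte column index per edge,
    # plus an 8-byte row pointer per vertex.
    csr_bytes = n_edges * (8 + 4) + (n_vertices + 1) * 8

    print(f"dense: {dense_bytes / 1e15:,.0f} petabytes")   # ~80 PB
    print(f"CSR:   {csr_bytes / 1e12:,.1f} terabytes")     # ~3.6 TB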
The C++/CUDA backend to cuGraph contains many low-level graph operations on really sparse graph structures as well: https://github.com/rapidsai/cugraph
As recently announced at the HPEC 2020 GraphBLAS BOF, NVIDIA is contributing CUDA support to the SuiteSparse GraphBLAS library for sparse matrices. It would be interesting to see how the TCU work from this paper could influence that contribution!
Sparse matrix multiplication with an adjacency matrix is the same operation as one step of a breadth-first search across a graph. This is why sparse multiplication is so important to optimize.
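To make that concrete, here's a minimal BFS-by-sparse-multiply sketch using plain scipy.sparse rather than GraphBLAS, so an ordinary arithmetic product plus a threshold stands in for a boolean semiring (the tiny example graph is made up):

    # One BFS step == sparse matrix-vector product of the (transposed)
    # adjacency matrix with the current frontier, masked by the visited set.
    import numpy as np
    from scipy.sparse import csr_matrix

    # Toy directed graph with edges 0->1, 0->2, 1->3, 2->3.
    rows, cols = [0, 0, 1, 2], [1, 2, 3, 3]
    A = csr_matrix((np.ones(4), (rows, cols)), shape=(4, 4))

    frontier = np.zeros(4)
    frontier[0] = 1.0                       # start the BFS at vertex 0
    visited = frontier.astype(bool)

    while frontier.any():
        reached = (A.T @ frontier) > 0      # vertices reachable in one hop
        new = reached & ~visited            # drop vertices we've already seen
        visited |= new
        frontier = new.astype(float)

    print(visited)                          # all four vertices reachable from 0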
Dense matrices with N^2 space requirements simply can't hold or work with very large graphs, whereas the sparse (order N) and hypersparse (<< order N) techniques in GraphBLAS mean much larger graph networks can be stored.
A great video on the mathematical foundations of the GraphBLAS can be seen here:
https://www.youtube.com/watch?v=gZSNp6XbOK8
And to plug my own work, here's an introductory video on using the GraphBLAS with Python:
https://www.youtube.com/watch?v=JUbXW_f03W0