This is pretty awesome work! Just a quick read so far, but their approach shows great results.
I'll be very curious if the scientific computing community will find uses of 'tensor cores'[1] for more than GEMMs. As this paper indicates, there may be more cases in HPC that can efficiently leverage small matrix multiplication instructions. This might be particularly so as NVIDIA Ampere claims to have extended support to single precision, instead of just the precisions of interest to the machine learning community.
[1] 'Tensor cores' is a poor name, likely from marketing, as the units really only compute matrix products of small, fixed sizes.
JAX is pretty popular in the research community. It also supports automatic gradient calculation and vectorized map, and it targets the GPU via the XLA backend.
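For a rough feel for those three features together, here's a tiny sketch (the toy loss function, names, and shapes are made up for illustration):

    # Tiny JAX sketch: automatic gradients plus a vectorized map, compiled by XLA.
    import jax
    import jax.numpy as jnp

    def loss(w, x):
        # toy scalar loss for a single input x and weights w
        return jnp.sum((x @ w) ** 2)

    grad_loss = jax.grad(loss)                              # gradient with respect to w
    batched_grad = jax.vmap(grad_loss, in_axes=(None, 0))   # vectorized over a batch of x
    fast = jax.jit(batched_grad)                            # compiled via XLA (GPU if available)

    w = jnp.ones(3)
    xs = jnp.arange(12.0).reshape(4, 3)                     # batch of four inputs
    print(fast(w, xs).shape)                                # (4, 3): one gradient per input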
NumPy (and CuPy) provide dense matrices. They're super awesome and certainly very useful for many kinds of problems, but they are not useful for storing adjacency matrices for sparse graphs. That is the point of the paper and the purpose of SuiteSparse and the GraphBLAS.
Dense matrices are great, and their implementation is straightforward: a contiguous chunk of memory holds every element of the matrix, so for an N-by-N square matrix the storage requirement is N^2, and finding an element is a simple matter of indexing math. For large adjacency matrices this is horribly inefficient, and the bigger the graph gets, the worse the cache and memory locality, since most elements end up being zero.
Hypersparse graphs, like, say, a large social network, may have only a few hundred billion edges, but fitting that into a dense adjacency matrix would require quadrillions of mostly empty elements. This is clearly impossible, so sparse matrices are required to store a large graph.
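To put rough numbers on that, here's a back-of-envelope sketch; the vertex and edge counts and the per-element sizes are illustrative assumptions, not figures from the paper:

    # Back-of-envelope storage comparison for an assumed graph with
    # 100 million vertices and 300 billion edges (illustrative numbers only).
    n_vertices = 100_000_000
    n_edges = 300_000_000_000

    # Dense adjacency matrix: one 8-byte value per cell, N^2 cells.
    dense_bytes = n_vertices ** 2 * 8

    # CSR sparse matrix: an 8-byte value and a 4-byte column index per edge,
    # plus an 8-byte row pointer per vertex.
    csr_bytes = n_edges * (8 + 4) + (n_vertices + 1) * 8

    print(f"dense: {dense_bytes / 1e15:,.0f} petabytes")   # ~80 PB
    print(f"CSR:   {csr_bytes / 1e12:,.1f} terabytes")     # ~3.6 TB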
The C++/CUDA backend to cuGraph contains many low-level graph operations on really sparse graph structures as well: https://github.com/rapidsai/cugraph
As recently announced at the HPEC 2020 GraphBLAS BOF, NVIDIA is contributing CUDA support to the SuiteSparse GraphBLAS library for sparse matrices. It would be interesting to see how the TCU work from this paper could influence that contribution!
Sparse matrix multiplication with an adjacency matrix is the same operation as one step of a breadth-first search across a graph. This is why sparse multiplication is so important to optimize.
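To make that concrete, here's a minimal BFS-by-sparse-multiply sketch using plain scipy.sparse rather than GraphBLAS, so an ordinary arithmetic product plus a threshold stands in for a boolean semiring (the tiny example graph is made up):

    # One BFS step == sparse matrix-vector product of the (transposed)
    # adjacency matrix with the current frontier, masked by the visited set.
    import numpy as np
    from scipy.sparse import csr_matrix

    # Toy directed graph with edges 0->1, 0->2, 1->3, 2->3.
    rows, cols = [0, 0, 1, 2], [1, 2, 3, 3]
    A = csr_matrix((np.ones(4), (rows, cols)), shape=(4, 4))

    frontier = np.zeros(4)
    frontier[0] = 1.0                       # start the BFS at vertex 0
    visited = frontier.astype(bool)

    while frontier.any():
        reached = (A.T @ frontier) > 0      # vertices reachable in one hop
        new = reached & ~visited            # drop vertices we've already seen
        visited |= new
        frontier = new.astype(float)

    print(visited)                          # all four vertices reachable from 0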
Dense matrices with N^2 space requirements simply can't hold or work with very large graphs, whereas the sparse (order N) and hypersparse (<< order N) techniques in GraphBLAS mean much larger graph networks can be stored.
A great video on the mathematical foundations of the GraphBLAS can be seen here:
https://www.youtube.com/watch?v=gZSNp6XbOK8
And to plug my own work, here's an introductory video on using the GraphBLAS with Python:
https://www.youtube.com/watch?v=JUbXW_f03W0