
Sparse matrix math basically boils down to indirect array references: A[B[i]]. GPUs generally trade memory latency for bandwidth, relying on having lots of independent work in flight to hide that latency. But because there's no work between the first load and the second, there is nothing left to hide the second load's latency behind.
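
To make the pattern concrete, here's a minimal CUDA sketch of CSR sparse matrix-vector multiply (the names and layout are illustrative, not from any particular library). The load of x[...] can't even be issued until col_idx[...] comes back:

    // Each thread computes one row of y = A*x, with A in CSR format.
    __global__ void spmv_csr(const int *row_ptr, const int *col_idx,
                             const float *vals, const float *x,
                             float *y, int num_rows)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= num_rows) return;

        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j) {
            // First load: col_idx[j]. Second load: x[col_idx[j]].
            // The second depends on the first -- the A[B[i]] pattern.
            sum += vals[j] * x[col_idx[j]];
        }
        y[row] = sum;
    }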

CPUs, by contrast, have a deep cache hierarchy aimed at minimizing memory latency, so the second load doesn't take nearly as long as it does on a GPU.

Yeah, on the GPU you ideally want adjacent threads to load consecutive memory locations (coalesced access) to utilize the memory bandwidth properly. Random indexing blows this out of the water. I guess you could pre-process on the CPU, though, to pack the sparse data into a more GPU-friendly layout, as in the sketch below...
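
Rough sketch of what that packing could buy you. ELLPACK is one assumed choice among several (the format and names here are mine, not the parent's): store the matrix column-major so at step k, thread `row` reads index k * num_rows + row, and adjacent threads hit adjacent addresses. The gather from x[] is still random, but the index/value traffic now coalesces:

    // ELLPACK SpMV: rows padded to max_nnz_per_row, stored column-major.
    // Short rows are padded with col = -1.
    __global__ void spmv_ell(const int *ell_cols, const float *ell_vals,
                             const float *x, float *y,
                             int num_rows, int max_nnz_per_row)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= num_rows) return;

        float sum = 0.0f;
        for (int k = 0; k < max_nnz_per_row; ++k) {
            int idx = k * num_rows + row;   // column-major: coalesced loads
            int col = ell_cols[idx];
            if (col >= 0)                   // skip padding
                sum += ell_vals[idx] * x[col];
        }
        y[row] = sum;
    }

The CPU-side conversion from CSR to this layout is cheap and can be done once if the sparsity pattern is reused across many multiplies.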


You can work around this by using cuckoo or robin hood hashing. See for example: https://www.researchgate.net/scientific-contributions/148064...
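
For what it's worth, the reason cuckoo hashing helps here (the sketch below is my own illustration, not code from the linked paper): each key has a fixed set of candidate slots computed from the key alone, so the probe loads are independent of each other and can all be in flight at once, instead of forming a chain of dependent loads:

    // Cuckoo hash lookup with two candidate slots per key.
    // Hash constants are arbitrary illustrative multipliers.
    __device__ unsigned h1(unsigned k) { return k * 2654435761u; }
    __device__ unsigned h2(unsigned k) { return k * 40503u + 2166136261u; }

    __global__ void cuckoo_lookup(const unsigned *table_keys,
                                  const unsigned *table_vals,
                                  unsigned capacity,
                                  const unsigned *queries,
                                  unsigned *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        unsigned key = queries[i];
        unsigned s1 = h1(key) % capacity;
        unsigned s2 = h2(key) % capacity;

        // Both slot addresses depend only on the key, so these two
        // loads are independent and can overlap. At most two probes
        // per lookup, ever.
        unsigned k1 = table_keys[s1];
        unsigned k2 = table_keys[s2];

        if (k1 == key)      out[i] = table_vals[s1];
        else if (k2 == key) out[i] = table_vals[s2];
        else                out[i] = 0xFFFFFFFFu;   // not found
    }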
