
> Most problems don't need atomics to solve

How do you handle a global, concurrent memory write and/or read (across many different blocks, maybe even across different grids)?

For example: let's say you have a global hash table and 30,000 CUDA threads are running. How do you insert data into the hash table safely?
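
Roughly: you can't, without atomics. The standard approach is lock-free insertion with atomicCAS. A minimal open-addressing sketch, assuming 32-bit keys, an EMPTY sentinel, and a power-of-two capacity (all names and sizes here are illustrative):

    #include <cstdint>

    #define EMPTY    0xFFFFFFFFu
    #define CAPACITY (1u << 20)   // power of two, so we can mask

    __device__ uint32_t hash_key(uint32_t k) {
        k ^= k >> 16; k *= 0x85ebca6bu; k ^= k >> 13;  // cheap mixer
        return k & (CAPACITY - 1);
    }

    // Lock-free insert: claim a slot with atomicCAS, linear-probe on
    // collision. If two threads insert the same key with different
    // values, the last value written wins.
    __device__ void hash_insert(uint32_t* keys, uint32_t* vals,
                                uint32_t key, uint32_t val) {
        uint32_t slot = hash_key(key);
        while (true) {
            uint32_t prev = atomicCAS(&keys[slot], EMPTY, key);
            if (prev == EMPTY || prev == key) {
                vals[slot] = val;  // slot now belongs to `key`
                return;
            }
            slot = (slot + 1) & (CAPACITY - 1);  // linear probing
        }
    }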

> Sure, shared memory is great, but not always needed and communicating through global memory can be fast if you stay inside the L1/L2 cache and hide latency.

Scan operations (sum, min, max, AND, OR, to name a few) are implemented through the parallel-prefix pattern (https://en.wikipedia.org/wiki/Prefix_sum). How do you plan to do that, if not through __shared__ memory?

This is a fundamental operation in almost all GPU code I've ever seen. Just pushing / popping to a stack will require a prefix-sum to determine the size across the workgroup.

If you can't do prefix-sum, you won't be able to do effective load-balancing on a GPU. This is something that'd normally take a dozen clock ticks, but if you do it over L2 you're looking at hundreds of clock ticks instead.
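
For concreteness, here's a minimal block-level inclusive scan in __shared__ memory, Hillis-Steele style. It assumes 256 threads per block; a full scan would add a second pass to combine the per-block totals:

    __global__ void block_scan(const int* in, int* out) {
        __shared__ int buf[256];           // one element per thread
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * 256 + tid];
        __syncthreads();

        for (int offset = 1; offset < 256; offset <<= 1) {
            int v = (tid >= offset) ? buf[tid - offset] : 0;
            __syncthreads();               // finish reads before writes
            buf[tid] += v;
            __syncthreads();
        }
        out[blockIdx.x * 256 + tid] = buf[tid];
    }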

------

Sorting networks are also probably best implemented in __shared__ memory... with warp-level intrinsics maybe beating them. (But warp-level programming is much, much harder and I prefer to avoid it.)
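
E.g. a minimal bitonic sort of one 256-key tile per block, entirely in __shared__ memory (names and sizes illustrative; launch with 256 threads per block):

    __global__ void bitonic_sort_tile(int* data) {
        __shared__ int s[256];
        int tid = threadIdx.x;
        s[tid] = data[blockIdx.x * 256 + tid];
        __syncthreads();

        for (int k = 2; k <= 256; k <<= 1) {        // merge size
            for (int j = k >> 1; j > 0; j >>= 1) {  // compare distance
                int partner = tid ^ j;
                if (partner > tid) {
                    bool asc = ((tid & k) == 0);    // run direction
                    if ((s[tid] > s[partner]) == asc) {
                        int t = s[tid];
                        s[tid] = s[partner];
                        s[partner] = t;
                    }
                }
                __syncthreads();
            }
        }
        data[blockIdx.x * 256 + tid] = s[tid];
    }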




As I said, these features are often not needed. You can implement e.g. a neural network library without needing atomic operations.

> How do you plan to do it, if not through __shared__ memory?

Can't you use workgroup barriers and global memory the same way you use __shared__ memory? It might be slower, but good caching should make it comparable, which should be the case for prefix sum (you read right after writing, so you should get a good cache-hit rate).
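
Something like this, i.e. the same block-level scan staged through a per-block scratch region in global memory. __syncthreads() orders global accesses within a block too, so the bet is only on the buffer staying hot in L1/L2 (a sketch, 256 threads per block assumed):

    __global__ void block_scan_global(const int* in, int* out,
                                      int* scratch) {
        int tid = threadIdx.x;
        int* buf = scratch + blockIdx.x * 256;  // per-block scratch
        buf[tid] = in[blockIdx.x * 256 + tid];
        __syncthreads();

        for (int offset = 1; offset < 256; offset <<= 1) {
            int v = (tid >= offset) ? buf[tid - offset] : 0;
            __syncthreads();                    // reads before writes
            buf[tid] += v;
            __syncthreads();
        }
        out[blockIdx.x * 256 + tid] = buf[tid];
    }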


> As I said, these features are often not needed.

A global parallel hash table is probably the fastest way to implement collision detection. It's pretty fundamental to manipulating 3D space, be it graphics, physics, or other such simulations.

Which, of course, run _GREAT_ on GPUs.
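
For a sketch of what that looks like: broad-phase collision detection via spatial hashing, where each thread bins its object into a hashed grid cell with an atomicAdd (the cell count, bucket capacity, and names are all illustrative):

    #define NUM_CELLS  (1u << 18)
    #define BUCKET_CAP 8

    __device__ uint32_t cell_hash(int x, int y, int z) {
        // classic large-prime XOR hash for 3D grid cells
        return (((uint32_t)x * 73856093u) ^ ((uint32_t)y * 19349663u)
              ^ ((uint32_t)z * 83492791u)) & (NUM_CELLS - 1);
    }

    __global__ void bin_objects(const float3* pos, float cell_size,
                                uint32_t* counts, uint32_t* buckets,
                                int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        uint32_t h = cell_hash((int)floorf(pos[i].x / cell_size),
                               (int)floorf(pos[i].y / cell_size),
                               (int)floorf(pos[i].z / cell_size));
        uint32_t slot = atomicAdd(&counts[h], 1u);   // claim a slot
        if (slot < BUCKET_CAP)
            buckets[h * BUCKET_CAP + slot] = (uint32_t)i;
        // overflow would need a fallback pass in a real system
    }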

--------------

I've written "Inner Join" on a GPU for fun. Yes, the SQL operator. It's pretty fast. Databases probably can run on GPUs and parallelize easily, but any database would need globally consistent reads/writes. Sure, GPUs have less RAM than a CPU, but GPU RAM is way faster, so that might actually be a net benefit if your data is between 200MB and 4GB in size.
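
Roughly the shape of it (a sketch, not the real thing: a hash join would beat this brute-force scan, but it shows where the globally consistent write comes in, namely the atomic output cursor):

    // One thread per row of R scans S; matches are appended through a
    // global atomic cursor, which is exactly the globally consistent
    // write a database output buffer needs.
    __global__ void inner_join(const int* r_keys, int nr,
                               const int* s_keys, int ns,
                               int2* out, unsigned int* out_count) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nr) return;
        for (int j = 0; j < ns; ++j) {
            if (r_keys[i] == s_keys[j]) {
                unsigned int k = atomicAdd(out_count, 1u);  // reserve slot
                out[k] = make_int2(i, j);                   // emit (i, j)
            }
        }
    }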

Use your imagination. Anywhere you'd use an atomic on a CPU is where you might use an atomic on a GPU.


Global hash table on GPUs sounds cursed and perverted. What you’re meant to do with them is run the same computation on all pixels in parallel independently! :P



