Hacker News new | past | comments | ask | show | jobs | submit login

Here's a rather trivial example of using PTX: https://docs.nvidia.com/cuda/parallel-thread-execution/#spec...

For various micro-bench reasons I wanted to use a global clock instead of an SM-local one, and I believe this was needed. Also note that even CUDA has "lower level"-like operations, e.g. warp primitives. PTX itself is super easy to embed in it like asm.






Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: