GPU Survival Toolkit for the AI age (hexmos.com)
291 points by lordwiz 10 months ago | 159 comments



The code in this article is incorrect. The CUDA kernel is never called: https://github.com/RijulTP/GPUToolkit/blob/f17fec12e008d0d37...

I'd also like to point out that 90% of the time taken to "compute" the Mandelbrot set with the JIT-compiled code is spent compiling the function, not computing.
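To make the comparison fair, the JIT-compiled function should be called once before timing so compilation isn't measured. A minimal sketch (my own stand-in kernel, not the article's code):

    import time
    import numpy as np
    from numba import njit

    @njit
    def escape_time(c, max_iter):
        # Stand-in for the article's per-pixel Mandelbrot computation.
        z = 0j
        for i in range(max_iter):
            z = z * z + c
            if abs(z) > 2.0:
                return i
        return max_iter

    escape_time(0.1 + 0.1j, 10)  # warm-up call: triggers JIT compilation

    start = time.perf_counter()
    for re in np.linspace(-2.0, 1.0, 500):
        for im in np.linspace(-1.5, 1.5, 500):
            escape_time(complex(re, im), 200)
    print(f"compute only: {time.perf_counter() - start:.3f} s")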

If you actually want to learn something about CUDA, implementing matrix multiplication is a great exercise. Here are two tutorials:

https://cnugteren.github.io/tutorial/pages/page1.html

https://siboehm.com/articles/22/CUDA-MMM
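For a taste of what those tutorials start from, here is a naive (one thread per output element, no tiling) matrix multiplication in Numba's CUDA dialect. It's a sketch assuming a CUDA-capable GPU, not production code; the tutorials above cover the tiling and memory-coalescing work that makes it fast:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def matmul_naive(A, B, C):
        # Each thread computes one element C[row, col].
        row, col = cuda.grid(2)
        if row < C.shape[0] and col < C.shape[1]:
            acc = 0.0
            for k in range(A.shape[1]):
                acc += A[row, k] * B[k, col]
            C[row, col] = acc

    n = 512
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)

    d_A, d_B = cuda.to_device(A), cuda.to_device(B)
    d_C = cuda.device_array((n, n), dtype=np.float32)

    threads = (16, 16)
    blocks = ((n + threads[0] - 1) // threads[0],
              (n + threads[1] - 1) // threads[1])
    matmul_naive[blocks, threads](d_A, d_B, d_C)

    np.testing.assert_allclose(d_C.copy_to_host(), A @ B, rtol=1e-3)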


>If you actually want to learn something about CUDA, implementing matrix multiplication is a great exercise.

There is SAXPY (scalar-vector math A*X+Y), purportedly ([1]) the hello world of parallel math code.

>SAXPY stands for “Single-Precision A·X Plus Y”. It is a function in the standard Basic Linear Algebra Subroutines (BLAS) library. SAXPY is a combination of scalar multiplication and vector addition, and it’s very simple: it takes as input two vectors of 32-bit floats X and Y with N elements each, and a scalar value A. It multiplies each element X[i] by A and adds the result to Y[i].

[1]: https://developer.nvidia.com/blog/six-ways-saxpy/
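Roughly, in Numba's CUDA dialect (a sketch; the linked post shows the CUDA C version):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def saxpy(a, x, y):
        i = cuda.grid(1)  # global thread index
        if i < x.shape[0]:
            y[i] = a * x[i] + y[i]

    n = 1 << 20
    d_x = cuda.to_device(np.ones(n, dtype=np.float32))
    d_y = cuda.to_device(np.full(n, 2.0, dtype=np.float32))

    threads = 256
    blocks = (n + threads - 1) // threads
    saxpy[blocks, threads](np.float32(3.0), d_x, d_y)
    print(d_y.copy_to_host()[:4])  # expect [5. 5. 5. 5.]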


Thank you for this, comments like yours are exactly why I keep coming back to HN.


Thanks a lot for pointing it out, I have fixed the code and updated the blog.


This article claims to be something every developer must know, but it's a discussion of how GPUs are used in AI. Most developers are not AI developers, nor do they interact with AI or use GPUs directly. Not to mention the fact that this article barely mentions 3D graphics at all, the reason GPUs exist in the first place.


One can benefit from knowing fundamentals of an adjacent field, especially something as broadly applicable as machine learning.

- You might want to use some ML in the project you are assigned next month

- It can help collaborating with someone who tackles that aspect of a project

- Fundamental knowledge helps you understand the "AI" stuff being marketed to your manager

The "I don't need this adjacent field" mentality feels familiar from schools I went to: first I did system administration where my classmates didn't care about programming because they felt like they didn't understand it anyway and they would never need it (scripting, anyone?); then I switched to a software development school where, guess what, the kids couldn't care about networking and they'd never need it anyway. I don't understand it, to me it's both interesting, but more practically: fast-forward five years and the term devops became popular in job ads.

The article is 1500 words at a rough count. Average reading speed is 250wpm, but for studying something, let's assume half of that: 1500/125 = 12 minutes of your time. Perhaps you toy around with it a little, run the code samples, and spend two hours learning. That's not a huge time investment. Assuming this is a good starting guide in the first place.


The objection isn't to the notion that "One can benefit from knowing fundamentals of an adjacent field". It's that this is "The bare minimum every developer must know". That's a much, much stronger claim.

I've come to see this sort of clickbait headline as playing on the prevalence of imposter-syndrome insecurity among devs, and try to ignore them on general principle.


Fair enough! I can kind of see the point that, if every developer knew some basics, it would help them make good decisions about their own projects, even if the answer is "no, this doesn't need ML". On the other hand, you're of course right that if you don't use ML, then it's clearly not something you "must" know to do your job well.


> Most developers are not AI developers

I remember how I joined a startup after working for a traditional embedded shop and a colleague made (friendly) fun of me for not knowing how to use curl to post a JSON request. I learned a lot since then about backend, frontend and infrastructure despite still being an embedded developer. It seems likely that people all around the industry will be in a similar position when it comes to AI in the next years.


Most AI work will just be APIs provided by your cloud provider in less than 2 years. Understanding what's going on under the hood isn't going to be that common, maybe the AI equivalent of "use explain analyze, optimize indexes" will be what passes for (engineering, not scientist) AI expert around that time.


Most things provided by your cloud provider are just slightly modified and pre-packaged versions of software you can run anyway. Postgres on EC2 is a perfectly viable alternative to whatever Amazon offers.


AWS Glue is just Apache Spark but you can’t debug it when it errors because all that is “conveniently” obfuscated from you.

SageMaker is JupyterLab with a GPU attached.

Cognito is just OAuth.

And of course networking is fucked… somehow AWS made it more complicated than the real thing, like they abstracted in the opposite direction.


If so it will create more consumers than creators


What do you think the industry will look like in the near future?


Not to mention their passing example of Mandelbrot set rendering only gets a 10x speedup, despite being the absolute posterchild of FLOPs-limited computation.

Terrible article IMO.


You would expect at least 1000x, and that's probably where it would be if they didn't include JIT compile time in their timing. Mandelbrot sets are a perfect example of a calculation a GPU is good at.


yeah a lot of assumptions were made that are inaccurate.

I agree that most developers are not AI developers... OP seems to be a bit out of touch with the general population and otherwise is assuming the world around them based on their own perception.


I've noticed that every time I see an article claiming that its subject is something "every developer must know", that claim is false. Maybe there are articles which contain information that everyone must know, but all I encounter is clickbait.


Understanding how hardware is used is very beneficial for programmers.

Lots of programmers started with an understanding of what happens physically on the hardware when code runs, and that is an unfair advantage when debugging at times.


> Understanding how hardware is used is very beneficial for programmers

I agree, but to say that all developers must know how AI benefits from GPUs is a different claim. One which is false. I would say most developers don't even understand how the CPU works, let alone modern CPU features like Data/Instruction Caching, SIMD instructions, and Branch prediction.

Most developers I encounter learned Javascript and make websites


Ok you’re entitled to that belief.

AI is also just software that runs on hardware.

GPUs are just hardware


Even worse, it says "GPUs", but isn't CUDA a closed feature limited to Nvidia cards, and maybe even a subset of them?

(I'm not touching Nvidia since they don't provide open source drivers.)


I would have probably opened it if it weren't for the title bait.


And honestly, for most "AI developers" if you are training your own model these days (versus using an already trained one) - you are probably doing it wrong.


Don't worry, you'll either be an AI developer or unemployed within 5 years. This is indeed important for you, regardless if you recognize this yet or not.


Just like crypto right


I think python is dominant in AI, because the python-C relationship mirrors the CPU-GPU relationship.

GPUs are extremely performant, and also very hard to code in, so people just use highly abstracted API calls like pytorch to command the GPU.

C is very performant, and hard to code in, so people just use python as an abstraction layer over C.

It's not clear if people need to understand GPUs that much (unless you are deep in AI training/ops land). In time, since Moore's law has ended and multithreading has become the dominant mode of speed increases, there'll probably be brand new languages dedicated to this new paradigm of parallel programming. Mojo is a start.


I've wondered for a while: is there a space for a (new?) language which invisibly maximises performance, whatever hardware it is run on?

As in, every instruction, from a simple loop of calculations onward, is designed behind the scenes so that it intelligently maximises usage of every available CPU core in parallel, and also farms everything possible out to the GPU?

Has this been done? Is it possible?


Not exactly it but Mojo sounds closest from available options

https://www.modular.com/mojo


There's definitely a space for it. It may even be possible. But if you consider the long history of Lisp discussions (flamewars?) about "a sufficiently smart compiler" and comparisons to C, or maybe Java vs C++, it seems unlikely. At least very, very difficult.

There are little bits of research on algorithm replacement. Like, have the compiler detect that you're trying to sort, and generate the code for quicksort or timsort. It works, kinda. There are a lot of ways to hide a sort in code, and the compiler can't readily find them all.


Not for mixed CPU/GPU, but there is the concept of a superoptimizer, which basically brute-forces the most optimal correct code. But it is not practical beyond very, very short program snippets (and they are usually CPU-only, though there is nothing fundamental preventing one from utilizing the GPU as well).

There is also https://futhark-lang.org/ , though I haven’t tried it, just heard about it.


I'm not sure that's even possible in principle; consider the various anti-performance algorithms of proof-of-waste systems, where every step is data-dependent on the previous one and the table of intermediate results required may be made arbitrarily big.

It's a bit like "design a zip algorithm which can compress any file".


I don’t see why such a “proof of waste” algorithm would be an obstacle to such an optimizer existing. Wouldn’t it just be that for such computational problems, the optimal implementation would still be rather costly? That doesn’t mean the optimizer failed. If it made the program as efficient as possible, for the computational task it implements, then the optimizer has done its job.


I'd imagine it wouldn't be very difficult to build language constructs that are able to denote when high parallelism is desirable; and let the compiler deal with this information as necessary.


I’m not sure if that’s a good idea at the moment, but we should start with making development with vector instructions more approachable. The code should look more or less the same as working with u64s.


You might be interested in https://github.com/HigherOrderCO/HVM


HVM looks very interesting. Thx for posting.


There are many languages doing that more or less. Jax and Mojo for example.


Moore’s law is far from over and multithreading is not the answer. Your opening sentence is spot on tho.


> Moore’s law is far from over and multithreading is not the answer.

Wut? We hit the power wall back in 2004. There was a little bit of optimization around the memory wall and ilp wall afterwards, but really, cores haven't gotten faster since.

It's been all about being able to cram more cores in since then, which implies at least multi-threading, but multi-processing is basically required to get the most out of a cpu these days.


Moore's law is "the observation that the number of transistors in an integrated circuit doubles about every two years". For a while clock speed was a proxy for that metric, but it's not the 'law' itself.


Yeah, but today number of cores is the rough proxy for that metric.

How do you operate in that world if "multithreading isn't the answer"?


Modern CPUs contain a lot more computing units than cores. For a while, hyperthreading was thought to be a useful way to make use of them. More recently, people have turned to advanced instruction sets like SSE and AVX.


Those things aren’t mutually exclusive. Also, I suspect there’s more “low-hanging fruit” in making more software make use of more cores. We’re increasingly getting better languages, tooling and libs for multithreading stuff, and it’s far more in the realm of your average developer than writing SIMD-compatible code and making sure your code can pipeline properly.


Threads are of course appropriate to implement high-level concurrency and parallelism. But for fine-grained parallelism, they are unwieldy and have high overhead.

Spreading an algorithm across multiple threads makes it more difficult for an optimizing compiler to find opportunities for SIMD optimization.

Similarly to how modern languages make it easier to safely use threads, runtimes also make it easier to take advantage of SIMD optimizations. For example, recently a SIMD-optimized sorting algorithm was included in OpenJDK. Apart from that, SIMD is way less brittle at runtime than GPUs and other accelerators.


Care to elaborate?


No idea about the future of the Moore's law precisely. Yet recent research results show that there is still room for faster semiconductors, as discussed on HN [0].

[0]: https://news.ycombinator.com/item?id=38201624


I mean why isn’t multithreading the answer?


Multithreading has the following disadvantages:

* Overhead: the overhead to start and manage multiple threads is considerable in practice. Most multithreaded algorithms are in fact slower than optimized serial implementations when n_threads=1

* Communication: threads have to communicate with each other and synchronize access to shared resources. "Embarrassingly parallel" problems don't require synchronization, but many interesting problems are not of that kind.

* Amdahl's law: there is a point of diminishing returns on parallelizing an application since it quite likely contains parts that are not easily parallelized.
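To put a number on that last point, a tiny plain-Python illustration of Amdahl's law (the 95%-parallel figure is just an example):

    def amdahl_speedup(p, n):
        # p: parallelizable fraction of the work, n: number of threads.
        # Speedup = 1 / ((1 - p) + p / n)
        return 1.0 / ((1.0 - p) + p / n)

    for n in (2, 8, 64, 1024):
        print(f"{n:5d} threads -> {amdahl_speedup(0.95, n):.1f}x")
    # 2 -> 1.9x, 8 -> 5.9x, 64 -> 15.4x, 1024 -> 19.6x; the ceiling is 1/0.05 = 20x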


You forgot Wirth's law: software complexity compensates for the increase in hardware speed.


Well, yes. But assuming the question is how to reduce overall latency, what alternative is there besides algorithmic improvements or increasing clock frequency?


Recently, SIMD has been increasingly used to make use of the increased number of compute units in a CPU. Sure, it's not as easy and straightforward to use as multithreading (I never expected to say that about multithreading), but libraries and programming languages make more and more use of them.

Edit: latency is difficult. But accelerating CPUs and using GPUs for compute was never about latency. Most I/O bottlenecks are because CPUs have sped up so much and left the rest of the platform in the dust. Much of it is also due to fundamental limitations from the speed of light. Increasing throughput is always easier than reducing latency.


I guess I was thinking more of multithreading as an umbrella term for parallelism but I see your point. If you squint, SIMD and multithreading are the same problem. What I mean is, if you have a computable function you want to evaluate on a finite amount of bits, the problem of how to physically layout and schedule the computation to minimize latency is very similar to maximizing throughput.

From a circuit complexity standpoint, you could evaluate a wide but shallow circuit to evaluate the function in a nanosecond, or a deep but narrow circuit that takes eons. Whether parts of that circuit are evaluated synchronously or asynchronously is immaterial, although synchronous computation does seem easier to reason about and the UX is more user-friendly from a programmability standpoint.

I agree with you the fundamental limitation is the speed of light if you are width-bounded (i.e., if physical space is the dominating constraint).


GPUs aren't that difficult to program. CUDA is fairly straightforward for many tasks and in many cases there is an easy 100x improvement in processing speed just sitting there to be had with <100 lines of code.


Sure, but if you've never used CUDA or any other GPU framework, how many lines of documentation do you need to read, and how many lines of code are you likely to write and rewrite and delete before you end up with those <100 lines of working code?


They have nice getting started guides. Try it and see. It's... really pretty simple. There is a reason they've built a trillion $$ company - they've done a great job.


Ok, I tried it (for as long as my limited free time and interest in CUDA allows). The closest I came to a getting started guide is this [0], which by my (perhaps naive) count is 25561 lines of documentation, and I would probably need to learn more C++ to understand it in detail.

I'm sure CUDA is great, and if I had more free time and/or better reasons to improve the performance of my code it would probably be great for me. My point was mainly that a few lines of code which may be trivial for one person to write may not be for someone else with different experience. Depending on what the code is being used for even a vast increase in performance may not be worth the extra time it takes to implement it.

[0] https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....



I haven’t seen it but I can believe it - parallel programming of any variety is hard, and to be successful as a vendor of such a system would require uncommonly good API design to get the kind of traction that CUDA has gotten.


> Sure, but if you've never used CUDA or any other GPU framework, how many lines of documentation do you need to read, and how many lines of code are you likely to write and rewrite and delete before you end up with those <100 lines of working code?

If you're already familiar with one of the languages that the nvidia compiler supports? Not that many. For people familiar with C or C++, it's a couple extra attributes, and a slightly different syntax for launching kernels vs a regular function call. I'm admittedly not experienced with Fortran, which is the other language they support, so I can't speak to that. There are C-style memory allocation functions, which might be annoying to C++ devs, but it's nothing that would confuse them.

Edit: There's also a couple weird magic globals you have access to in a kernel (blockIdx, blockDim, threadIdx), but those are generally covered in the intros.
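For anyone following along from Python rather than C or C++: Numba's CUDA dialect exposes the same magic globals and a similar launch syntax. A rough sketch (my example, not from the article):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def double_elements(arr):
        # The same "magic globals", spelled the Numba way.
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        if i < arr.shape[0]:
            arr[i] *= 2.0

    d_arr = cuda.to_device(np.arange(1024, dtype=np.float32))
    double_elements[4, 256](d_arr)   # 4 blocks of 256 threads, cf. kernel<<<4, 256>>>(...) in CUDA C
    print(d_arr.copy_to_host()[:3])  # [0. 2. 4.]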


My experience is different. I had to code various instances of "parallel reduction" and "prefix sum" and it's not easy to get into (took me a day or two). Moreover, coming from an age where 640KB of RAM were considered quite enough, truly realizing the power of the GPU was not easy because my tasks are not quite parallel and not quite thread-coherent (I grant you that doing naturally parallel stuff is dead simple). It took me a while and a lot of nvidia-nsight to max out the GPU... Moreover, I was a bit slow to actually understand how powerful a GPU is (for example, my GPU gave poor performance unless I gave it a big enough problem, so I was (wrongly) disappointed when testing toy problems).

But once that challenge is overcome, the GPU truly rocks.

Finally, debugging complex shaders (I do a specific kind of computational fluid dynamics where the equations are not that easy, full of "if/then" edge cases, etc.) is not fun at all; tooling is sorely missed (unless I've missed something).


This is fair, and if you've got the time and inclination, I'd love to hear about your experience and the tricks you ended up pulling. There are definitely advanced areas of CUDA, and you can go deeper on nearly anything.

But we are in a comment chain spawned by:

> CUDA is fairly straightforward for many tasks and in many cases there is an easy 100x improvement in processing speed just sitting there to be had with <100 lines of code.

And a follow up comment about how easy it would be to write that "<100 lines of code", so I feel like we're definitely talking about the easy case of naturally parallel calculations, and sticking to that as an intro seems fair to me.


There are libraries that help with more tricky stuff on-device like cub or cuFFTDx.


Sure, but there's value in understanding how to do it yourself, even if you don't use it at work.

And there's also value in seeing how other people approached a problem.


Parallel reduction and prefix sums are exercises to train people how to reformulate algorithms for the GPU and how to identify performance bottlenecks specific to GPUs. In practice, you'd use library functions.
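For instance, in the Numba world a sum reduction is a one-liner rather than a hand-written shared-memory kernel. A sketch (assuming a CUDA-capable device):

    import numpy as np
    from numba import cuda

    @cuda.reduce
    def sum_reduce(a, b):
        # Binary op; Numba generates the tree-reduction kernel.
        return a + b

    data = np.arange(1_000_000, dtype=np.float64)
    print(sum_reduce(data), data.sum())  # the two results should agree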


You can get C performance (I'd argue that C's lack of abstraction makes slow but simple code more appealing) with pythonic expressiveness pretty easily with a more modern language.


> C is very performant, and hard to code in, so people just use python as a abstraction layer over C.

C is a way of life. Those of us who code exclusively, or nearly so, in C cannot stomach python's notion of "significant white-space."


I code in both (c for hobbies and python professionally) and “significant white space” is a non-issue if you spend any amount of time getting used to it.

Complaining about significant white-space is like complaining that lisp has too many parentheses. It’s an aesthetic preference that just doesn’t matter in practice.


A form of Sayre's Law is very common in tech (eg spaces vs tabs; framework vs framework; language vs language).


I actually find python and C very similar in spirit.

Syntax is mostly an irrelevance, they have surprisingly similar patterns in my opinion.

In a modern language I want a type system that both reduces risk and reduces typing — safety and metaprogramming. C obviously doesn't, python doesn't really either.

Python's approach to dynamic-ness is very similar to how I'd expect C to be as a dynamic language (if it had proper arrays/lists).


You get used to the significant whitespace. (C programmer since ~1978).


I never did, and it's one of the things I hate most about Python to this day. I still use Python because it's the best tool a lot of the time, but it's such a terrible language decision to have significant whitespace imo.


> cannot stomach python's notion of "significant white-space."

Why belly ache about it? Whitespace is significant to one’s fellow humans.


Precisely why it should be of no significance to the machine.


Source code is not for the machine to read, it’s for your fellow humans


That doesn't mean that source code has no meaning to the interpreter.


All programming languages, including C, have significant white space. Python just has slightly more.


Wait till you start using the black formatter tool.

Well-known for supporting any formatting style you like ;)


> When faced with multiple tasks, a CPU allocates its resources to address each task one after the other

Ha! I wish CPUs were still that simple.

Granted, it is legitimate for the article to focus on the programming model. But "CPUs execute instructions sequentially" is basically wrong if you talk about performance. (There are pipelines executing instructions in parallel, there is SIMD, and multiple cores can work on the same problem.)


I think this post focused on the wrong things here. CPUs with AVX-512 also have massive data parallelism, and CPUs can execute many instructions at the same time. The big difference is that CPUs spend a lot of their silicon and power handling control flow to execute one thread efficiently, while GPUs spend that silicon on more compute units and hide control flow and memory latency by executing a lot of threads.


It will do multiple SIMD instructions at the same time, too.


"CPUs are good at serial code and GPUs are good at parallel code" is kind of true, but something of an approximation. Assume an equivalent power budget, in the roughly hundreds-of-watts range; then:

A CPU has ~100 "cores", each running one (and a hyperthread) independent thing, and it hides memory latency with branch prediction and pipelining.

A GPU has ~100 "compute units", each running ~80 independent things interleaved, and it hides memory latency by executing the next instruction from one of the other 80 things.

Terminology is a bit of a mess, and the CPU probably has a 256bit wide vector unit while the GPU probably has a 2048bit wide vector unit, but from a short distance the two architectures look rather similar.


GPU has 10x the memory bandwidth of the CPU though, which becomes relevant for the LLMs where you essentially have to read the whole memory (if you're batching optimally, that is using all the memory either for weights or for KV cache) to produce one token of output.


GPUs also have 10x-100x FP/INT8 throughput watt-for-watt.


GPU also has 10x memory latency compared to CPU.

And memory access order is much more important than on a CPU. Truly random access has very bad performance.


I’m always surprised there isn’t a movement toward pairing a few low-latency cores with a large number of high-throughput cores. Surround a single Intel P core with a bunch of E cores. Then, hanging off the E cores, stick a bunch of iGPU cores and/or AVX-512 units.

Call it Xeon Chi.


I think one possible reason for that is that, ideally, these things need different kinds of memory.

If you use high-bandwidth, high-latency GDDR memory, CPU cores will underperform due to the high latency, as here: https://www.tomshardware.com/reviews/amd-4700s-desktop-kit-r...

If you use low-latency memory, GPU cores will underperform due to low bandwidth; see modern AMD APUs with many RDNA3 cores connected to DDR5 memory. On paper, the Radeon 780M delivers up to 9 FP32 TFLOPS, a figure close to the desktop Radeon RX 6700, which is substantially faster in gaming.


Hmm, that is a good point. Since it is a dream-computer anyway, maybe we can do 2.5d packaging; put the ddr memory right on top so the P cores can reach it quickly, then surround the whole thing with GDDR.


Neat idea, probably even viable!

I think they may have a hurdle of getting folks to buy into the concept though.

I imagine it would be analogous to how Arria FPGA’s were included with certain Xeon CPU’s. Which further backs up your point that this could happen in the near future!


You mean like an iGPU?

Edit: Oh, thanks for the downvote, with no discussion of the question. I'll just sit here quietly with my commercial OpenCL software that happily exploits these vector units attached to the normal CPU cores.


I’m not sure who downvoted; I think it isn’t possible to downvote a response to one’s comment.

I did decide not to engage because “you mean like <very common well known thing>” seemed a bit brusque and dismissive, commenting on this site is just for fun, so I don’t really see the point in continuing a conversation that seems like it is getting off on the wrong foot.


Nx / Axon

Given that most programming languages are designed for sequential processing (like CPUs), while Erlang/Elixir is designed for parallelism (like GPUs) … I really wonder if Nx / Axon (Elixir) will take off.

https://github.com/elixir-nx/


Erlang was designed for distributed systems with a lot of concurrency, not for computation-heavy parallelism.


I am really wondering how well Elixir with Nx would perform for computation heavy workloads on a HPC cluster. Architecturally, it isn't that dissimilar to MPI, which is often used in that field. It should be a lot more accessible though, like numpy and the entire scientific python stack.


I've been investigating this and I wonder if the combination of Elixir and Nx/Axon might be a good fit for architectures like NVIDIA Grace Hopper where there is a mix of CPU and GPU.


Would that run on a GPU? I think the future is having both. Sequential programming is still the best abstraction for most tasks that don’t require immense parallel execution


Axon runs compute graphs on the GPU, but Elixir's parallelism abstractions run on the CPU.


I need a buyers guide: what's the minimum to spend, and best at a few budget tiers? Unfortunately that info changes occasionally and I'm not sure if there's any resource that keeps on top of things.



Google Colab, Kaggle Notebooks and Paperspace Notebooks all offer free GPU usage (within limits), so you do not need to spend anything to learn GPU programming.

https://colab.google/

https://www.kaggle.com/docs/notebooks

https://www.paperspace.com/gradient/free-gpu


For learning basics of GPU programming your iGPU will do fine. Actual real-world applications are very varied of course.


You can also rent compute online if you don’t want to immediately plop down 1-2k.


We’re back to “every developer must know” clickbait articles?


Although I think they'll be replaced by ChatGPT, a good article in that style is actually quite valuable.

I like attacking complexity head on, and have a good knowledge of both quantitative methods & qualitative details of (say) computer hardware so having an article that can tell me the nitty gritty details of a field is appreciated.

Take "What every programmer should know about memory" — should every programmer know? Perhaps not, but every good programmer should at least have an appreciation of how a computer actually works. This pays dividends everywhere — locality (the main idea that you should take away from that article) is fast, easy to follow, and usually a result of good code that fits a problem well.


it seems so... Should take this article's statements with a grain of salt.


Good read. However, the AWS P5 instance (along with P4d and P4de) is most certainly oriented towards training, not inference. The most inference-friendly instance types are the G4dn and the G5, which feature T4 and A10G GPUs, respectively.


Came here to say this author forgot G5.


I am very new to GPU programming in general and this article was a fun read. It's amazing how far we've come, prime example being able to train a simple "dog or cat" NN that easily.


Are AMD GPUs still to be avoided, or are they workable at this point?


The cuda happy path is very polished and works reliably. The amdgpu happy path fights you a little but basically works. I think the amd libraries starting to be packaged under Linux is a big deal.

If you don't want to follow the happy path, on Nvidia you get to beg them to maybe support your use case in future. On amdgpu, you get the option to build it yourself, where almost all the pieces are open source and pliable. The driver ships in Linux. The userspace is on GitHub. It's only GPU firmware which is an opaque blob at present, and that's arguably equivalent to not being able to easily modify the silicon.


AMD GPUs work great, the issue is that people don't want to mess with ROCm/HIP when CUDA is kind of the documented workflow. Along with the fact that ROCm was stagnant for a long time. AMD missed the first AI wave, but are now committed to making ROCm into the best it can be.

The other problem is that there aren't any places to rent the high end AMD AI/ML GPUs, like the MI250's and soon to be released MI300's. They are only available on things like the Frontier super computer, which few developers have access to. "regular" developers are stuck without easy access to this equipment.

I'm working on the latter problem. I'd like to create more of a flywheel effect. Get more developers interested in AMD by enabling them to inexpensively rent and do development on them, which will create more demand. @gmail if you'd like to be an early adopter.


We have compilers (languages) like Futhark that aim to optimize explicitly parallel operations. And universal models of computation like interaction nets that are inherently parallel.

Am I lazy to expect we’ll be getting a lot more “parallel-on-the-GPU-for-free” in the future?


You can already convert a compute graph to GPU-optimized code using something like Aesara (formerly known as Theano) or TensorFlow. There are also efforts in the systems space that ought to make this kind of thing more widespread in the future, such as the MLIR backend for LLVM.


The Mandelbrot example seems to make interpreted Python stand in for "the CPU performance"?

If that's true, then I'm surprised they only see a 10x speed-up. I would expect more from only compiling that loop for the CPU. (Comparing to interpreted Python without numpy.) Given they already have a numba version, why not compile it for the CPU and compare?

Also, they say consumer CPUs have 2-16 cores. (Who has 2 cores these days?) They go on to suggest renting an AWS GPU for $3 per hour. You're more likely to get 128 cores for that price, still on a single VM.

Not saying it will be easy to write multi-threaded code for the CPU. But if you're lucky, the Python library you're using already does it.
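For reference, the CPU baseline I'd want to see looks roughly like this (a sketch with made-up grid sizes, not the article's code); Numba can spread the row loop across all cores with prange:

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def mandelbrot_cpu(width, height, max_iter):
        out = np.zeros((height, width), dtype=np.int32)
        for row in prange(height):  # rows are split across CPU cores
            for col in range(width):
                c = complex(-2.0 + 3.0 * col / width, -1.5 + 3.0 * row / height)
                z = 0j
                count = 0
                while abs(z) <= 2.0 and count < max_iter:
                    z = z * z + c
                    count += 1
                out[row, col] = count
        return out

    mandelbrot_cpu(64, 64, 10)  # warm-up so compile time isn't measured
    # then time mandelbrot_cpu(2000, 2000, 200) against the GPU version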


> Also, they say consumer CPUs have 2-16 cores. (Who has 2 cores these days?)

Pretty sure my mom's laptop has 2 cores; I can't think of anyone whose daily driver has 16 cores. Real cores, not hyperthread stuff running at 0.3× the performance of a real core.

As for the 128-core server system, note that those cores are typically about as powerful as a 2008 notebook. My decade-old laptop CPU outperforms what you get at DigitalOcean today, and storage performance is a similar story. The sheer number makes up for it, of course, but "number of cores" is not a 1:1 comparable metric.

Agree, though, that the 10x speedup seems low. Perhaps, at 0.4s, a relatively large fraction of that time is spent on initializing the Python runtime (`time python3 -c 'print("1337")'` = 60ms), the module they import that needs to do device discovery, etc.? Hashcat, for example, takes like 15 seconds to get started even if it then runs very fast after that.


My 2016 desktop had 22 cores and 44 threads. You can have the same processor for < $200 on ebay right now.


Cool, now I know one person :D

I don't think the price matters btw. If 4 cores is all I need (I can't think of a desktop application that I use which benefits from more than 4 cores more than from faster cores, which is typically the trade-off at >4 real cores) then that's what I'd get because that's the optimum for my wallet and performance profile.


Another tip that took me longer than I wished to figure out.

Use CUDA rather than graphics APIs + compute. The latter (Vulkan compute, etc.) is high friction. CUDA is far easier to write in, and the resulting code is easier to maintain and update.


Yes, and in the process, contribute to Nvidia’s monopoly.


Very few professional artists refuse to use Photoshop or After Effects because it will "contribute to Adobe's monopoly".

But for some reason professional programmers are judged under a much higher moral standard.


I think because professional artists can’t typically make their software tools. Whereas engineers could in theory make their own tools. Naturally few do in practice though as tech has become far too large and specialized. But our roots are where our values and ideals come from.


That is a really theoretical point.

If I start to work on a tool, then I cannot work anymore on what I actually wanted to do. And it just so happens that this is exactly what I did, and I can say it usually takes way longer than the most pessimistic estimate one can come up with. So yes, one can decide to switch careers and try to get funding to (re)build what is not offered under acceptable conditions (though in my case the tool simply did not exist).

Just like an artist can switch careers, study CS, build on his own a tool that a professional company built with a team over years, and then someday use his tool to accomplish his original work. In (simplified) theory, lots of things are possible...


> But for some reasons professional programmers are judged under a much higher moral standard

Not in the real world. Most programmers who are trying to get a job done won’t avoid CUDA or AWS or other tools just to avoid “contributing to a monopoly”. When responsible programmers have a job to do and tools are available to help with the job, they get used.

A programmer who avoids mainstream tools on principle is liable to get surpassed by their peers very quickly. I’ve only met a few people like this in industry and they didn’t last very long trying to do everything the hard way just to avoid tools from corporations or monopolies or open source that wasn’t pure enough for their standards.

It’s only really in internet comment sections that people push ideological purity like this.


The same attitude brought us the adoption of Linux. So IDK.


Most lottery tickets aren't winners.


Tool choice of artists has close to 0 impact on people interacting with final work. Choices made by programmers are amplified through the users of produced software.


> But for some reason professional programmers are judged under a much higher moral standard.

I believe the key word there is "professional" -- one of the challenges of a venue like HN is the professional engineers and the less-professional ones interact from worldviews and use cases so distinct that they may as well be separate universes. In other spaces, we wouldn't let a top doctor have to explain very basic concepts about the commercial practice of medicine to an amateur "skeptic" and yet so many discussions on HN degenerate along just these lines.

On the other hand, it's that very same inclusiveness and generally high discourse in spite of that wide expanse which make HN such a special community, so I'm not sure what to conclude besides this unfortunate characteristic being a necessary "feature, not a bug" of the community. There's no way around it that wouldn't make the community a lesser place, I think.


And they ended up with Creative Cloud bloatware.


It isn't the consumer's responsibility to regulate the market.


And, in this case NVIDIA earned it. They built a very useful software layer around their chips.


Then whose responsibility is it? Corporations? The government? Or maybe the tooth fairy?


That is a weak argument that could be used to justify tons of behavior, very convenient.

Vote with your feet. Maybe you can't or can't afford it, then at least admit the problem to yourself and maybe don't try to persuade others in order to feel better for your own decision.


If consumers don't care about their money, then who would?


Eh, CUDA can mostly be transformed to HIP, unless you use specialized NVIDIA stuff.


Is there something that is not vendor specific? Maybe a parallel programming language that compiles to different targets?

..and doesn't suck.


The part that I don't understand is why AMD/Intel/somebody else don't just implement at least the base CUDA for their products.

HIP is basically that, but they still make you jump through hoops to rename everything etc.

There are libraries written at a lower level that wouldn't be immediately portable, but surely that could be addressed over time as well.


So sayeth the person who has never written OpenCL.


I'm with you, and am surprised there isn't comparable competition.


I wish someone would make the latter much easier. As someone who got interested in this AI stuff lately (and I'm definitely not the only one), I have a high-tier AMD card that could surely handle it, and I'd like to run things locally for various reasons.

Currently I've given up and use runpod, but still...


I agree cuda is really nice to write in, but what reason do you have to write raw cuda I’m curious? Usually I find that it’s pre written kernels you deal with


Currently doing computational chemistry, but per the article, it's a fundamental part of my toolkit going forward; I think it will be a useful tool for many applications.


What about on macOS? Is OpenCL viable?


A great beginner guide to GPU programming concepts:

https://github.com/srush/GPU-Puzzles


I don’t think transformer models generate multiple tokens in parallel (how could they?)

They just leverage parallelism in making a single prediction


Transformers tend to be trained in parallel. BERT = 512 tokens per context, in parallel. GPT too is trained while feeding in multiple words in parallel. This enables us to build larger models. Older models, such as RNNs, couldn't be trained this way, limiting their power/quality.


This is only sort of true, since you can still train RNNs (including LSTM, etc.) in big batches-- which is usually plenty enough to make use of your GPU's parallel capabilities. The inherently serial part only applies to the length of your context. Transformer architectures thus happen to be helpful if you have lots of idle GPU's such that you're actually constrained by not being able to parallelize along the context dimension.


In RNNs, hidden states have to be computed sequentially; in transformers with the attention mechanism, we break free of the sequential requirement. Transformers are more amenable to parallelism and make use of GPUs the most (within the context axis and outside it).


Ahh, that makes a lot of sense


A quick search uncovers [0] with a hint towards an answer: just train the model to output multiple tokens at once.

[0] https://arxiv.org/abs/2111.12701


Shouldn't the article mention SIMD? I haven't seen it even being brought up.


Can I do this in my GeForce GTX 970 4GB?


> AWS GPU Instances: A Beginner's Guide [...] Here are the different types of AWS GPU instances and their use cases

The section goes on to teach Amazon-specific terminology and products.

A "bare minimum everyone must know" guide should not include vendor-specific guidance. I had this in school with Microsoft already, with never a mention of Linux because they already paid for Windows Server licenses for all of us...

Edit: and speaking of inclusivity, the screenshots-of-text have their alt text set to "Alt text". Very useful. It doesn't need to be verbatim copies, but it could at least summarize in a few words what you're meant to get from the terminal screenshot to help people that use screen readers.

Since this comment floated to the top, I want to also say that I didn't mean for this to dominate the conversation! The guide may not be perfect, but it helped me by showing how to run arbitrary code on my GPU. A few years ago I also looked into it, but came away thinking it's dark magic that I can't make use of. The practical examples in both high- and low-level languages are useful

Another edit: cool, this comment went from all the way at the top to all the way at the bottom, without losing a single vote. I agree it shouldn't be the very top thing, but this moderation also feels weird


Agreed. This isn’t actually that useful of a guide in the first place.

Tbh the most basic question is: “are you innovating inside the AI box or outside the AI box?”

If inside - this guide doesn’t really share anything practical. Like, if you’re going to be tinkering with a core algorithm and trying to optimize it, understanding BLAS and cuBLAS (or whatever the AMD / Apple / Google equivalent is), then understanding what pandas, torch, numpy and a variety of other tools are doing for you, and then being able to wield these effectively makes more sense.

If outside the box - understanding how to spot the signs of inefficient use of resources - whether that’s network, storage, accelerator, CPU, or memory - and then reasoning through how to reduce that bottleneck.

Like - I’m certain we will see this in the near future, but off the top of my head, the innocent but incorrect things people do:

1. Sending single requests, instead of batching

2. Using a synchronous programming model when asynchronous is probably better

3. Sending data across a compute boundary unnecessarily

4. Sending too much data

5. Assuming all accelerators are the same. That T4 GPU is cheaper than an H100 for a reason.

6. Ignoring bandwidth limitations

7. Ignoring access patterns


Are there any surveys of just how many Windows Servers boxes exist?

Even when I was working at an Azure-only shop, I've never actually seen anyone use Windows Server. Lots of CentOS (before IBM ruined it) and other Unixes, but never a Windows Server.


We come across them all the time when doing internal network pentests (most organizations use AD for managing their fleet of end-user systems), and occasionally external tests as well. Stackoverflow is a site that comes to mind as being known for running their production systems on Windows Server.

It's useful to have experienced, but I do take issue with exclusively (or primarily) focusing on one ecosystem as a mostly-publicly-funded school.


Huh, TIL StackOverflow is on Windows Server


In this AI age, it is crucial for developers to have a fundamental understanding of GPUs and their application to AI development.


Crucial, for all developers? The great majority will get AI through an API.


After a while there will come a time when you will be the one providing the API to others; for that, yes, it's crucial I guess.



