GPU Survival Toolkit for the AI age (hexmos.com)
291 points by lordwiz 10 months ago | 159 comments



The code in this article is incorrect. The CUDA kernel is never called: https://github.com/RijulTP/GPUToolkit/blob/f17fec12e008d0d37...

I'd also like to point out that 90% of the time taken to "compute" the Mandelbrot set with the JIT-compiled code is spent compiling the function, not computing.
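To make the comparison fair, the JIT-compiled function should be called once before timing so compilation isn't measured. A minimal sketch (my own stand-in kernel, not the article's code):

    import time
    import numpy as np
    from numba import njit

    @njit
    def escape_time(c, max_iter):
        # Stand-in for the article's per-pixel Mandelbrot computation.
        z = 0j
        for i in range(max_iter):
            z = z * z + c
            if abs(z) > 2.0:
                return i
        return max_iter

    escape_time(0.1 + 0.1j, 10)  # warm-up call: triggers JIT compilation

    start = time.perf_counter()
    for re in np.linspace(-2.0, 1.0, 500):
        for im in np.linspace(-1.5, 1.5, 500):
            escape_time(complex(re, im), 200)
    print(f"compute only: {time.perf_counter() - start:.3f} s")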

If you actually want to learn something about CUDA, implementing matrix multiplication is a great exercise. Here are two tutorials:

https://cnugteren.github.io/tutorial/pages/page1.html

https://siboehm.com/articles/22/CUDA-MMM
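For a taste of what those tutorials start from, here is a naive (one thread per output element, no tiling) matrix multiplication in Numba's CUDA dialect. It's a sketch assuming a CUDA-capable GPU, not production code; the tutorials above cover the tiling and memory-coalescing work that makes it fast:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def matmul_naive(A, B, C):
        # Each thread computes one element C[row, col].
        row, col = cuda.grid(2)
        if row < C.shape[0] and col < C.shape[1]:
            acc = 0.0
            for k in range(A.shape[1]):
                acc += A[row, k] * B[k, col]
            C[row, col] = acc

    n = 512
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)

    d_A, d_B = cuda.to_device(A), cuda.to_device(B)
    d_C = cuda.device_array((n, n), dtype=np.float32)

    threads = (16, 16)
    blocks = ((n + threads[0] - 1) // threads[0],
              (n + threads[1] - 1) // threads[1])
    matmul_naive[blocks, threads](d_A, d_B, d_C)

    np.testing.assert_allclose(d_C.copy_to_host(), A @ B, rtol=1e-3)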


>If you actually want to learn something about CUDA, implementing matrix multiplication is a great exercise.

There is SAXPY (scalar-vector math A*X+Y), purportedly ([1]) the hello world of parallel math code.

>SAXPY stands for “Single-Precision A·X Plus Y”. It is a function in the standard Basic Linear Algebra Subroutines (BLAS) library. SAXPY is a combination of scalar multiplication and vector addition, and it’s very simple: it takes as input two vectors of 32-bit floats X and Y with N elements each, and a scalar value A. It multiplies each element X[i] by A and adds the result to Y[i].

[1]: https://developer.nvidia.com/blog/six-ways-saxpy/
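Roughly, in Numba's CUDA dialect (a sketch; the linked post shows the CUDA C version):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def saxpy(a, x, y):
        i = cuda.grid(1)  # global thread index
        if i < x.shape[0]:
            y[i] = a * x[i] + y[i]

    n = 1 << 20
    d_x = cuda.to_device(np.ones(n, dtype=np.float32))
    d_y = cuda.to_device(np.full(n, 2.0, dtype=np.float32))

    threads = 256
    blocks = (n + threads - 1) // threads
    saxpy[blocks, threads](np.float32(3.0), d_x, d_y)
    print(d_y.copy_to_host()[:4])  # expect [5. 5. 5. 5.]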


Thank you for this, comments like yours are exactly why I keep coming back to HN.


Thanks a lot for pointing it out, I have fixed the code and updated the blog.


This article claims to be something every developer must know, but it's a discussion of how GPUs are used in AI. Most developers are not AI developers, nor do they interact with AI or use GPUs directly. Not to mention the fact that this article barely mentions 3D graphics at all, the reason GPUs exist in the first place.


One can benefit from knowing fundamentals of an adjacent field, especially something as broadly applicable as machine learning.

- You might want to use some ML in the project you are assigned next month

- It can help collaborating with someone who tackles that aspect of a project

- Fundamental knowledge helps you understand the "AI" stuff being marketed to your manager

The "I don't need this adjacent field" mentality feels familiar from schools I went to: first I did system administration where my classmates didn't care about programming because they felt like they didn't understand it anyway and they would never need it (scripting, anyone?); then I switched to a software development school where, guess what, the kids couldn't care about networking and they'd never need it anyway. I don't understand it, to me it's both interesting, but more practically: fast-forward five years and the term devops became popular in job ads.

The article is 1500 words at a rough count. Average reading speed is 250wpm, but for studying something, let's assume half of that: 1500/125 = 12 minutes of your time. Perhaps you toy around with it a little, run the code samples, and spend two hours learning. That's not a huge time investment. Assuming this is a good starting guide in the first place.


The objection isn't to the notion that "One can benefit from knowing fundamentals of an adjacent field". It's that this is "The bare minimum every developer must know". That's a much, much stronger claim.

I've come to see this sort of clickbait headline as playing on the prevalence of imposter-syndrome insecurity among devs, and try to ignore them on general principle.


Fair enough! I can kind of see the point that, if every developer knew some basics, it would help them make good decisions about their own projects, even if the answer is "no, this doesn't need ML". On the other hand, you're of course right that if you don't use ML, then it's clearly not something you "must" know to do your job well.


> Most developers are not AI developers

I remember how I joined a startup after working for a traditional embedded shop and a colleague made (friendly) fun of me for not knowing how to use curl to post a JSON request. I learned a lot since then about backend, frontend and infrastructure despite still being an embedded developer. It seems likely that people all around the industry will be in a similar position when it comes to AI in the next years.


Most AI work will just be APIs provided by your cloud provider in less than 2 years. Understanding what's going on under the hood isn't going to be that common, maybe the AI equivalent of "use explain analyze, optimize indexes" will be what passes for (engineering, not scientist) AI expert around that time.


Most things provided by your cloud provider are just slightly modified and pre-packaged versions of software you can run anyway. Postgres on EC2 is a perfectly viable alternative to whatever Amazon offers.


AWS Glue is just Apache Spark but you can’t debug it when it errors because all that is “conveniently” obfuscated from you.

SageMaker is JupyterLab with a GPU attached.

Cognito is just OAuth.

And of course networking is fucked… somehow AWS made it more complicated than the real thing, like they abstracted in the opposite direction.


If so it will create more consumers than creators


What do you think the industry will look like in the near future?


Not to mention their passing example of Mandelbrot set rendering only gets a 10x speedup, despite being the absolute posterchild of FLOPs-limited computation.

Terrible article IMO.


You would expect at least 1000x, and that's probably where it would be if they didn't include JIT compile time in their timing. Mandelbrot sets are a perfect example of a calculation a GPU is good at.


yeah a lot of assumptions were made that are inaccurate.

I agree that most developers are not AI developers... OP seems to be a bit out of touch with the general population and otherwise is assuming the world around them based on their own perception.


I've noticed that every time I see an article claiming that its subject is something "every developer must know", that claim is false. Maybe there are articles which contain information that everyone must know, but all I encounter is clickbait.


Understanding how hardware is used is very beneficial for programmers.

Lots of programmers started with an understanding of what happens physically on the hardware when code runs, and that is an unfair advantage when debugging at times.


> Understanding how hardware is used is very beneficial for programmers

I agree, but to say that all developers must know how AI benefits from GPUs is a different claim. One which is false. I would say most developers don't even understand how the CPU works, let alone modern CPU features like Data/Instruction Caching, SIMD instructions, and Branch prediction.

Most developers I encounter learned Javascript and make websites


Ok you’re entitled to that belief.

AI is also just software that runs on hardware.

GPUs are just hardware


Even worse, it says "GPUs", but isn't CUDA a closed feature limited to Nvidia cards, and maybe even a subset of them?

(I'm not touching Nvidia since they don't provide open source drivers.)


I would have probably opened it if it weren't for the title bait.


And honestly, for most "AI developers" if you are training your own model these days (versus using an already trained one) - you are probably doing it wrong.


Don't worry, you'll either be an AI developer or unemployed within 5 years. This is indeed important for you, regardless if you recognize this yet or not.


Just like crypto right


I think python is dominant in AI, because the python-C relationship mirrors the CPU-GPU relationship.

GPUs are extremely performant, and also very hard to code in, so people just use highly abstracted API calls like pytorch to command the GPU.

C is very performant, and hard to code in, so people just use python as an abstraction layer over C.

It's not clear if people need to understand GPUs that much (unless you are deep in AI training/ops land). In time, since Moore's law has ended and multithreading has become the dominant mode of speed increases, there'll probably be brand new languages dedicated to this new paradigm of parallel programming. Mojo is a start.


I've wondered for a while: is there a space for a (new?) language which invisibly maximises performance, whatever hardware it is run on?

As in, every instruction, from a simple loop of calculations onward, is designed behind the scenes so that it intelligently maximises usage of every available CPU core in parallel, and also farms everything possible out to the GPU?

Has this been done? Is it possible?


Not exactly it but Mojo sounds closest from available options

https://www.modular.com/mojo


There's definitely a space for it. It may even be possible. But if you consider the long history of Lisp discussions (flamewars?) about "a sufficiently smart compiler" and comparisons to C, or maybe Java vs C++, it seems unlikely. At least very, very difficult.

There are little bits of research on algorithm replacement. Like, have the compiler detect that you're trying to sort, and generate the code for quicksort or timsort. It works, kinda. There are a lot of ways to hide a sort in code, and the compiler can't readily find them all.


Not for mixed CPU/GPU, but there is the concept of a superoptimizer, which basically brute-forces the most optimal correct code. But it is not practical beyond very, very short program snippets (and they are usually CPU-only, though there is nothing fundamental preventing one from utilizing the GPU as well).

There is also https://futhark-lang.org/ , though I haven’t tried it, just heard about it.


I'm not sure that's even possible in principle; consider the various anti-performance algorithms of proof-of-waste systems, where every step is data-dependent on the previous one and the table of intermediate results required may be made arbitrarily big.

It's a bit like "design a zip algorithm which can compress any file".


I don’t see why such a “proof of waste” algorithm would be an obstacle to such an optimizer existing. Wouldn’t it just be that for such computational problems, the optimal implementation would still be rather costly? That doesn’t mean the optimizer failed. If it made the program as efficient as possible, for the computational task it implements, then the optimizer has done its job.


I'd imagine it wouldn't be very difficult to build language constructs that are able to denote when high parallelism is desirable; and let the compiler deal with this information as necessary.


I’m not sure if that’s a good idea at the moment, but we should start with making development with vector instructions more approachable. The code should look more or less the same as working with u64s.


You might be interested in https://github.com/HigherOrderCO/HVM


HVM looks very interesting. Thx for posting.


There are many languages doing that more or less. Jax and Mojo for example.


Moore’s law is far from over and multithreading is not the answer. Your opening sentence is spot on tho.


> Moore’s law is far from over and multithreading is not the answer.

Wut? We hit the power wall back in 2004. There was a little bit of optimization around the memory wall and ilp wall afterwards, but really, cores haven't gotten faster since.

It's been all about being able to cram more cores in since then, which implies at least multi-threading, but multi-processing is basically required to get the most out of a cpu these days.


Moore's law is "the observation that the number of transistors in an integrated circuit doubles about every two years". For a while clock speed was a proxy for that metric, but it's not the 'law' itself.


Yeah, but today number of cores is the rough proxy for that metric.

How do you operate in that world if "multithreading isn't the answer"?


Modern CPUs contain a lot more computing units than cores. For a while, hyperthreading was thought to be a useful way to make use of them. More recently, people have turned to advanced instruction sets like SSE and AVX.


Those things aren’t mutually exclusive. Also, I suspect there’s more “low-hanging fruit” in making more software make use of more cores. We’re increasingly getting better languages, tooling and libs for multithreading stuff, and it’s far more in the realm of your average developer than writing SIMD-compatible code and making sure your code can pipeline properly.


Threads are of course appropriate to implement high-level concurrency and parallelism. But for fine-grained parallelism, they are unwieldy and have high overhead.

Spreading an algorithm across multiple threads makes it more difficult for an optimizing compiler to find opportunities for SIMD optimization.

Similarly to how modern languages make it easier to safely use threads, runtimes also make it easier to take advantage of SIMD optimizations. For example, recently a SIMD-optimized sorting algorithm was included in OpenJDK. Apart from that, SIMD is way less brittle at runtime than GPUs and other accelerators.


Care to elaborate?


No idea about the future of the Moore's law precisely. Yet recent research results show that there is still room for faster semiconductors, as discussed on HN [0].

[0]: https://news.ycombinator.com/item?id=38201624


I mean why isn’t multithreading the answer?


Multithreading has the following disadvantages:

* Overhead: the overhead to start and manage multiple threads is considerable in practice. Most multithreaded algorithms are in fact slower than optimized serial implementations when n_threads=1

* Communication: threads have to communicate with each other and synchronize access to shared resources. "Embarrassingly parallel" problems don't require synchronization, but many interesting problems are not of that kind.

* Amdahl's law: there is a point of diminishing returns on parallelizing an application since it quite likely contains parts that are not easily parallelized.
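To put a number on that last point, a tiny plain-Python illustration of Amdahl's law (the 95%-parallel figure is just an example):

    def amdahl_speedup(p, n):
        # p: parallelizable fraction of the work, n: number of threads.
        # Speedup = 1 / ((1 - p) + p / n)
        return 1.0 / ((1.0 - p) + p / n)

    for n in (2, 8, 64, 1024):
        print(f"{n:5d} threads -> {amdahl_speedup(0.95, n):.1f}x")
    # 2 -> 1.9x, 8 -> 5.9x, 64 -> 15.4x, 1024 -> 19.6x; the ceiling is 1/0.05 = 20x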


You forgot Wirth's law: software complexity compensates for the increase in hardware speed.


Well, yes. But assuming the question is how to reduce overall latency, what alternative is there besides algorithmic improvements or increasing clock frequency?


Recently, SIMD has been increasingly used to make use of the increased number of compute units in a CPU. Sure, it's not as easy and straightforward to use as multithreading (I never expected to say that about multithreading), but libraries and programming languages make more and more use of them.

Edit: latency is difficult. But accelerating CPUs and using GPUs for compute was never about latency. Most I/O bottlenecks are because CPUs have sped up so much and left the rest of the platform in the dust. Much of it is also due to fundamental limitations from the speed of light. Increasing throughput is always easier than reducing latency.


I guess I was thinking more of multithreading as an umbrella term for parallelism but I see your point. If you squint, SIMD and multithreading are the same problem. What I mean is, if you have a computable function you want to evaluate on a finite amount of bits, the problem of how to physically layout and schedule the computation to minimize latency is very similar to maximizing throughput.

From a circuit complexity standpoint, you could evaluate a wide but shallow circuit to evaluate the function in a nanosecond, or a deep but narrow circuit that takes eons. Whether parts of that circuit are evaluated synchronously or asynchronously is immaterial, although synchronous computation does seem easier to reason about and the UX is more user-friendly from a programmability standpoint.

I agree with you the fundamental limitation is the speed of light if you are width-bounded (i.e., if physical space is the dominating constraint).


GPUs aren't that difficult to program. CUDA is fairly straightforward for many tasks and in many cases there is an easy 100x improvement in processing speed just sitting there to be had with <100 lines of code.


Sure, but if you've never used CUDA or any other GPU framework, how many lines of documentation do you need to read, and how many lines of code are you likely to write and rewrite and delete before you end up with those <100 lines of working code?


They have nice getting started guides. Try it and see. It's... really pretty simple. There is a reason they've built a trillion $$ company - they've done a great job.


Ok, I tried it (for as long as my limited free time and interest in CUDA allows). The closest I came to a getting started guide is this [0], which by my (perhaps naive) count is 25561 lines of documentation, and I would probably need to learn more C++ to understand it in detail.

I'm sure CUDA is great, and if I had more free time and/or better reasons to improve the performance of my code it would probably be great for me. My point was mainly that a few lines of code which may be trivial for one person to write may not be for someone else with different experience. Depending on what the code is being used for even a vast increase in performance may not be worth the extra time it takes to implement it.

[0] https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....



I haven’t seen it but I can believe it - parallel programming of any variety is hard, and to be successful as a vendor of such a system would require uncommonly good API design to get the kind of traction that CUDA has gotten.


> Sure, but if you've never used CUDA or any other GPU framework, how many lines of documentation do you need to read, and how many lines of code are you likely to write and rewrite and delete before you end up with those <100 lines of working code?

If you're already familiar with one of the languages that the nvidia compiler supports? Not that many. For people familiar with C or C++, it's a couple extra attributes, and a slightly different syntax for launching kernels vs a regular function call. I'm admittedly not experienced with Fortran, which is the other language they support, so I can't speak to that. There are C-style memory allocation functions, which might be annoying to C++ devs, but it's nothing that would confuse them.

Edit: There's also a couple weird magic globals you have access to in a kernel (blockIdx, blockDim, threadIdx), but those are generally covered in the intros.
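For anyone following along from Python rather than C or C++: Numba's CUDA dialect exposes the same magic globals and a similar launch syntax. A rough sketch (my example, not from the article):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def double_elements(arr):
        # The same "magic globals", spelled the Numba way.
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        if i < arr.shape[0]:
            arr[i] *= 2.0

    d_arr = cuda.to_device(np.arange(1024, dtype=np.float32))
    double_elements[4, 256](d_arr)   # 4 blocks of 256 threads, cf. kernel<<<4, 256>>>(...) in CUDA C
    print(d_arr.copy_to_host()[:3])  # [0. 2. 4.]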


My experience is different. I had to code various instances of "parallel reduction" and "prefix sum" and it's not easy to get into (took me a day or two). Moreover, coming from an age where 640KB of RAM were considered quite enough, truly realizing the power of the GPU was not easy because my tasks are not quite parallel and not quite thread-coherent (I grant you that doing naturally parallel stuff is dead simple). It took me a while and a lot of nvidia-nsight to max out the GPU... Moreover, I was a bit slow to actually understand how powerful a GPU is (for example, my GPU gave poor performance unless I gave it a big enough problem, so I was (wrongly) disappointed when testing toy problems).

But once that challenge is overcome, the GPU truly rocks.

Finally, debugging complex shaders (I do a specific kind of computational fluid dynamics where the equations are not that easy, full of "if/then" edge cases, etc.) is not fun at all; tooling is sorely missed (unless I've missed something).


This is fair, and if you've got the time and inclination, I'd love to hear about your experience and the tricks you ended up pulling. There are definitely advanced areas of CUDA, and you can go deeper on nearly anything.

But we are in a comment chain spawned by:

> CUDA is fairly straightforward for many tasks and in many cases there is an easy 100x improvement in processing speed just sitting there to be had with <100 lines of code.

And a follow up comment about how easy it would be to write that "<100 lines of code", so I feel like we're definitely talking about the easy case of naturally parallel calculations, and sticking to that as an intro seems fair to me.


There are libraries that help with more tricky stuff on-device like cub or cuFFTDx.


Sure, but there's value in understanding how to do it yourself, even if you don't use it at work.

And there's also value in seeing how other people approached a problem.


Parallel reduction and prefix sums are exercises to train people how to reformulate algorithms for the GPU and how to identify performance bottlenecks specific to GPUs. In practice, you'd use library functions.
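For instance, in the Numba world a sum reduction is a one-liner rather than a hand-written shared-memory kernel. A sketch (assuming a CUDA-capable device):

    import numpy as np
    from numba import cuda

    @cuda.reduce
    def sum_reduce(a, b):
        # Binary op; Numba generates the tree-reduction kernel.
        return a + b

    data = np.arange(1_000_000, dtype=np.float64)
    print(sum_reduce(data), data.sum())  # the two results should agree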


You can get C performance (I'd argue that C's lack of abstraction makes slow but simple code more appealing) with pythonic expressiveness pretty easily with a more modern language.


> C is very performant, and hard to code in, so people just use python as a abstraction layer over C.

C is a way of life. Those of us who code exclusively, or nearly so, in C cannot stomach python's notion of "significant white-space."


I code in both (c for hobbies and python professionally) and “significant white space” is a non-issue if you spend any amount of time getting used to it.

Complaining about significant white-space is like complaining that lisp has too many parentheses. It’s an aesthetic preference that just doesn’t matter in practice.


A form of Sayre's Law is very common in tech (eg spaces vs tabs; framework vs framework; language vs language).


I actually find python and C very similar in spirit.

Syntax is mostly an irrelevance, they have surprisingly similar patterns in my opinion.

In a modern language I want a type system that both reduces risk and reduces typing — safety and metaprogramming. C obviously doesn't, python doesn't really either.

Python's approach to dynamic-ness is very similar to how I'd expect C to be as a dynamic language (if it had proper arrays/lists).


You get used to the significant whitespace. (C programmer since ~1978).


I never did, and it's one of the things I hate most about Python to this day. I still use Python because it's the best tool a lot of the time, but it's such a terrible language decision to have significant whitespace imo.


> cannot stomach python's notion of "significant white-space."

Why belly ache about it? Whitespace is significant to one’s fellow humans.


Precisely why it should be of no significance to the machine.


Source code is not for the machine to read, it’s for your fellow humans


That doesn't mean that source code has no meaning to the interpreter.


All programming languages, including C, have significant white space. Python just has slightly more.


Wait till you start using the black formatter tool.

Well-known for supporting any formatting style you like ;)


> When faced with multiple tasks, a CPU allocates its resources to address each task one after the other

Ha! I wish CPUs were still that simple.

Granted, it is legitimate for the article to focus on the programming model. But "CPUs execute instructions sequentially" is basically wrong if you talk about performance. (There are pipelines executing instructions in parallel, there is SIMD, and multiple cores can work on the same problem.)


I think this post focused on the wrong things here. CPUs with AVX-512 also have massive data parallelism, and CPUs can execute many instructions at the same time. The big difference is that CPUs spend a lot of their silicon and power handling control flow to execute one thread efficiently, while GPUs spend that silicon on more compute units and hide control flow and memory latency by executing a lot of threads.


It will do multiple SIMD instructions at the same time, too.


"CPUs are good at serial code and GPUs are good at parallel code" is kind of true, but something of an approximation. Assume an equivalent power budget, in the roughly hundreds-of-watts range; then:

A CPU has ~100 "cores", each running one (and a hyperthread) independent thing, and it hides memory latency with branch prediction and pipelining.

A GPU has ~100 "compute units", each running ~80 independent things interleaved, and it hides memory latency by executing the next instruction from one of the other 80 things.

Terminology is a bit of a mess, and the CPU probably has a 256bit wide vector unit while the GPU probably has a 2048bit wide vector unit, but from a short distance the two architectures look rather similar.


GPU has 10x the memory bandwidth of the CPU though, which becomes relevant for the LLMs where you essentially have to read the whole memory (if you're batching optimally, that is using all the memory either for weights or for KV cache) to produce one token of output.


GPUs also have 10x-100x FP/INT8 throughput watt-for-watt.


GPU also has 10x memory latency compared to CPU.

And memory access order is much more important than on a CPU. Truly random access has very bad performance.


I’m always surprised there isn’t a movement toward pairing a few low-latency cores with a large number of high-throughput cores. Surround a single Intel P core with a bunch of E cores. Then, hanging off the E cores, stick a bunch of iGPU cores and/or AVX-512 units.

Call it Xeon Chi.


I think one possible reason for that is that, ideally, these things need different kinds of memory.

If you use high-bandwidth, high-latency GDDR memory, CPU cores will underperform due to the high latency, as here: https://www.tomshardware.com/reviews/amd-4700s-desktop-kit-r...

If you use low-latency memory, GPU cores will underperform due to low bandwidth; see modern AMD APUs with many RDNA3 cores connected to DDR5 memory. On paper, the Radeon 780M delivers up to 9 FP32 TFLOPS, a figure close to the desktop Radeon RX 6700, which is substantially faster in gaming.


Hmm, that is a good point. Since it is a dream-computer anyway, maybe we can do 2.5d packaging; put the ddr memory right on top so the P cores can reach it quickly, then surround the whole thing with GDDR.


Neat idea, probably even viable!

I think they may have a hurdle of getting folks to buy into the concept though.

I imagine it would be analogous to how Arria FPGA’s were included with certain Xeon CPU’s. Which further backs up your point that this could happen in the near future!


You mean like an iGPU?

Edit: Oh, thanks for the downvote, with no discussion of the question. I'll just sit here quietly with my commercial OpenCL software that happily exploits these vector units attached to the normal CPU cores.


I’m not sure who downvoted; I think it isn’t possible to downvote a response to one’s comment.

I did decide not to engage because “you mean like <very common well known thing>” seemed a bit brusque and dismissive, commenting on this site is just for fun, so I don’t really see the point in continuing a conversation that seems like it is getting off on the wrong foot.


Nx / Axon

Given that most programming languages are designed for sequential processing (like CPUs), while Erlang/Elixir is designed for parallelism (like GPUs) … I really wonder if Nx / Axon (Elixir) will take off.

https://github.com/elixir-nx/


Erlang was designed for distributed systems with a lot of concurrency, not for computation-heavy parallelism.


I am really wondering how well Elixir with Nx would perform for computation heavy workloads on a HPC cluster. Architecturally, it isn't that dissimilar to MPI, which is often used in that field. It should be a lot more accessible though, like numpy and the entire scientific python stack.


I've been investigating this and I wonder if the combination of Elixir and Nx/Axon might be a good fit for architectures like NVIDIA Grace Hopper where there is a mix of CPU and GPU.


Would that run on a GPU? I think the future is having both. Sequential programming is still the best abstraction for most tasks that don’t require immense parallel execution


Axon runs compute graphs on the GPU, but Elixir's parallelism abstractions run on the CPU.


I need a buyers guide: what's the minimum to spend, and best at a few budget tiers? Unfortunately that info changes occasionally and I'm not sure if there's any resource that keeps on top of things.



Google Colab, Kaggle Notebooks and Paperspace Notebooks all offer free GPU usage (within limits), so you do not need to spend anything to learn GPU programming.

https://colab.google/

https://www.kaggle.com/docs/notebooks

https://www.paperspace.com/gradient/free-gpu


For learning basics of GPU programming your iGPU will do fine. Actual real-world applications are very varied of course.


You can also rent compute online if you don’t want to immediately plop down 1-2k.


We’re back to “every developer must know” clickbait articles?


Although I think they'll be replaced by ChatGPT, a good article in that style is actually quite valuable.

I like attacking complexity head on, and have a good knowledge of both quantitative methods & qualitative details of (say) computer hardware so having an article that can tell me the nitty gritty details of a field is appreciated.

Take "What every programmer should know about memory" — should every programmer know? Perhaps not, but every good programmer should at least have an appreciation of how a computer actually works. This pays dividends everywhere — locality (the main idea that you should take away from that article) is fast, easy to follow, and usually a result of good code that fits a problem well.


it seems so... Should take this article's statements with a grain of salt.


Good read. However, the AWS P5 instance (along with P4d and P4de) is most certainly oriented towards training, not inference. The most inference-friendly instance types are the G4dn and the G5, which feature T4 and A10G GPUs, respectively.


Came here to say this author forgot G5.


I am very new to GPU programming in general and this article was a fun read. It's amazing how far we've come, prime example being able to train a simple "dog or cat" NN that easily.


Are AMD GPUs still to be avoided, or are they workable at this point?


The cuda happy path is very polished and works reliably. The amdgpu happy path fights you a little but basically works. I think the amd libraries starting to be packaged under Linux is a big deal.

If you don't want to follow the happy path, on Nvidia you get to beg them to maybe support your use case in future. On amdgpu, you get the option to build it yourself, where almost all the pieces are open source and pliable. The driver ships in Linux. The userspace is on GitHub. It's only GPU firmware which is an opaque blob at present, and that's arguably equivalent to not being able to easily modify the silicon.


AMD GPUs work great, the issue is that people don't want to mess with ROCm/HIP when CUDA is kind of the documented workflow. Along with the fact that ROCm was stagnant for a long time. AMD missed the first AI wave, but are now committed to making ROCm into the best it can be.

The other problem is that there aren't any places to rent the high end AMD AI/ML GPUs, like the MI250's and soon to be released MI300's. They are only available on things like the Frontier super computer, which few developers have access to. "regular" developers are stuck without easy access to this equipment.

I'm working on the latter problem. I'd like to create more of a flywheel effect. Get more developers interested in AMD by enabling them to inexpensively rent and do development on them, which will create more demand. @gmail if you'd like to be an early adopter.


We have compilers (languages) like Futhark that aim to optimize explicitly parallel operations. And universal models of computation like interaction nets that are inherently parallel.

Am I lazy to expect we’ll be getting a lot more “parallel-on-the-GPU-for-free” in the future?


You can already convert a compute graph to GPU-optimized code using something like Aesara (formerly known as Theano) or TensorFlow. There are also efforts in the systems space that ought to make this kind of thing more widespread in the future, such as the MLIR backend for LLVM.


The Mandelbrot example seems to make interpreted Python stand in for "the CPU performance"?

If that's true, then I'm surprised they only see a 10x speed-up. I would expect more from only compiling that loop for the CPU. (Comparing to interpreted Python without numpy.) Given they already have a numba version, why not compile it for the CPU and compare?

Also, they say consumer CPUs have 2-16 cores. (Who has 2 cores these days?) They go on to suggest renting an AWS GPU for $3 per hour. You're more likely to get 128 cores for that price, still on a single VM.

Not saying it will be easy to write multi-threaded code for the CPU. But if you're lucky, the Python library you're using already does it.
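For reference, the CPU baseline I'd want to see looks roughly like this (a sketch with made-up grid sizes, not the article's code); Numba can spread the row loop across all cores with prange:

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def mandelbrot_cpu(width, height, max_iter):
        out = np.zeros((height, width), dtype=np.int32)
        for row in prange(height):  # rows are split across CPU cores
            for col in range(width):
                c = complex(-2.0 + 3.0 * col / width, -1.5 + 3.0 * row / height)
                z = 0j
                count = 0
                while abs(z) <= 2.0 and count < max_iter:
                    z = z * z + c
                    count += 1
                out[row, col] = count
        return out

    mandelbrot_cpu(64, 64, 10)  # warm-up so compile time isn't measured
    # then time mandelbrot_cpu(2000, 2000, 200) against the GPU version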


> Also, they say consumer CPUs have 2-16 cores. (Who has 2 cores these days?)

Pretty sure my mom's laptop has 2 cores; I can't think of anyone whose daily driver has 16 cores. Real cores, not hyperthread stuff running at 0.3× the performance of a real core.

As for the 128-core server system, note that those cores are typically about as powerful as a 2008 notebook. My decade-old laptop CPU outperforms what you get at DigitalOcean today, and storage performance is a similar story. The sheer number makes up for it, of course, but "number of cores" is not a 1:1 comparable metric.

Agree, though, that the 10x speedup seems low. Perhaps, at 0.4s, a relatively large fraction of that time is spent on initializing the Python runtime (`time python3 -c 'print("1337")'` = 60ms), the module they import that needs to do device discovery, etc.? Hashcat, for example, takes like 15 seconds to get started even if it then runs very fast after that.


My 2016 desktop had 22 cores and 44 threads. You can have the same processor for < $200 on ebay right now.


Cool, now I know one person :D

I don't think the price matters btw. If 4 cores is all I need (I can't think of a desktop application that I use which benefits from more than 4 cores more than from faster cores, which is typically the trade-off at >4 real cores) then that's what I'd get because that's the optimum for my wallet and performance profile.


Another tip that took me longer than I wished to figure out.

Use CUDA rather than graphics APIs + compute. The latter (Vulkan compute, etc.) is high friction. CUDA is far easier to write in, and the resulting code is easier to maintain and update.


Yes, and in the process, contribute to Nvidia’s monopoly.


Very few professional artists refuse to use Photoshop or After Effects because it will "contribute to Adobe's monopoly".

But for some reason professional programmers are judged under a much higher moral standard.


I think because professional artists can’t typically make their software tools. Whereas engineers could in theory make their own tools. Naturally few do in practice though as tech has become far too large and specialized. But our roots are where our values and ideals come from.


That is a really theoretical point.

If I start to work on a tool, then I cannot work anymore on what I actually wanted to do. And it just so happens that this is exactly what I did, and I can say it usually takes way longer than the most pessimistic estimate one can come up with. So yes, one can decide to switch careers and try to get funding to (re)build what is not offered under acceptable conditions (though in my case the tool simply did not exist).

Just like an artist can switch careers, study CS, build on his own a tool that a professional company built with a team over years, and then someday use his tool to accomplish his original work. In (simplified) theory, lots of things are possible...


> But for some reasons professional programmers are judged under a much higher moral standard

Not in the real world. Most programmers who are trying to get a job done won’t avoid CUDA or AWS or other tools just to avoid “contributing to a monopoly”. When responsible programmers have a job to do and tools are available to help with the job, they get used.

A programmer who avoids mainstream tools on principle is liable to get surpassed by their peers very quickly. I’ve only met a few people like this in industry and they didn’t last very long trying to do everything the hard way just to avoid tools from corporations or monopolies or open source that wasn’t pure enough for their standards.

It’s only really in internet comment sections that people push ideological purity like this.


The same attitude brought us the adoption of Linux. So IDK.


Most lottery tickets aren't winners.


Tool choice of artists has close to 0 impact on people interacting with final work. Choices made by programmers are amplified through the users of produced software.


> But for some reason professional programmers are judged under a much higher moral standard.

I believe the key word there is "professional" -- one of the challenges of a venue like HN is the professional engineers and the less-professional ones interact from worldviews and use cases so distinct that they may as well be separate universes. In other spaces, we wouldn't let a top doctor have to explain very basic concepts about the commercial practice of medicine to an amateur "skeptic" and yet so many discussions on HN degenerate along just these lines.

On the other hand, it's that very same inclusiveness and generally high discourse in spite of that wide expanse which make HN such a special community, so I'm not sure what to conclude besides this unfortunate characteristic being a necessary "feature, not a bug" of the community. There's no way around it that wouldn't make the community a lesser place, I think.


And they ended up with Creative Cloud bloatware.


It isn't the consumer's responsibility to regulate the market.


And, in this case NVIDIA earned it. They built a very useful software layer around their chips.


Then whose responsibility is it? Corporations? The government? Or maybe the tooth fairy?


That is a weak argument that could be used to justify tons of behavior, very convenient.

Vote with your feet. Maybe you can't or can't afford it, then at least admit the problem to yourself and maybe don't try to persuade others in order to feel better for your own decision.


If consumers don't care about their money, then who would?


Eh, CUDA can mostly be transformed to HIP, unless you use specialized NVIDIA stuff.


Is there something that is not vendor specific? Maybe a parallel programming language that compiles to different targets?

..and doesn't suck.


The part that I don't understand is why AMD/Intel/somebody else don't just implement at least the base CUDA for their products.

HIP is basically that, but they still make you jump through hoops to rename everything etc.

There are libraries written at a lower level that wouldn't be immediately portable, but surely that could be addressed over time as well.


So sayeth the person who has never written OpenCL.


I'm with you, and am surprised there isn't comparable competition.


I wish someone would make the latter much easier. As someone who got interested in this AI stuff lately (and I'm definitely not the only one), I have a high-tier AMD card that could surely handle it, and I'd like to run things locally for various reasons.

Currently I've given up and use runpod, but still...


I agree cuda is really nice to write in, but what reason do you have to write raw cuda I’m curious? Usually I find that it’s pre written kernels you deal with


Currently doing computational chemistry, but per the article, it's a fundamental part of my toolkit going forward; I think it will be a useful tool for many applications.


What about on macOS? Is OpenCL viable?


A great beginner guide to GPU programming concepts:

https://github.com/srush/GPU-Puzzles


I don’t think transformer models generate multiple tokens in parallel (how could they?)

They just leverage parallelism in making a single prediction


Transformers tend to be trained in parallel. BERT = 512 tokens per context, in parallel. GPT too is trained while feeding in multiple words in parallel. This enables us to build larger models. Older models, such as RNNs, couldn't be trained this way, limiting their power/quality.


This is only sort of true, since you can still train RNNs (including LSTM, etc.) in big batches-- which is usually plenty enough to make use of your GPU's parallel capabilities. The inherently serial part only applies to the length of your context. Transformer architectures thus happen to be helpful if you have lots of idle GPU's such that you're actually constrained by not being able to parallelize along the context dimension.


In RNNs, hidden states have to be computed sequentially; in transformers with the attention mechanism, we break free of the sequential requirement. Transformers are more amenable to parallelism and make use of GPUs the most (within the context axis and outside it).


Ahh, that makes a lot of sense


A quick search uncovers [0] with a hint towards an answer: just train the model to output multiple tokens at once.

[0] https://arxiv.org/abs/2111.12701


Shouldn't the article mention SIMD? I haven't seen it even being brought up.


Can I do this in my GeForce GTX 970 4GB?


> AWS GPU Instances: A Beginner's Guide [...] Here are the different types of AWS GPU instances and their use cases

The section goes on to teach Amazon-specific terminology and products.

A "bare minimum everyone must know" guide should not include vendor-specific guidance. I had this in school with Microsoft already, with never a mention of Linux because they already paid for Windows Server licenses for all of us...

Edit: and speaking of inclusivity, the screenshots-of-text have their alt text set to "Alt text". Very useful. It doesn't need to be verbatim copies, but it could at least summarize in a few words what you're meant to get from the terminal screenshot to help people that use screen readers.

Since this comment floated to the top, I want to also say that I didn't mean for this to dominate the conversation! The guide may not be perfect, but it helped me by showing how to run arbitrary code on my GPU. A few years ago I also looked into it, but came away thinking it's dark magic that I can't make use of. The practical examples in both high- and low-level languages are useful

Another edit: cool, this comment went from all the way at the top to all the way at the bottom, without losing a single vote. I agree it shouldn't be the very top thing, but this moderation also feels weird


Agreed. This isn’t actually that useful of a guide in the first place.

Tbh the most basic question is: “are you innovating inside the AI box or outside the AI box?”

If inside - this guide doesn’t really share anything practical. Like, if you’re going to be tinkering with a core algorithm and trying to optimize it, understanding BLAS and cuBLAS (or whatever the AMD / Apple / Google equivalent is), then understanding what pandas, torch, numpy and a variety of other tools are doing for you, and then being able to wield these effectively makes more sense.

If outside the box - understanding how to spot the signs of inefficient use of resources - whether that’s network, storage, accelerator, CPU, or memory - and then reasoning through how to reduce that bottleneck.

Like - I’m certain we will see this in the near future, but off the top of my head, the innocent but incorrect things people do:

1. Sending single requests, instead of batching

2. Using a synchronous programming model when asynchronous is probably better

3. Sending data across a compute boundary unnecessarily

4. Sending too much data

5. Assuming all accelerators are the same. That T4 GPU is cheaper than an H100 for a reason.

6. Ignoring bandwidth limitations

7. Ignoring access patterns


Are there any surveys of just how many Windows Servers boxes exist?

Even when I was working at an Azure-only shop, I've never actually seen anyone use Windows Server. Lots of CentOS (before IBM ruined it) and other Unixes, but never a Windows Server.


We come across them all the time when doing internal network pentests (most organizations use AD for managing their fleet of end-user systems), and occasionally external tests as well. Stackoverflow is a site that comes to mind as being known for running their production systems on Windows Server.

It's useful to have experienced, but I do take issue with exclusively (or primarily) focusing on one ecosystem as a mostly-publicly-funded school.


Huh, TIL StackOverflow is on Windows Server


In this AI age, it is crucial for developers to have a fundamental understanding of GPUs and their application to AI development.


Crucial, for all developers? The great majority will get AI through an API.


After a while there will come a time when you will be the one providing the API to others; for that, yes, it's crucial I guess.



