
I'm probably missing something but why not use gpus for parallel processing?



GPUs work on massive amounts of data in parallel, but they execute basically the same operations every step, maybe skipping or slightly varying some steps depending on the data seen by a particular processing unit. But processing units cannot execute independent streams of instructions.

GPUs of course have several parts that can work in parallel, but they are few, and each part consists of a large number of units that execute the same instruction stream simultaneously over a large chunk of data.
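
To make that concrete, here's a minimal CUDA sketch (a hypothetical kernel, not anything from the thread): every thread in a warp steps through the same instruction stream, and a data-dependent branch doesn't give lanes independent streams, it just masks some of them off while both sides are executed.

    // Minimal sketch: one warp, one instruction stream; a branch only masks lanes.
    __global__ void branchy(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (in[i] > 0.0f)
                out[i] = in[i] * 2.0f;   // lanes where the condition is false sit idle here
            else
                out[i] = 0.0f;           // and the other lanes sit idle here
        }
    }
    // launched e.g. as: branchy<<<(n + 255) / 256, 256>>>(d_in, d_out, n);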


This is not true. Take the Nvidia 4090: it has 128 SMs, each with 4 SMSPs, so 4 x 128 = 512 SMSPs. This is the number of warps which can execute independently of each other. In contrast, a warp is a 32-wide vector, i.e. 32 "same operations", with up to 512 different batches in parallel. So, it's more like a 512-core 1024-bit vector processor.

That being said, I believe the typical number of warps to saturate an SM is normally around 6 rather than 4, so more like 768 concurrent 32-wide "different" operations to saturate compute. Of course, the issue with that is you get into overhead problems and memory bandwidth issues, both of which are very difficult to navigate around -- the register file storing all the registers of each thread is extremely power-hungry (in fact, the most power-hungry part, I believe), for example.

A PPU with a smaller vector width (e.g. AVX512) would have proportionally more overhead (possibly more than linearly so in terms of circuit design). And that's without talking about how most programs depend on latency-optimized RAM (rather than bandwidth-optimized GDDR/HBM).
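
For what it's worth, the per-GPU numbers being discussed here (SM count, warp width, register file size) can be read at runtime with the standard CUDA device-property query. A rough sketch; the values in the comments are expectations for a 4090, not guarantees from this code:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);
        printf("SMs:                %d\n", p.multiProcessorCount);         // 128 on a 4090
        printf("warp size:          %d\n", p.warpSize);                    // 32 lanes
        printf("max threads per SM: %d\n", p.maxThreadsPerMultiProcessor);
        printf("32-bit regs per SM: %d\n", p.regsPerMultiprocessor);       // the big, power-hungry register file
        return 0;
    }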


I'm happy to stand corrected; apparently my idea of GPUs has become obsolete by now.


The Nvidia 4090 indeed has 128 SMs, but the formula you provided (128 SMs = 4x128=512 SMSPs) isn't quite accurate. Each SM contains 64 CUDA cores (not SMSPs), and these are the units responsible for executing the instructions from different warps. The term "SMSP" isn't typically used to describe CUDA cores or warps in Nvidia’s architecture.


winwang’s comment is correct, yours is wrong.

“cuda core” refers to one lane within the SIMT/SIMD ALUs. These lanes within an SMSP don’t execute independently.

The term SMSP is definitely used for nvidia’s architecture:

https://docs.nvidia.com/nsight-compute/ProfilingGuide/index....

> smsp

> Each SM is partitioned into four processing blocks, called SM sub partitions. The SM sub partitions are the primary processing elements on the SM.

(Note that this kernel profiling guide doesn’t use the term “cuda cores” at all)

Also there are 128 “cuda cores” per SM in the 4090, not 64: https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvid...

> Each SM in AD10x GPUs contain 128 CUDA cores
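
Putting the two corrections together, the arithmetic is consistent. A sketch using only the figures quoted above (the 16384 total matches Nvidia's published 4090 spec):

    constexpr int sms             = 128;                              // SMs on AD102 / 4090
    constexpr int smsps_per_sm    = 4;                                // SM sub partitions per SM
    constexpr int lanes_per_smsp  = 32;                               // one "CUDA core" per SIMT lane
    constexpr int warp_schedulers = sms * smsps_per_sm;               // 512
    constexpr int cuda_cores      = warp_schedulers * lanes_per_smsp; // 16384 total
    static_assert(smsps_per_sm * lanes_per_smsp == 128, "128 CUDA cores per SM");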


> So, it's more like a 512-core 1024-bit vector processor.

I’m disappointed nobody tried to make a computer with only GPU-like cores.

We already know it’s possible: the initial bring-up of a Raspberry Pi is done by the GPU, before the ARM cores are started, so, for that brief moment, there is a GPU doing some very CPU-like things.


I mean, you can definitely do CPU-like things on the GPU. Just... why would you make a bandwidth-optimized processor your primary processor, when most programs rely on low-latency single-thread perf? In the case of parallelism, they are almost never vector-parallel (i.e. SIMD). Also, most modern processors have vector support, with AMD typically supporting AVX512 (already half the width of a GPU core), though I believe AVX is not quite as advanced(?) in terms of hardware intrinsics. Vector instructions are also not free, typically requiring the core to downclock due to power/heat.

And then there's all the supportive smartness of the CPU, like a relatively large L1/L2 cache per core, huge branch predictors, etc. What I'm trying to get at is that GPUs aren't just dumb parallel CPUs, but rather represent a significant divergence from typical CPU architecture.

But if you're interested in CPU-like GPU stuff, the term is GPGPU -- and I'm very interested in it myself.


GPUs have evolved to match their workloads, and, therefore, became large collections of wide SIMD devices tailored to run relatively simple computations repeatedly across long vectors.

What if we turned the idea upside down and allowed them to be more complicated? Branches don’t work well with SIMD, but, when you run a SIMD processor over a vector of length one, you can start thinking of code with lots of branches, branch prediction, and of using the rest of the SIMD unit for speculative execution or superscalar operations. Now you have something that has characteristics of both, and a small piece of the GPU could be running scalar code efficiently. The memory system would need to be different, with perhaps lots of scratchpad memory on very wide buses with high throughput and high latency for the GPU-like workloads, and a lower-throughput, lower-latency system for the more branchy workloads. Explicit cache management seems to be a necessity here.

Not sure what it’d look like if someone implemented such a thing, but I’d love to see it.


Because GPUs are physically built to manage parallel tasks, but only a few kinds

They are very specialized

CPUs are generic; they have lots of transistors to handle a lot of different instructions


I think this is a really unfortunate way to explain it. The issue is not that CPUs have a lot of different instructions - hardly anyone uses decimal math instructions, for instance, and no one cares about baroque complex addressing modes.

The difference is that GPU code is designed to tolerate latency by having lots of loop iterations treated as threads. A modern CPU tolerates latency by maintaining the readiness of hundreds of individual instructions ("in flight") - essentially focusing on minimizing the execution latency of each instruction. (Which also explains why CPUs use caches and very high clocks, but wind up with somewhat fewer cores and threads.)

(note that I'm using cores and threads correctly here, not the nvidia way.)
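
As an illustration of the "lots of loop iterations treated as threads" point, here's a hedged CUDA sketch (hypothetical kernel and launch parameters): each iteration's long-latency load is hidden by the SM switching among its many resident warps, rather than by an out-of-order core keeping hundreds of instructions in flight.

    // Grid-stride loop: the iterations of the serial version become threads,
    // and the SM hides memory latency by swapping among resident warps.
    __global__ void scale(const float *in, float *out, size_t n, float a)
    {
        size_t stride = (size_t)gridDim.x * blockDim.x;
        for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            out[i] = a * in[i];   // one long-latency load + one store per iteration
    }
    // e.g. scale<<<1024, 256>>>(d_in, d_out, n, 2.0f);  // oversubscribe so every SM
    //                                                   // has plenty of warps in flight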


Also moving data to and from the GPU takes MUCH more time than between CPU cores (though combined chips drastically lower this difference).


in the olden days of gpgpu this was certainly true.

is it still true? do you just mean "latency overhead for setting up a single PCIe transaction is much larger than flinging a cache line across QPI/etc"?
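
One way to check on a given machine is to just time a transfer. A hedged sketch (the buffer size and the pinned-memory choice are arbitrary here, and the answer depends heavily on PCIe generation and whether the GPU is integrated):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        const size_t bytes = 64 << 20;                 // 64 MiB
        float *h = nullptr, *d = nullptr;
        cudaMallocHost((void **)&h, bytes);            // pinned host buffer
        cudaMalloc((void **)&d, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("H2D: %.3f ms  (%.1f GB/s)\n", ms, (bytes / 1e9) / (ms / 1e3));
        return 0;
    }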



