"And we ask: if your matrix multiply is smaller than 16x16, are you sure what you’re doing is AI?
From a philosophical point of view, we think a frame shift is in order. A “register” certainly shouldn’t be a 32-bit word like on the CPUs of old. And a 1024-bit wide vector register, as CUDA uses, is certainly a step in the right direction. But to us a “register” is a 16x16 tile of data. We think AI wants this."
The hardware needs of AI are starting to come into focus. GPUs, after all, were designed for an entirely different job; they're used for AI because they happen to have good matrix multiply hardware. "AI GPUs" get to leave out some of the stuff in a real GPU (does an H100 even have texture fill units?). Then there's a trend towards much narrower number formats. 16-bit floating point? 8 bit? 2 bit? 1 bit? That will settle out at some point. This paper indicates that hardware that likes 16x16 tiles makes a lot of sense. It's certainly possible to build such hardware. Someone reading this is probably writing it in VHDL right now, or will be soon.
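To make the 16x16-tile framing concrete, here's a minimal NumPy sketch (sizes chosen arbitrarily) of a matmul decomposed into 16x16 tile products, which is the unit of work a tensor-core-style MMA instruction consumes:

```python
import numpy as np

T = 16                                    # the "register" is a 16x16 tile
N = 64
A = np.random.randn(N, N).astype(np.float32)
B = np.random.randn(N, N).astype(np.float32)
C = np.zeros((N, N), dtype=np.float32)

for i in range(0, N, T):
    for j in range(0, N, T):
        for k in range(0, N, T):
            # one tile x tile multiply-accumulate: what a tensor-core MMA instruction does
            C[i:i+T, j:j+T] += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]

assert np.allclose(C, A @ B, atol=1e-3)   # same result as the untiled product
```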
Then we'll see somewhat simpler, less general, and cheaper devices that do "AI" operations with as little excess hardware baggage as possible. Nice.
GPUs have evolved to be AI machines with as little baggage as possible. People have been arguing GPUs were old technology and therefore unsuited for AI since at least 2014 (when Nervana was founded), but what they perhaps didn’t expect is that the GPU would evolve so quickly to be an AI machine.
Bill Dally from Nvidia argues that there is "no gain in building a specialized accelerator", in part because the current overhead on top of the arithmetic is in the ballpark of 20% (16% for IMMA and 22% for HMMA units):
https://www.youtube.com/watch?v=gofI47kfD28
There does seem to be a somewhat obvious advantage: if all it has to do is matrix multiplication, and not every other thing a general-purpose GPU has to be good at, then it costs less to design. So now someone other than Nvidia or AMD can do it, and then very easily distinguish themselves by just sticking a ton of VRAM on it. That much VRAM is currently reserved for GPUs that are extraordinarily expensive, even though the extra VRAM doesn't cost a fraction of the price difference between those and an ordinary consumer GPU.
And, sure enough, there's a new AI chip from Intellifusion in China that's supposed to be 90% cheaper. 48 TOPS in int8 training performance for US$140.[1]
I wonder what the cost of power to run these chips is. If the power cost ends up being large compared to the hardware cost, it could make sense to buy more chips and run them when power is cheap. They could become a large source of dispatchable demand.
Int8 training has very few applications, and int8 ops generally are very easy to implement. Int8 is a decent inference format, but supposedly doesn't work well for LLMs that need a wide dynamic range.
There are other operations, for things like normalization in training, which is why most successful custom stuff has focused on inference, I think. As architectures changed and needed various different things, some custom-built training hardware got obsoleted; Keller talked about that affecting Tesla's Dojo and making it less viable (they bought a huge Nvidia cluster after it was up). I don't know if the TPU ran into this, or if they made enough iterations fast enough to keep adding what they needed as they needed it.
Someone does, in fact, have to implement everything underneath that `import` call, and that work is _very_ hard to do for things that don't closely match Nvidia's SIMT architecture. There's a reason people don't like using dataflow architectures, even though from a pure hardware PoV they're very powerful -- you can't map CUDA's, or Pytorch's, or Tensorflow's model of the world onto them.
AI models are not all matrix multiplications, and they tend to involve other operations. Also, they change super fast, much faster than hardware cycles, so if your hardware isn't general-purpose enough, the field will move past you and obsolete your hardware before it comes out.
AI models are mostly matrix multiplications and have been that way for a few years now, which is longer than a hardware cycle. Moreover, if the structure changes then the hardware changes regardless of whether it's general purpose or not, because then it has to be optimized for the new structure.
Everybody cares about VRAM right now yet you can get a P40 with 24GB for 10% of the price of a 24GB RTX 4090. Why? No tensor cores, the things used for matrix multiplication.
I really hope we see an AI-PU (or with some other name, INT16PU, why not) for the consumer market sometime soon. Or be able to expand GPU memory using a PCIe socket (not sure if that's technically possible).
My uninformed question about this is: why can't we make the VRAM on GPUs expandable? I know you need to avoid having the data traverse some kind of bus that trades overhead for wide compatibility, like PCIe, but if you only want to use it for more RAM, can't you just add more sockets whose traces go directly to where they're needed? Even if it's only compatible with a specific type of chip, it would seem worthwhile for the customer to buy a base GPU and add on however much VRAM they need. I've heard of people replacing existing RAM chips on their GPUs[0], so why can't this be built in as a socket, like motherboards use for RAM and CPUs?
Expandable VRAM on GPUs has been tried before - the industry just hates it. It's like Apple devices - want more internal storage? Buy a new computer so we can have the fat margins.
The original REV A iMac in the late 90s had slotted memory for its ATI card, as one example - it shipped with 2MB and could be upgraded to 6MB after the fact with a 4MB SGRAM DIMM. There are also a handful of more recent examples floating around.
While I'm sure there are also packaging advantages to be had by directly soldering memory chips instead of slotting them etc, I strongly suspect the desire to keep buyers upgrading the whole card ($$$) every few years trumps this massively if you are a GPU vendor.
Put another way, what's in it for the GPU vendor to offer memory slots? Possibly reduced revenue, if it became industry norm.
Expansion has to answer one fundamental question: if you're likely to need more X tomorrow, why aren't you just buying it today?
The answer to this question almost has to be "because it will be cheaper to buy it tomorrow." However, GPUs bundle together RAM and compute. If RAM is likely to be cheaper tomorrow, isn't compute also probably going to be cheaper?
If both RAM and compute are likely cheaper tomorrow, then the calculus still probably points towards a wholesale replacement. Why not run/train models twice as quickly alongside the RAM upgrades?
> I strongly suspect the desire to keep buyers upgrading the whole card ($$$) every few years trumps this massively if you are a GPU vendor.
Remember as well that expandable RAM doesn't unlock higher-bandwidth interconnects. If you could take the card from five years ago and load it up with 80 GB of VRAM, you'd still not see the memory bandwidth of a newly-bought H100.
If instead you just need the VRAM and don't care much about bandwidth/latency, then it seems like you'd be better off using unified memory and having system RAM be the ultimate expansion.
> The answer to this question almost has to be "because it will be cheaper to buy it tomorrow."
No, it doesn't. It could just as easily be "because I will have more money tomorrow." If faster compute is $300 and more VRAM is $200 and I have $300 today and will have another $200 two years from now, I might very well like to buy the $300 compute unit and enjoy the faster compute for two years before I buy the extra VRAM, instead of waiting until I have $500 to buy both together.
But for something which is already a modular component like a GPU it's mostly irrelevant. If you have $300 now then you buy the $300 GPU, then in two years when you have another $200 you sell the one you have for $200 and buy the one that costs $400, which is the same one that cost $500 two years ago.
This is a much different situation than fully integrated systems because the latter have components that lose value at different rates, or that make sense to upgrade separately. You buy a $1000 tablet and then the battery goes flat and it doesn't have enough RAM, so you want to replace the battery and upgrade the RAM, but you can't. The battery is proprietary and discontinued and the RAM is soldered. So now even though that machine has a satisfactory CPU, storage, chassis, screen and power supply, which is still $700 worth of components, the machine is only worth $150 because nothing is modular and nobody wants it because it doesn't have enough RAM and the battery dies after 10 minutes.
hmm seems you're replying as a customer, but not as a GPU vendor...
the thing is, there's not enough competition in the AI-GPU space.
Currently the only option for not wasting time on running some random research project from GitHub? Buy some card from Nvidia; CUDA can run almost anything on GitHub.
AMD GPU cards? That really depends...
And gamers often don't need more than 12GB of GPU RAM for running games at 4K, so most high-VRAM customers are in the AI field.
> If you could take the card from five years ago and load it up with 80 GB of VRAM, you'd still not see the memory bandwidth of a newly-bought H100.
this is exactly what nvidia will fight against tooth-and-nail -- if this is possible, its profit margin could be slashed to 1/2 or even 1/8
Replacing RAM chips on GPUs involves resoldering and similar things - those (for the most part) maintain the signal integrity and performance characteristics of the original RAM. Adding sockets complicates the signal path (iirc), so it's harder for the traces to go where they're needed, and realistically given a trade-off between speed/bandwidth and expandability I think the market goes with the former.
The problem with GPUs is they're designed to be saturated.
If you have a CPU and it has however many cores, the amount of memory or memory bandwidth you need to go with that is totally independent, and memory bandwidth is rarely the bottleneck. So you attach a couple memory channels worth of slots on there and people can decide how much memory they want based on whether they intend to have ten thousand browser tabs open or only one thousand. Neither of which will saturate memory bandwidth or depend on how fast the CPU is, so you don't want the amount of memory and the number of CPU cores tied together.
If you have a device for doing matrix multiplications, the amount of RAM you need is going to depend on how big the matrix you want to multiply is, which for AI things is the size of the model. But the bigger the matrix is, the more memory bandwidth and compute units it needs for the same number of tokens/second. So unlike a CPU, there aren't a lot of use cases for matching a small number of compute units with a large amount of memory. It'd be too slow.
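As a back-of-envelope illustration of that coupling (assumed ballpark figures, not measurements): for single-batch token generation, every token streams all the weights through memory once, so tokens/second is roughly bandwidth divided by model size, and pairing a small bandwidth budget with a huge model just gives you a slow device.

```python
def max_tokens_per_sec(model_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on single-batch decode speed: each token reads all weights once."""
    return bandwidth_gb_s / model_gb

# Illustrative numbers only.
print(max_tokens_per_sec(13, 1000))  # ~77 tok/s: 13 GB of weights on a ~1 TB/s card
print(max_tokens_per_sec(13, 350))   # ~27 tok/s: same weights on a ~350 GB/s card
print(max_tokens_per_sec(40, 350))   # ~9 tok/s: a bigger model on that slower card
```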
Meanwhile the memory isn't all that expensive. For example, right now the spot price for 64GB of GDDR6 is less than $200. Against a $1000 GPU which is fast enough for that much, that's not a big number. Just include it to begin with.
Except that they don't. The high end consumer GPUs are heavy on compute and light on memory. For example, you can get the RTX 4060Ti with 16GB of VRAM. The RTX 4090 has four times as much compute but only 50% more VRAM. There would be plenty of demand for a 4090 that cost $200 more and had four times as much VRAM, only they don't make one because of market segmentation.
Obviously if they don't do that then they're not going to give one you can upgrade. But you don't really want to upgrade just the VRAM anyway, what you want is for the high performance cards to come with that much VRAM to begin with. Which somebody other than Nvidia might soon provide.
Technically we definitely can, but are there sufficiently many people willing to pay a sufficiently high premium for that feature? How much more would you be willing to pay for an otherwise identical card that has the option to expand RAM, and do you expect that a significant portion of buyers would want to pay a non-trivial up-front cost for that possibility?
> Then we'll see somewhat simpler, less general, and cheaper devices that do "AI" operations with as little excess hardware baggage as possible. Nice.
Apple has already been doing this for a few years now. The NPU is totally different from the GPU or CPU on the die itself[1]. Nvidia is likely working on this as well, but I think a device that's a gaming/entertainment/crypto/AI bundle (i.e. sticking with the video card) is probably a better business move.
The NPUs on a lot of different systems occupy an awkward spot. For extremely small models, they're the way to go for low-power inference. But once you reach LLM or vision transformer size, it makes a lot more sense to switch to GPU shaders for that extra bit of large-model performance. For stuff like Llama and Stable Diffusion, those Neural Engines are practically wasted silicon. The biggest saving grace is projects like ONNX attempting to sew them into a unified non-15-competing-standards API, but even that won't change how underpowered they are.
Nvidia escapes this by designing their GPU architecture to incorporate NPU concepts at a fundamental level. It's less redundant silicon and enables you to scale a single architecture instead of flip-flopping to whichever one is most convenient.
It's currently doable for Apple – I think their strategy is to slowly enhance iPhones, bit by bit, with special-purpose models for dealing with media like photo subject identification, OCR (in every language!), voice transcription, etc. Apple's currently learning from Microsoft's attempts to make AI stick everywhere.
I think Apple is more interested in features that work consistently than in giving power users the ability to play with essentially alpha or beta AI features.
I would guess that their strategy is to not include powerful client-side hardware, and supplement that with some kind of "AiCloud" subscription to do the battery-draining, heat-generating stuff on their cloud. They're trading off their branding as a privacy focused company under the (probably correct) belief that people will be more willing to upload their data to iCloud's AI than Microsoft's.
Fwiw, I think they're probably correct. It has always struck me as odd that people want to run AI on their phone. My impression of AI is that it creates very generalized solutions to problems that would be difficult to code, at the cost of being very compute inefficient.
I don't really want code like that running on my phone; it's a poor platform for it. Thermal dissipation and form factor limit the available processing power, and batteries limit how long you can use the processing power you have. I don't really want to waste either trying to do subject identification locally. I'm going to upload the photos to iCloud anyways; let me pay an extra $1/month or whatever to have that identification happen in the cloud, on a server built for it that has data center thermal dissipation and is plugged into the wall.
>I'm going to upload the photos to iCloud anyways; let me pay an extra $1/month or whatever to have that identification happen in the cloud, on a server built for it that has data center thermal dissipation and is plugged into the wall.
You might be in an area with a poor connection and unable to reach the cloud.
One use for AI is speech recognition / transcription for deaf/HoH individuals. Up until now it has been done almost exclusively on the cloud, and it works fairly well (depending on conditions). Recently there's been an interest in doing it locally, without relying on a network connection.
There's also privacy issues with transmitting this data over a network.
> It has always struck me as odd that people want to run AI on their phone. My impression of AI is that it creates very generalized solutions to problems that would be difficult to code, at the cost of being very compute inefficient.
I don't equate AI with coding. I want AI locally for photo sorting and album management, for general questions answering/list making that I use GPT for, and any number of other things.
I try not to upload personal data to sites that aren't E2E encrypted, so iCloud/Google photos is a no-go.
The pinch (as far as I can see it) is that you're right, and Apple can't sell a freestanding service to save their life. If we do get an AppleGPT pay-as-you-go service, it's certain to be extraordinarily censored and locked-down as the exclusive first-party option on iPhone. It will feature "vertical integration" that no other AI can have, alongside censorship so prudish that it would make Maury Povich gasp.
So... I think users will be stuck. They'll want to run uncensored models on their phone, but Apple will want to keep them in the walled garden at any cost. It feels like the whole "Fortnite" situation all over again, where users can agree they want something but Apple can't decide.
Anyone checked out the NPU on the new iPad? It’s supposed to be a bazillion times better according to Apple but I haven’t had a chance to dig into the reality.
I guess we can assume this is going to be what’s used in what’s being called Apple’s first AI phone, iPhone 16.
That 38 TOPS figure was a bit weird: it's literally below the 45 TOPS baseline for the "AI PC" branding Qualcomm/Intel/Microsoft are launching this June, and also 10x less than typical GPUs. I think it was just clever marketing exploiting the fact that the "AI PC" branding hasn't launched yet.
For reference, Nvidia's Jetson Orin NX robotics platform is 35-50 TOPS on average. Apple is catching up, but Nvidia still has by-far the more flexible (and better scaled) platform.
> On kernels such as flash attention, TMA and the L2 cache are both fast enough so as to hide these problems reasonably well. But to make full use of the hardware, memory requests must be coalesced and bank conflicts avoided.
The depth of the competition is also starting to become apparent. There’s no way the documentation error was totally an accident. Diagrams are the easiest to steal / copy and there must have been some utility for nvidia to have left this in place. Remember when Naveen Rao’s Nervana was writing NVidia Maxwell drivers that out-performed NVidia’s own? Not every documentation mishap in a high-growth product is a competition counter-measure, but given that the researchers spent so long reverse-engineering wgmma and given the China-US political situation of the H100 in particular, it seems NVidia is up to its old tricks to protect its moat.
So don’t over-study the H100 peculiarities, as “what hardware does AI want?” really encompasses the commercial situation as well.
I don't understand. If they document their stuff with errors, it will hurt users, be they Chinese or US? Or is it expected that US users will call Nvidia to ask for the correct documentation?
It could be a case of classic market segmentation. The lower tier customers get the incomplete or error-ridden documentation, and the upper tier trusted customers^W'partners' get access to the juicy stuff: complete and mostly correct documentation, including stuff intentionally left out of the lower tier package like application notes containing secret hardware handshakes to unlock hidden features, all under strict NDA of course.
The vast majority of users use NVidia's own kernels rather than optimize their own. And those who do write custom kernels are typically not trying to compete with NVidia's own GEMM.
That's a good point - floating point operations are implemented with integer-math circuits (or at least can be - I'm not privy to how modern chip manufacturers implement them). E.g: your ALU may have an 11-bit adder specifically to add your f64 exponents.
Wait but nvidia tensor-cores are exactly the hardware that likes 16x16 tiles, no? I thought that was the whole point? The hardware is already here and I'm sceptical if there is another order of magnitude in performance to be gained from even more specialized designs.
Knowing what portion of the FLOPs are in the tensor cores isn't quite the right thing to be looking at. The key question is how much more tensor core performance can be gained by reducing or eliminating the die area devoted to non-tensor compute and higher-precision arithmetic. Most of NVIDIA's GPUs are still designed primarily for graphics: they have some fixed-function units that can be deleted in an AI-only chip, and a lot of die space devoted to non-tensor compute because the tensor cores don't naturally lend themselves to graphics work (though NVIDIA has spent years coming up with ways to not leave the tensor cores dark during graphics work, most notably DLSS).
So the claims that NVIDIA's GPUs are already thoroughly optimized for AI and that there's no low-hanging fruit for further specialization don't seem too plausible, unless you're only talking about the part of the datacenter lineup that has already had nearly all fixed-function graphics hardware excised. And even for Hopper and Blackwell, there's some fat to be trimmed if you can narrow your requirements.
Some fraction of your transistors MUST go unused on average or you melt the silicon. This was already a thing in the 20nm days and I'm sure it has only gotten worse. 100% TDP utilization might correspond to 60% device utilization.
That's true for CPUs. Does it really apply to GPUs and other accelerators for embarrassingly parallel problems where going slower but wider is always a valid option?
And yet, even NVIDIA does trim it from chips like the H100, which has no display outputs, RT cores, or video encoders (though they keep the decoders), and only has ROPs for two of the 72 TPCs.
it's going to be awkward in consumer hardware either way
if you segregate AI units from the GPU, the thing is both AI and GPUs will continue to need massive amounts of matrix multiplication and as little memory latency as possible
the move to have more of it wrapped in the GPU makes sense but at least in the short and medium term, most devices won't be able to justify the gargantuan silicon wafer space/die growth that this would entail - also currently Nvidia's tech is ahead and they don't make state of the art x86 or ARM CPUs
for the time being I think the current paradigm makes the most sense, with small compute devices making inroads in the consumer markets as non-generalist computers - note that more AI-oriented pseudo-GPUs already exist and are successful since the earlier Nvidia Tesla lineup and then the so-called "Nvidia Data Center GPUs"
> Then there's a trend towards much shorter numbers. 16 bit floating point? 8 bit? 2 bit? 1 bit?
There was that recent paper titled "The Era of 1-bit LLMs" [0] which was actually suggesting a 1.58-bit LLM (2 bits in practice).
> Someone reading this is probably writing it in VHDL right now, or will be soon.
Yeah, I think I'm in the "will be soon" camp - FPGA board has been ordered. Especially with the 2-bit data types outlined in that paper [0] and more details in [1]. There's really a need for custom hardware to do that 2-bit math efficiently. Customizing one of the simpler open source RISC-V integer implementations seems like something to try here adding in the tiled matrix registers and custom instructions for dealing with them (with the 2 bit data types).
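As a software stand-in for what that hardware would consume, here's a small sketch (the 2-bit encoding is my own assumption, not taken from the papers) that packs ternary weights {-1, 0, +1} four to a byte and unpacks them again:

```python
import numpy as np

CODE = {-1: 0b10, 0: 0b00, 1: 0b01}              # assumed 2-bit encoding
DECODE = np.array([0, 1, -1, 0], dtype=np.int8)  # indexed by the 2-bit code

def pack(w: np.ndarray) -> np.ndarray:
    """Pack a ternary weight vector (length a multiple of 4) into bytes."""
    codes = np.array([CODE[int(x)] for x in w], dtype=np.uint8).reshape(-1, 4)
    return codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)

def unpack(packed: np.ndarray) -> np.ndarray:
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).reshape(-1)
    return DECODE[codes]

w = np.array([1, -1, 0, 1, 0, 0, -1, 1], dtype=np.int8)
assert (unpack(pack(w)) == w).all()
# With ternary weights, "multiplication" degenerates to add/subtract/skip,
# which is exactly what makes tiny custom matrix units attractive.
```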
> NVIDIA’s lies. This is an extraordinarily misleading representation of the actual 128b swizzled wgmma layout. This diagram cost us three weeks of life that we will not get back, hence the public shaming.
Wondering if anyone would be surprised that a huge amount of progress in AI is on the engineering side (optimizing matmuls), and that a huge portion of the engineering is about reverse engineering NVIDIA chips
Architecture doesn't make a difference. Big enough models trained with big enough data tend to give the same results regardless of architecture. So yes, most advances in AI are mostly due to the fact we can now multiply matrices very fast.
That's not completely true. The architecture must behave well for scaling, which is not trivial. Basic multi-layer perceptrons do not scale well for example, the gradient will vanish or explode deeper in the network.
How do modern foundation models avoid multi-layer perceptron scaling issues? Don't they have big feed-forward components in addition to the transformers?
They rely heavily on what we call residual or skip connections. This means each layer does something like x = x + f(x). This helps the training a lot, ensuring the gradient can flow nicely through the whole network.
This is heavily used in ResNets (residual networks) for computer vision, and is what allows training much deeper convolutional networks. And transformers use the same trick.
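For anyone who wants x = x + f(x) in code, here's a minimal PyTorch sketch (layer sizes arbitrary); the identity path is what keeps gradients healthy even when many of these are stacked:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + f(x): the skip connection gives gradients a direct path backwards."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)

# Stack 50 of them and gradients still reach the input through the identity path.
net = nn.Sequential(*[ResidualBlock(64) for _ in range(50)])
x = torch.randn(8, 64, requires_grad=True)
net(x).sum().backward()
print(x.grad.abs().mean())   # stays at a sane magnitude despite the depth
```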
I'm in the industry and nobody has done that for over ten years. There was just a small phase after Hinton published "Greedy layer-wise training of deep networks" in 2007 when people did it, for a few years at most. But already with the rise of LSTMs in the 2010s this wasn't done anymore, and now with transformers it isn't either. Would you care to share how you reached your conclusion? It matches none of my experience over the last 15 years, and we also train large-scale LLMs in our company. There's just not much point to it when gradients don't vanish.
Not easy to give a concise answer here, but let me try:
The problem mainly occurs in networks with recurrent connections or very deep architectures. In recurrent architectures this was solved via LSTMs with the signal gates. In very deep networks, e.g. ResNet, this was solved via residual connections, i.e. skip connections over layers. There were also other advances, such as replacing sigmoid activations with the simpler ReLU.
Transformers, which are the main architecture of modern LLMs, are highly parallel without any recurrence, i.e. at any layer you still have access to all the input tokens, whereas in an RNN you process one token at a time. To solve the potential problem due to "deepness" they also utilize skip connections.
idk, they do give the same results, but given the memory bottleneck it feels like we're at a point where architecture innovations matter again. For example, check out the DeepSeek V2 tech report: they modded the model architecture specifically for lower-cost inference (by making the k/v cache smaller).
There was some awareness reading the article, yet "we're warping through the quadrant in our tensor accelerator" is pretty Trek.
Have had that thought occasionally with some of the other articles. What it must read like to somebody who gets a ref link for an article over here. Wandered into some Trek nerd convention discussing warp cores.
I believe that reducing the power consumption and increasing the speed of AI inference will be best served by switching to analog, approximate circuits. We don't need perfect floating-point multiplication and addition, we just need something that takes an two input voltages and produces an output voltage that is close enough to what multiplying the input voltages would yield.
I know someone working in this direction; they've described the big challenges as:
* Finding ways to use extant chip fab technology to produce something that can do analog logic. I've heard CMOS flash presented as a plausible option.
* Designing something that isn't an antenna.
* You would likely have to finetune your model for each physical chip you're running it on (the manufacturing tolerances aren't going to give exact results)
The big advantage is that instead of using 16 wires to represent a float16, you use the voltage on 1 wire to represent that number (which plausibly has far more precision than a float32). Additionally, you can e.g. wire two values directly together rather than loading numbers into an ALU, so the die space & power savings are potentially many, many orders of magnitude.
> which plausibly has far more precision than a float32
If that was true, then a DRAM cell could represent 32 bits instead of one bit. But the analog world is noisy and lossy, so you couldn't get anywhere near 32 bits of precision/accuracy.
Yes, very carefully designed analog circuits can get over 20 bits of precision, say A/D converters, but they are huge (relative to digital circuits), consume a lot of power, have low bandwidth as compared to GHz digital circuits, and require lots of shielding and power supply filtering.
This is spit-balling, but the precision of the circuits you could create for a neural-network-type chip is certainly under 8 bits, maybe 6 bits. But it gets worse. Unlike digital circuits, where a signal can be copied losslessly, a chain of analog circuits compounds the noise and accuracy losses stage by stage. To make it work you'd need frequent requantization to prevent getting nothing but mud out.
You can get 8bit analog signal resolution reasonablyish easyish. The Hagen mode [1] of BrainScaleS [2] is essentially that. But.. yeah. No way in hell you are getting more than 16bit with that kind of technology, let alone more.
And those things are huge, which leads to very small network sizes. This is partially due to the fabrication node, but also simply because the tooling for analog circuits is even less well developed than for digital ones, which in turn lag software compilers.
> which plausibly has far more precision than a float32
+/- 1e-45 to 3.4e38. granted, roughly half of that is between -1 and 1.
When we worked with low power silicon, much of the optimization was running with minimal headroom - no point railing the bits 0/1 when .4/.6 will do just fine.
> Additionally, you can e.g. wire two values directly together rather than loading numbers into an ALU
You may want an adder. Wiring two circuit outputs directly together makes them fight, which is usually bad for signals.
an analog value in such a chip has far, far less resolution than a float32. Maybe you get 16 bits of resolution, more likely 8, and your multiplications are going to be quite imprecise. The whole thing hinges on the models being tolerant of that.
I think we're far away from analog circuits being practically useful, but one place where we might embrace the tolerance for imprecision is in noisy digital circuits: accepting that one in a million, say, bits in an output will be flipped to achieve a better performance/power ratio. Probably not when working with float32s, where a single infinity[1] could totally mess things up, but for int8s the occasional 128 when you wanted a 0 seems like something that should be tolerable.
[1] Are H100s' matrix floating point units actually IEEE 754 compliant? I don't actually know.
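A toy experiment along those lines (the error model is entirely made up, not a hardware measurement): flip one random bit in roughly one-in-a-million int8 operands of a large dot product and see how far the result drifts.

```python
import numpy as np

rng = np.random.default_rng(1)

def flip_bit(x: int, bit: int) -> int:
    """Flip one bit of an int8 value held as a Python int."""
    v = (x & 0xFF) ^ (1 << bit)          # operate on the raw byte
    return v - 256 if v >= 128 else v    # map back into signed int8 range

a = rng.integers(-128, 128, size=1_000_000, dtype=np.int8).astype(np.int64)
b = rng.integers(-128, 128, size=1_000_000, dtype=np.int8).astype(np.int64)
exact = int(a @ b)

noisy = a.copy()
for i in np.flatnonzero(rng.random(a.shape) < 1e-6):    # ~1 flipped operand per million
    noisy[i] = flip_bit(int(noisy[i]), int(rng.integers(0, 8)))

approx = int(noisy @ b)
print(exact, approx, abs(exact - approx))   # each flip shifts the sum by at most 128*128
```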
I'd go a step further, something which resembles how "wet brains" (biological) actually work, but which could be produced easily.
Biological neural networks are nowhere near as connected as ANNs, which are typically fully connected. With biological neurons, the ingress / egress factors are < 10. So they are highly local
It is also an entirely different model, as there is no such thing as backpropagation in biology (that we know of).
What they do have in lieu of backpropagation is feedback (cycles).
And maybe there are support cells/processes which are critical to the function of the CNS that we don't know of yet.
There could also be a fair amount of "hard coded" connectedness, even at the higher levels. We already know of some. For instance, it is known that auditory neurons in the ears are connected, and something similar to a "convolution" is done in order to localize sound sources. It isn't an emergent phenomenon - you don't have to be "trained" to do it.
This is not surprising, given that life has had billions of years and a comparable number of generations in order to figure it out.
I guess in theory this could all be done in software. However, given the tens of billions of neurons (and trillions of synapses) in primate/human brains, this would be incredibly challenging on even the thousand-core machines we have nowadays. And before you scream "cloud", it would not have the necessary interconnectedness/latency.
It would be cool if you could successfully model, say, a worm/insect with this approach.
> What they do have in lieu of backpropagation is feedback (cycles)
I wonder where the partial data / feedback is stored. Don't want to sound like a creationist, but it seems very improbable that "how good my sound localization is" is inferred exclusively from the # of children I have.
What do you mean by impossible? You are aware that what radio equipment does is often the equivalent of analog operations like multiplication, addition, etc., just at high frequencies?
Sure accuracy is an issue, but this is not as impossible as you may think it would be. The main question will be if the benefits by going analog outweigh the issues arising from it.
In general the problem with analog is that every sequential operation introduces noise. If you're just doing a couple of multiplications to frequency-shift a signal up and down, that's fine. But if you've got hundreds of sequential steps, and you're also trying to pack huge numbers of parallel steps into a very small physical area, that noise compounds into a real problem.
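Here's a toy numerical version of that argument (the per-stage noise level is an arbitrary assumption): each "analog" stage applies a gain and adds a little Gaussian noise, and the spread of the output grows with the depth of the chain.

```python
import numpy as np

rng = np.random.default_rng(0)

def analog_chain(x: float, n_stages: int, gain: float = 1.0, noise_std: float = 0.01) -> float:
    for _ in range(n_stages):
        x = x * gain + rng.normal(0.0, noise_std)   # one noisy analog multiply
    return x

for depth in (2, 10, 100, 1000):
    samples = [analog_chain(1.0, depth) for _ in range(2000)]
    print(f"{depth:>4} stages: output std ≈ {np.std(samples):.3f}")
# The spread grows roughly with sqrt(depth): fine for a two-stage mixer,
# a real problem for a hundreds-deep multiply-accumulate pipeline.
```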
Realistically, you'd train your model the same way it's done today and then custom-order analog ones with the weights programmed in. The advantage here would be faster inference (assuming analog circuits actually work out), but custom manufacturing circuits would only really work at scale.
I don't think reprogrammable analog circuits would really be feasible, at least with today's tech. You'd need to modify the resistors etc. to make it work.
Maybe because that is a VERY different problem than the one discussed here.
Building a single analog chip with 1 billion neurons would cost billions of dollars in a best case scenario. A Nvidia card with 1 billion digital neurons is in the hundreds of dollars of range.
Those costs could come down eventually, but at that point CUDA may be long gone.
Have you done much AI work against AMD products? I'm not going to plunk down $2500+ for an RTX 4090, but have been considering an RX 7900XTX for playing around with, or at least getting started. Just curious how well it will or won't work in practice, or if saving a bit more and getting a 7900 XT over the XTX might be a better option, and how much less vram might impact usefulness in practice.
My only work with consumer AMD GPUs was mining ethereum, I had 150,000 of them.
If you want to use enterprise AMD gpus, I'm renting them. That said, I haven't even had a chance to run/play with them myself yet, they have been rented since I got them last month.
Caveat emptor and your mileage may vary; but unlike nVidia where you could just assume that everything is compatible with everything, for AMD I'd strongly recommend that you try before you buy - consider renting a cloud machine with that GPU to check if the software works for your needs before committing to a large purchase.
Good writing is clear and unambiguous. With speech there is an opportunity to interrupt and ask for clarification. Writing has one chance to get the message across. A reader shouldn't have to consult knowyourmeme.com to figure out what the heck the authors are trying to say. I don't even know what the title means here. That's how far they've missed the mark.
Wow, that really sucks for you. I just read it in 5 minutes and feel much more informed about the subject of nvidia memory twizzlization. It's kind of funny to me that presumably young college guys are writing in a style that's very readable for my old ass.
Even if you're not familiar with the "go brrr" meme (which is the only use of meme-idiom in the article and is used exactly twice), its meaning is easily inferred via context clues from the opening paragraphs.
As someone who witnessed A-10 CAS fuck some stuff up in a combat zone, i.e. the real "brrrrt", I've been mystified by the meme and its current usage. No one knows where it comes from, nor the slaughter it represents.
as intense as an A-10 might be, it's short lived and only affects a few dudes on the receiving end. When the Federal Reserve goes brrr, it has far-reaching impact that affects every single person in the global economy.
I also enjoyed the article's style. I utterly despise "academic paper speak". It is, imho, not the most effective style to communicate complex ideas. I find it so much easier to learn from a more casual "blog post" or in-person presentation over stiff, rigid academic speak.
I find both to be useful in different stages. The casual style is very helpful when starting out. But once I have put in a few weeks or months of study in, then the rigor and preciseness of academic style is good as well.
I agree with you in the sense that something has "died" in writing that follows academic paper speak these days. Just yesterday I saw an ancient article on systems analysis by Strachey, surfaced by Scientific American and Peter Norvig. It uses quite a bit of formal language but is super approachable at the same time. That kind of skill is rarely seen these days.
Also, write your own cuda kernel to do vector-matrix multiplication (if you use pycuda, you can focus on the kernel, and write everything else with python). Just tell chatgpt that you want to write your own implementation that multiplies a 4000-element vector by 4000x12000 matrix, and to guide you through the whole process.
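If it helps to see the shape of that exercise, here's a deliberately naive PyCUDA sketch (assuming PyCUDA is installed and a CUDA GPU is present): one thread per output column, no shared-memory tiling, just a correctness baseline you'd then optimize.

```python
import numpy as np
import pycuda.autoinit                     # creates a context on the default device
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule(r"""
__global__ void vecmat(const float *v, const float *M, float *out, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= cols) return;
    float acc = 0.0f;
    for (int r = 0; r < rows; ++r)
        acc += v[r] * M[r * cols + col];   // M is row-major
    out[col] = acc;
}
""")
vecmat = mod.get_function("vecmat")

rows, cols = 4000, 12000
v = np.random.randn(rows).astype(np.float32)
M = np.random.randn(rows, cols).astype(np.float32)
out = np.empty(cols, dtype=np.float32)

block = (256, 1, 1)
grid = ((cols + block[0] - 1) // block[0], 1)
vecmat(cuda.In(v), cuda.In(M), cuda.Out(out),
       np.int32(rows), np.int32(cols), block=block, grid=grid)

# Loose tolerance: float32 accumulation order differs between GPU and NumPy.
assert np.allclose(out, v @ M, atol=1e-2)
```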
For renting gpus, runpods is great - right now they have everything from lower tier gpus to h100s. You can start with a lesser gpu at the beginning.
Never tried those, so I couldn't say. I guess it would.
Even so, creating all the abstractions needed to implement even regular matrix multiplication in Spiral in a generic fashion took me two months, so I'd consider that good enough exercise.
You could do it a lot faster by specializing for specific matrix sizes, like in the Cuda examples repo by Nvidia, but then you'd miss the opportunity to do the tensor magic that I did in the playlist.
NNs for example are (mostly) a sequence of matrix multiplication operations, and GPUs are very good at those. Much better than CPUs. AI is hot at the moment, and Nvidia is producing the kind of hardware that can run large models efficiently which is why it's a 2 trillion-dollar company right now.
However, in the Spiral series, I aim to go beyond just making an ML library for running NN models and break new ground.
Newer GPUs actually support dynamic memory allocation, recursion, and the GPU threads have their own stacks, so you could in fact treat them as sequential devices and write games and simulators directly on them. I think once I finish the NL Holdem game, I'll be able to get over 100x fold improvements by running the whole program on the GPU versus the old approach of writing the sequential part on a CPU and only using the GPU to accelerate a NN model powering the computer agents.
I am not sure if this is a good answer, but this is how GPU programming would be helpful to me. It all comes down to performance.
The problem with programming them is that the program you are trying to speed up needs to be specially structured, so it utilizes the full capacity of the device.
would be interested to see thunderkittens (great name!) tackle the flash attention backwards pass, which is an order of magnitude harder than the forward
good news - we've actually included optimized causal and non-causal versions of the flash attention backwards pass with TK - would love for you to check them out!
Hasn't this research been done by teams building NPUs today? E.g. chips built by Groq use an architecture built specifically for AI, which is why they're able to deliver the performance they do. On the consumer side, Apple silicon is also quite capable.
I'm not in this field at all, but it seems to me that using general purpose processors that communicate over (relatively) slow lanes can only get us so far. Rethinking the design at the hardware level, and eventually bringing the price down for the consumer market seems like a better long-term strategy.
>On the consumer side, Apple silicon is also quite capable.
I am not sure that is true. A glance (or long stay) at the reddit localllama subreddit basically shows a bunch of frustrated CPU users trying their absolute best to get anything to work at useful speeds.
When you can get an Nvidia GPU for a few hundred dollars, or a full-blown gaming laptop with a 4050 and 6GB of VRAM for $900, it's hard to call CPU-based AI capable.
Heck we don't have GPUs at work, and CPU based is just not really reasonable without using tiny models and waiting. We ended up requesting GPU computers.
I think there is a 'this is technically possible', and there is a 'this is really nice'. Nvidia has been really nice to use. CPU has been miserable and frustrating.
Actually, llama.cpp running on Apple silicon uses the GPU (Metal compute shaders) to run LLM inference. Token generation is also very memory-bandwidth bottlenecked. On high-end Apple silicon the bandwidth is about 400GB/s to 800GB/s, comparable to the NVIDIA RTX 4090, which has a memory bandwidth of about 1000GB/s. Not to mention that Apple silicon has a unified memory architecture and offers high-memory models (128GB, up to 192GB), which is necessary to run large LLMs like Llama 3 70B, which takes roughly 40~75GB of RAM to work reasonably.
The number of people running llama3 70b on NVidia gaming GPUs is absolutely tiny. You're going to need at least two of the highest end 24 GB VRAM GPUs and even then you are still reliant on 4 bit quantization with almost nothing left for your context window.
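The arithmetic behind that (a quick sketch, weights only, ignoring the KV cache and activations):

```python
params = 70e9                      # Llama 3 70B parameter count
for bits in (16, 8, 4):
    gb = params * bits / 8 / 1e9   # bytes per parameter = bits / 8
    print(f"{bits:>2}-bit weights: ~{gb:.0f} GB")
# 16-bit ≈ 140 GB, 8-bit ≈ 70 GB, 4-bit ≈ 35 GB -- so a pair of 24 GB cards only
# fits the 4-bit version, with little headroom left for the context window.
```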
I don't think NVIDIA's reign will last long. The recent AI resurgence is not even a decade old. We can't expect the entire industry to shift overnight, but we are seeing rapid improvements in the capability of non-GPU hardware to run AI workloads. The architecture change has been instrumental for this, and Apple is well positioned to move the field forward, even if their current gen hardware is lacking compared to traditional GPUs. Their silicon is not even 5 years old, yet it's unbeatable for traditional workloads and power efficiency, and competitive for AI ones. What do you think it will be capable of in 5 years from now? Same for Groq, and other NPU manufacturers. Betting on NVIDIA doesn't seem like a good long-term strategy, unless they also shift their architecture.
Absolutely, it's like living in London and eventually having to accept that tourists will always say "Big Ben" when they mean the clock tower of the Palace of Westminster, which encloses the bell whose actual name is Big Ben. The name of the tower is, de facto, Big Ben, and life gets so much easier when you drop the urge to tell people they are wrong all the time...
Edit: TIL the tower was properly renamed "Elizabeth Tower" in 2012 [0] but I seriously doubt a single person in the last 12 years has ever used that name...
I am missing the reference to the canadian goose and the retriever puppy as spirit animals. Is that to say the H100 is an ornery thing, but the RTX4090 is friendly?
I’d assumed (like you) it meant that the H100 is ornery AND pickier about what it consumes, while the RTX4090 is playful and will eat damn near anything within reach of its mouth (with its sharp, velociraptor-like puppy teeth), whether you want it to or not.
I consider the English habit of using nouns as adjectives a bad one, because it causes many ambiguities, some of which can be very annoying, even if they are a rich source of jokes and word plays.
In most languages the use of a noun as an adjective is marked, by a particle or by an affix or at least by a different stress pattern (like moving the stress to the last syllable), which removes the ambiguities.
So for most non-native speakers "Canadian goose" makes much more sense than "Canada goose" (which may feel like "Canada and a goose" or "a goose that is also Canada" and not like "a goose from Canada").
The former noun always describes the latter: a "butter fly" is not flying butter but a fly made of butter (as my children's teacher told them, to make a joke about "butterfly").
In the names "Canada Goose", "Long Island Shellfish" and "Dublin Bay Prawns", "Canada", "Long Island" and "Dublin Bay" are adjectives, because geese are not also "Canada", shellfish are not also "Long Island" and prawns are not also "Dublin Bay".
This kind of name is typical of English, but not of most other languages.
For instance, the scientific name of the Canada goose is "canadensis", which means "Canadian", not "Canada".
An adjective (in the broad sense) is a word that describes a subset of the set named by the noun to which it is attached.
While most languages also include distinct words that are adjectives in the narrow sense, i.e. which have degrees of comparison, adjectives in the broad sense (sometimes called relational adjectives) can be derived from any noun by various means, e.g. genitive case markers, prepositions, postpositions, suffixes, prefixes or accentual patterns, except for ambiguous languages like English, where any noun can also be used as an adjective, and sometimes also as a verb.
"For this post, we’re going to focus on the NVIDIA H100 [... because] we think the trends it implies are going to continue in future generations, and probably from other manufacturers, too."
Is it though? Wouldn't we expect to see more advanced packaging technology eventually?
If that happens the increased memory bandwidth could be an enabler for a unified memory architecture like in the Nvidia Jetson line. In turn that would make a lot of what the article says make GPU go Brr today moot.
One of my biggest struggles in doing AI stuff on consumer hardware is heat. I noticed zero discussion of this so I assume it's an implementation detail on small systems that doesn't really factor into more robust setups. Is that the really case, or is this just diving into the comp sci layer of hardware utilization and ignoring things like heat because it's not salient to this subtopic?
It factors into robust setups but is part and parcel of doing any HPC where you're pushing through a ton of TFLOPS. It's a problem that is assumed to have been solved when you're doing this kind of work.
NVIDIAs stock will plummet in 3-4 years after Microsoft and Meta stop spending tens of billions without having a specific use for H100's and end up with a ridiculous amount of excess capacity. Hopefully, that means some H100-based systems will end up on eBay in ~5-8 years for home lab use.
I'm sure they will. Right now, though, it's bleeding edge, and it'll take some time for these ideas to mature and be adapted to the particular idioms of these more stable packages.
That’s the whole point, VCs invested heavily in GPUs anticipating a crypto boom and when that never happened they had to find some other snake oil to peddle that happened to require GPUs.
My experience is that when crypto was in the news, my non-technical friends, family, and colleagues would ask me what is bitcoin and were generally confused.
My experience with the AI boom couldn't be more different - everyone from my colleagues to my mum are using chatgpt as a daily tool.
I really don't think that AI and crypto are comparable in terms of their current practical usage.
Comparing crypto and AI is really tired and you make the best point - real people are using these GPUs to actually do things of value and improve their daily lives.
At the peak of the crypto boom/hype cycle I took on a little project to look at the top 10 blockchain networks/coins/whatever.
From what I could tell a very, very, very generous estimate is that crypto at best has MAUs in the low tens of millions.
ChatGPT alone got to 100 million MAUs within a year of release and has only grown since.
ChatGPT 10x'd actual real world usage of GPUs (and resulting power and other resources) in a year vs ~15 years for crypto.
> I really don't think that AI and crypto are comparable in terms of their current practical usage.
GPUs stopped being used for crypto because Ethereum switched from PoW to PoS and that decimated the whole gpu mining industry. Ethereum was the only profitable thing to mine, that also had a usecase. The rest of the chains dumped in price and became unprofitable to mine at scale. Not enough market depth to unload the tokens at scale.
> GPUs in AI provide far more value and utility than crypto ever has regardless of how/if it’s mined with them or not.
Two points:
1. GPUs for AI != GPUs for crypto. They are not the same systems. There is a tiny bit of overlap, but most people bought underpowered gpus that were focused on ROI. For example, you didn't need more than 8GB ram. You also didn't need ultrafast networking.
Wow, what a difference in perspective. I've met maybe a few people, period, that have even mentioned that they've ever used AI tools in their personal lives, frequency be damned. Maybe you're just a lot more insistent in weaving questions about AI tools into daily conversation.
In a work setting at a tech company, there seems to be a handful that are very in love with AI, a bunch that use it here or there, and a large majority that (at least publicly) don't even use it. It'd be interesting to see what company-enforced spyware would say about AI uptake, though, for real.
NVIDIA is so damn good at its job that it took over the market.
There's no regulatory or similar barriers to entry. It's literally that they do a damn good job and the competition can't be as good.
You look at that and want to take a sledgehammer to a golden goose? I don't get these people
True: nvidia has been consistently investing for over a decade.
They saw there was nascent compute use of GPUs, using programmable shaders. They produced CUDA, made it accessible on every one of their GPUs (not just the high-markup professional products) and they put resources into it year after year after year.
Not just investing in the product, also the support tools (e.g. a full graphical profiler for your kernels) and training materials (e.g. providing free cloud GPU credits for Udacity courses) and libraries and open source contributions.
This is what it looks like when a company has a vision, plans beyond the next quarter, and makes long-term investments.
The better alternative is to root for AMD and others to develop their own products so that regardless of breaking NV up or not, there are alternative solutions for people to use. They all leapfrog each other with new releases now any way. Why put all your eggs into one basket.
We've rooted for that for years, but looking at what AMD does and doesn't do, I've lost hope for this. AMD don't seem to want to do what it takes; it's not that they're trying and failing, but they're simply not even committing to attempt to do the same things that nVidia does for their software infrastructure.
We are still early. I started my bet on Lisa Su around August of last year... she publicly doubled down on AI around October/November. Dec 6th, MI300x was announced.
Big ships take time to course correct. Look at their hiring for AI related positions and release schedule for ROCm. As well as multiple companies like mine springing up to purchase MI300x and satisfy rental demand.
It is only May. We didn't even receive our AIA's until April. Another company just announced their MI300x hardware server offering today.
I am going to ignore AMD till/if they get their shit together. They have lost any goodwill or trust many gpu generations ago. It is really up to them to make up for it.
George Hotz went down the AMD rabbit hole for a while and concluded that the driver software — more precisely the firmware which runs on the cards themselves — is so badly written that there's no hope of them becoming serious contenders in AI without some major changes in AMD's priorities.
I'm not defending their software. It does honestly have a ton of issues.
George Hotz tried to get a consumer card to work. He also refused my public invitations to have free time on my enterprise cards, calling me an AMD shill.
AMD listened and responded to him and gave him even the difficult things that he was demanding. He has the tools to make it work now and if he needs more, AMD already seems willing to give it. That is progress.
To simply throw out George as the be-all and end-all of a $245B company... frankly absurd.
The fact that consumer and "pro"(?) GPUs don't use (mostly) the same software is not confidence inspiring. It means that AMD's already apparently limited capacity for software development is stretched thinner than it otherwise would be.
Also, if the consumer GPUs are hopelessly broken but the enterprise GPUs are fine, that greatly limits the number of people that can contribute to making the AMD AI software ecosystem better. How much of the utility of the NVIDIA software ecosystem comes from gaming GPU owners tinkering in their free time? Or grad students doing small scale research?
I think these kinds of things are a big part of why NVIDIA's software is so much better than AMD right now.
> that greatly limits the number of people that can contribute to making the AMD AI software ecosystem better
I’d say it simply dials it down to zero. No one’s gonna buy an enterprise AMD card for playing with AI, so no one’s gonna contribute to that either. As a local AI enthusiast, this “but he used consumer card” complaint makes no sense to me.
> No one’s gonna buy an enterprise AMD card for playing with AI
My hypothesis is that the buying mentality stems from the inability to rent. Hence, me opening up a rental business.
Today, you can buy 7900's and they work with ROCm. As George pointed out, there are some low level issues with them, that AMD is working with him to resolve. That doesn't mean they absolutely don't work.
Agreed that AMD needs to work on the developer flywheel. Again, not defending their software.
One way to improve the flywheel and make the ecosystem better, is to make their hardware available for rent. Something that previously was not available outside of hyperscalers and HPC.
> To simply throw out George as the be-all and end-all of a $245B company... frankly absurd.
I didn't do that, and I don't appreciate this misreading of my post. Please don't drag me into whatever drama is/was going on between you two.
The only point I was making was that George's experience with AMD products reflected poorly on AMD software engineering circa 2023. Whether George is ultimately successful in convincing AMD to publicly release what he needs is beside the point. Whether he is ultimately successful in convincing their GPUs to perform to his expectations is beside the point.
Clearly you've experienced some kind of personality clash and/or a battle of egos. I can't fault you for holding a low opinion of him as a result, but I'm unimpressed with personal beefs being used as evidence to impeach credibility.
My point is as I wrote in both posts. George was able to demonstrate evidence of poor engineering which "reflected poorly on AMD". From this I could form my own conclusion that AMD aren't in an engineering position to become "serious contenders in AI".
The poor software engineering evident on consumer cards is an indictment of AMD engineers, and the theoretical possibility for their enterprise products to have well engineered firmware wouldn't alleviate this indictment. If anything it makes AMD look insidious or incompetent.
Egohotz is brilliant in many ways, but taking him at his word when it comes to working with others has been a mistake since at least around 2010. This is well documented.
Who said anything about taking him at his word? Everything he has done regarding AMD GPUs has been in public. I'm sure there are plenty of valid criticisms one can make of his skills/strategy/attitude/approach, but accusing him of being generally untrustworthy in this endeavour is utterly nonsensical.
GPU compute is already broken up - there is a supply chain of other cooperating players that work together to deliver GPU compute to end users:
TSMC, SK hynix, Synopsys, cloud providers (Azure/Amazon etcetera), model providers (OpenAI/Anthropic etcetera).
Why single out NVidia in the chain? Plus the different critical parts of the chain are in different jurisdictions. Split up NVidia and somebody else will take over that spot in the ecosystem.
This interview with Synopsys is rather enlightening: https://www.acquired.fm/episodes/the-software-behind-silicon...
How does the profit currently get split between the different links? Profit is the forcing variable for market cap and profit is the indicator of advantage. Break up NVidia and where does the profit move?
What is needed are true NPUs as dedicated co-processors, especially for prosumer desktop systems (devs, other professionals, gamers). GPUs work in the enterprise, but they're a hassle to use for AI on the personal computing side of the market. Especially VRAM limitations, but also the lack of a standard open API other than Vulkan (again, using video stuff for AI).
Compared to CUDA, Vulkan is... not fun to code compute in! The serialization bridge and duplicating data structures and functions between CPU and GPU is tedious.
this is why people would do better to study neuroscience and psychology if they want to advance research in AI.
also things related to graph topology in neural networks maybe, but probably not related to artificial NN.
I was given this video, which I found was pretty interesting: https://www.youtube.com/watch?v=nkdZRBFtqSs (How Developers might stop worrying about AI taking software jobs and Learn to Profit from LLMs - YouTube)
> I doubt neuroscience will either, but I’m not as sure on that
The stuff on spiking networks and neuromorphic computing is definitely interesting and inspired by neuroscience, but it currently seems mostly like vaporware
The question is whether current AI technologies represent any progress towards a true human equivalent artificial general intelligence. Most likely not, but no one knows for sure. If the answer turns out to be no then real progress will likely require theoretical insights from psychology, neuroscience, and other fields.
Fwiw, I don’t think we’re any closer to general intelligence then we were 5 years ago.
Other than that, I agree, especially since you added “and other fields.”
Psychology might eventually give us a useful definition of “intelligence,” so that’d be something.
Obviously all research can influence other areas of research.
It's easy to overstate, but shouldn't be understated either with, as an example, solving problems with learning in AI providing insights into how dopamine works in brains.
There are obvious, huge differences between what goes on in a computer and what happens in a brain. That neurons can't do backpropagation is a glaring one. But they do do something that ends up being analogous to backpropagation, and you can't tell a priori whether some property of AI or neuroscience might be applicable to the other or not.
The best way to learn about AI isn't to learn neuroscience. it's to learn AI. But if I were an AI lab I'd still hire someone to read neuroscience papers and check to see whether they might have something useful in them.
There are loads of psychologists and neuroscientists today. Have any of them produced anything advancing AI in the last few years? The proof of the pudding is in the eating, so if they have, at a higher rate than people from straight CS/mathematics and related fields, then there's probably some truth to it.
tangential: When @sama talks about "Universal Basic Compute" (UBC) as a substitute for Universal Basic Income, obviously he means GPU, right? Who's going to benefit from such policies? Only nvidia? It just seems such a dystopian future to live in: imagine you can sell your UBC to others who know better how to use it, or you can use it to mine bitcoin or whatever. But all the compute is actually created by one company.
There are many reasons to hate nvidia, but honestly if this UBC policy is even remotely being considered in some circles, I'd join Linus Torvalds and say "nvidia, fuck you".
Him saying this always puts me off. It gives off hard old-sales-guy vibes. I really wonder who/which demographic is influenced in Nvidia's favor by this rhetoric.