Hacker News
Advances in semiconductors are feeding the AI boom (ieee.org)
153 points by mfiguiere 30 days ago | 118 comments



Wild that the human brain can squeeze in 100 trillion synapses (very roughly analogous to model parameters / transistors) in a 3lb piece of meat that draws 20 Watts. The power efficiency difference may be explainable by the much slower frequency of brain computation (200 Hz vs. 2GHz).

My impression is that the main obstacle to achieving a comparable volumetric density is that we haven't cracked 3d stacking of integrated circuits yet. Very exciting to see TSMC making inroads here:

> Recent advances have shown HBM test structures with 12 layers of chips stacked using hybrid bonding, a copper-to-copper connection with a higher density than solder bumps can provide. Bonded at low temperature on top of a larger base logic chip, this memory system has a total thickness of just 600 µm...We’ll need to link all these chiplets together in a 3D stack, but fortunately, industry has been able to rapidly scale down the pitch of vertical interconnects, increasing the density of connections. And there is plenty of room for more. We see no reason why the interconnect density can’t grow by an order of magnitude, and even beyond.

It's hard to imagine not getting unbelievable results when in 10-30 years we have GPUs with a comparable number of transistors to brain synapses that support computation speed 10,000x faster than the brain. What a thing to witness!


Ignoring for the moment that transistors and synapses are very different in their function, the current in a CPU transistor is in the milliampere range, whereas in the ion channels of a synapse it is in the picoampere range. The voltage differs by roughly a factor of ten. So the wattage differs by a factor of 10^10.

One important reason for the difference in current is that transistors need to reliably switch between two reliably distinguishable states, which requires a comparatively high current, whereas synapses are very analog in nature. It may not be possible to reach the brain’s efficiency with deterministic binary logic.
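
To spell out where the 10^10 comes from, a back-of-envelope sketch using the figures as stated here (the milliampere figure itself is disputed further down the thread):

    import math

    # Rough per-device power ratio, using the figures quoted above.
    i_transistor = 1e-3    # claimed: milliampere range per CPU transistor
    i_synapse = 1e-12      # picoampere range for synaptic ion channels
    v_ratio = 10           # supply voltage vs. membrane potential, roughly 10x

    power_ratio = (i_transistor / i_synapse) * v_ratio
    print(f"~10^{round(math.log10(power_ratio))}")  # -> ~10^10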


>the current in a CPU transistor is in the milliampere range

? you sure about that? in a single transistor? over what time period, more than nanoseconds? milliamps is huge, and there are millions of transistors on a single chip these days, and with voltage drops of ... 3V? .7V? you're talking major power. FETs should be operating on field more than flow, though there is some capacitive charge/discharge.


Single transistors in modern processes switch currents orders of magnitude lower than milliamps. More like micro- to picoamps. There's leakage to account for too as features get smaller and smaller due to tunneling and other effects, but still in aggregate the current per transistor is tiny.

Also the transistors are working at 1V or lower, but as you say they are FETs and don't have the same Vbe drop as a BJT.


You are right, I mixed this up. If you take a CPU running at 100 W with 10 billion transistors (not quite realistically assumed to all be wired in parallel) at 1 V, you would get an average of 0.01 microamps. So the factor would reduce to roughly 10^5.
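
A minimal sketch of that averaging, just to make the arithmetic explicit:

    # Average current per transistor for a 100 W CPU at 1 V with 10 billion
    # transistors, under the (admittedly unrealistic) all-in-parallel assumption.
    power = 100.0            # W
    voltage = 1.0            # V
    n_transistors = 10e9

    total_current = power / voltage              # 100 A
    avg_current = total_current / n_transistors
    print(avg_current)                           # 1e-08 A = 0.01 microamps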


Wait a minute, a lot of those transistors are switching the same currents since they are in series. Also, FETs only draw most current while switching, so in between switches there's almost no flow of electrons. So in fact you cannot calculate things that way.


Yes, as I said the parallel assumption is not quite realistic, and the number is an average, covering all states a transistor may be in. So it amounts to a rough lower bound for when a transistor is switching.


Thank you for this - the name neural networks has made a whole generation of people forget that they have an endocrine system.

We know things like sleep, hunger, fear, and stress all impact how we think, yet people still want to build this mental model that synapses are just dot products that either reach an activation threshold or don't.


Fortunately for academics looking for a new start in industry, this widespread misunderstanding has made it all too easy to transition from a slow-paced career in computational neuroscience to an overwhelmingly lucrative one in machine learning!


There have been people on HN arguing that the human brain is a biological LLM, because they can't think of any other way it could work, as if we evolved to generate the next token, instead of fitness as organisms in the real world. Where things like eating, sleeping, shelter, avoiding danger, social bonds, reproduction and child rearing are important. Things that require a body.


It's also frustrating because LLMs aren't even the only kind of AI/ML out there, they're just the kind currently getting investment and headlines.


I'm one of those people. To me those things just sound like a different prompt: priorities set for the LLM.


Isn’t that taking the analogy too literally? You’re saying nature is prompting humans to generate the next token to be output? What about all the other organisms that don’t have language? How do you distinguish nature's prompts from nature's training datasets? What makes you think nature is tokenized? What makes you think language generation is fundamental to biology?


Here's the hubris of thinking that way:

I would imagine the baseline assumption of your thinking is that things like sleep and emotions are a 'bug' in terms of cognition (or at the very least, 'prompts' that are optional).

Said differently, the assumption is that with the right engineer, you could reach human-parity cognition with a model that doesn't sleep or feel emotions (after all what's the point of an LLM if it gets tired and doesn't want to answer your questions sometimes? Or even worse knowingly deceives you because it is mad at you or prejudiced against you).

The problem with that assumption is that as far as we can tell, every being with even the slightest amount of cognition sleeps in some form and has something akin to emotional states. As far as we can prove, sleep and emotions are necessary preconditions to cognition.

A worldview where the 'good' parts of the brain (reasoning and logic) are replicated in LLM but the 'bad' parts (sleep, hunger, emotions, etc.) are not is likely an incomplete model.


Do airplanes need sleep because they fly like birds who also require sleep?


Ah a very fun 'snippy' question that just proves my point further. Thank you.

No airplanes do not sleep. That's part of why their flying is fundamentally different than birds'.

You'll likely also notice that birds flap their wings while planes use jet engines and fixed wings.

My entire point is that it is foolish to imagine airplanes as mechanical birds, since they are in fact completely different and require their own mental models to understand.

This is analogous to LLMs. They do something completely different than what our brains do and require their own mental models in order to understand them completely.


I'm reluctant to ask, but how do ornithopters fit into a sleep paradigm?


Great follow up!

Ornithopters are designed by humans who sleep - the complex computers needed to make them work replicate things humans told them to do, right?

It is a very incomplete model of an ornithopter to not include the human.


Here, it's actually fun to respond to your comment in another way, so let's try this out:

Yes, sleep is in fact a prerequisite to planes flying. We have very strict laws about it actually. Most planes are only able to fly because a human (who does sleep) is piloting it.

The drones and other vehicles that can fly without pilots were still programmed by a person (who also needed sleep) FWIW.


They do need scheduled maintenance.


Birds flap their wings and maneuver differently. They don't fly the same way.


People will spout off about how machine learning is based on the brain while having no idea how the brain works.


It is based on the brain, but only in the loosest possible terms; ML is a cargo cult of biology. It's kind of surprising that it works at all.


It works because, well, it's actually pretty primitive at its core. The whole learning process is pretty brutal: millions of iterations with random (and semi-random) adjustments.


I think I've fallen into the "it's just a very fancy kind of lossy compression" camp.


Honestly once you understand maximum-likelihood estimation, empirical risk minimization, automatic differentiation, and stochastic gradient descent, it's not that much of a surprise it works.


> Ignoring for the moment that transistors and synapses are very different in their function, the current in a CPU transistor is in the milliampere range...

That seems implausible. Apple's M2 has 20 billion transistors and draws 15 watts at full power [1]. Even assuming that 90% of those transistors are for cache and not logic, that would still be 2 billion logic transistors * 1 milliampere = 2 million amperes at full power. That would imply a voltage of 7.5 microvolts, which is far too low for silicon transistors.

[1] https://www.anandtech.com/show/17431/apple-announces-m2-soc-...
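
The same reductio spelled out, with the numbers as stated above:

    # Sanity check of the milliampere claim against the M2's published figures.
    n_logic = 2e9              # assume only 10% of 20e9 transistors are logic
    i_per_transistor = 1e-3    # the claimed 1 mA per transistor
    power = 15.0               # W at full load, per the linked article

    total_current = n_logic * i_per_transistor   # 2e6 A
    implied_voltage = power / total_current
    print(implied_voltage)                       # 7.5e-06 V -- implausibly low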


A single-precision flop is on the order of a pJ. [1] A single transistor would be much less.

[1] https://arxiv.org/pdf/1809.09206.pdf


How do you get to 10^10? I might be missing a fundamental of physics here (asking genuinely).


The more of our logic we can implement with addition, the more can be offloaded to noisy analog systems with approximate computing. It would be funny if model temperature stopped being metaphorical.


Synapses are very analog, but then the neuronal soma (cell body) and axon are pretty pulsatile again. A spike is fired or it isn't!


A simple discretization of the various levels of signal at each input/output, a discretization to handle time-of-propagation (which is almost surely part of the computation just because it _can be_ and nature probably hijacks all mechanisms), and a further discretization to handle the various serum levels in the brain, which are either inputs, outputs, or probably both.

Just add a factor 2^D transistors for each original "brain transistor" and re-run your hardware. Hope field effects don't count, and cross your fingers that neurons are idempotent!

Easy! /s

Modelling an analog system in digital will always have a combinatorial curse of dimensionality. Modelling a biological system is so insanely complex I can't even begin to think about it.


No, neurons are the equivalent of transistors; synapses are the equivalent of connections between transistors. There are about 100 billion neurons but 100 trillion connections. A neuron can have up to 10,000 synapses, while a transistor has on average only about 3 connections. Impressive nonetheless on power efficiency.


So the actual equation should be 100 billion times 100 trillion times clock speed? (I have zero idea of how our brain works.)

Even if we have a 10-times-higher clock speed, a trillion-transistor GPU is still very slow, comparatively speaking?

Surely memory has to come in somewhere?
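
One hedged way to make the comparison concrete is ops per joule rather than raw element counts. This sketch treats each synapse as one "op" per firing event (a big simplification, as replies below note) and takes ~1 PFLOP/s at ~700 W as a ballpark for a current datacenter GPU:

    # Very rough throughput-per-watt comparison using numbers from this thread.
    synapses = 100e12
    brain_rate_hz = 200               # upper-end firing rate quoted above
    brain_ops_per_s = synapses * brain_rate_hz    # ~2e16 "ops"/s at ~20 W

    gpu_flops = 1e15                  # ballpark FP16 throughput, datacenter GPU
    gpu_power_w = 700

    print(brain_ops_per_s / 20)       # ~1e15 ops per joule
    print(gpu_flops / gpu_power_w)    # ~1.4e12 FLOPs per joule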


# synapses should be analogous to # model parameters no? And # model parameters should be linear in # transistors.


We can't even come close to saying our current networks rival synapses in performance or function, because architecturally we still use feedforward networks: no recursion, no timing elements, very static connections. Transistors will definitely have some advantages in terms of being able to synchronize information and steps to an infinitely better degree than biological neurons, but as long as we stick with transformers it's the equivalent of trying to get to space by stacking sand. Could you get there eventually? Yes, but there are better ways.


>> # synapses should be analogous to # model parameters no?

I think they're equivalent to a parameter AND the multiplier. Or in analog terms they'd just be a resistor whose value can be changed. Digital stuff is not a good fit for this.


> Or in analog terms they'd just be a resistor whose value can be changed

For what it's worth, that's actually a thing (ReRAM/memristors), but I think it got put on the back burner because it requires novel materials and nobody figured out how to cost-effectively scale up the fabrication versus scaling up flash memory. I saw some mention recently that advances in perovskite materials (a big deal lately due to potential solar applications) might revive the concept.
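
For what it's worth, the "resistor whose value can be changed" framing maps directly onto how a resistive crossbar computes a matrix-vector product: conductances act as weights and currents sum on the row wires. A toy sketch with made-up conductance values (not a real device model):

    import numpy as np

    # Toy analog crossbar: I = G @ V via Ohm's and Kirchhoff's laws.
    G = np.array([[1.0e-6, 2.0e-6, 0.5e-6],
                  [3.0e-6, 1.0e-6, 1.5e-6]])   # siemens, one per crosspoint
    V = np.array([0.2, 0.1, 0.3])              # input voltages on the columns

    I_out = G @ V    # current summed on each row wire = analog dot product
    print(I_out)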


> The power efficiency difference may be explainable by the much slower frequency of brain computation (200 Hz vs. 2GHz).

Partly, but also because the brain has an asynchronous data-flow design, while the GPU is synchronous, and as you say clocked at a very high frequency.

In a clocked design the clock signal needs to be routed to every element on the chip which requires a lot of power, the more so the higher the frequency is. It's a bit like the amount of energy used doing "battle ropes" at the gym. The heavier the ropes (cf more gates the clock is connected to), the more power it takes to move them, and the faster you want to move them (cf faster clock frequency) the more power it takes.

In a data-flow design, like the brain, there is no clock. Each neuron fires, or not, independent of what other neurons are doing, based on their own individual inputs. If the inputs are changing (i.e. receiving signal spikes from attached neurons), then at some threshold of spike accumulation the neuron will fire (expending energy). If the inputs are not changing, or at a level below threshold, then the neuron will not fire.

To consider the difference, imagine our visual cortex when we're looking at a seagull flying across a blue sky. The seagull represents a tiny part of the visual field, and is the only part that is moving/changing, so there are only a few neurons whose inputs are changing and which themselves will therefore fire and expend energy. The blue sky comprising the rest of the visual field is not changing, and we therefore don't expend any energy reprocessing it over and over.

In contrast, if you fed a video (frame by frame) of that same visual scene into a CNN being processed on a GPU, then it does not distinguish between what is changing or not, so 95% of the energy processing each frame will be wasted, and this will be repeated frame by frame as long as we're looking at that scene!
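
A toy sketch of that difference in per-frame work, not a real spiking model, just counting the elements each approach has to touch:

    import numpy as np

    # Dense pass: touch every pixel every frame.
    # Event-driven pass: only touch pixels that changed (the "seagull").
    rng = np.random.default_rng(0)
    prev = rng.random((480, 640))
    curr = prev.copy()
    curr[200:210, 300:320] += 0.5        # small moving patch in a static scene

    dense_work = curr.size               # one op per pixel, every frame
    changed = np.abs(curr - prev) > 1e-3
    event_work = int(changed.sum())      # only the changed pixels

    print(dense_work, event_work)        # 307200 vs 200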


> In a clocked design the clock signal needs to be routed to every element on the chip which requires a lot of power, the more so the higher the frequency is.

Clock only needs to be distributed to sequential components like flip-flops or SRAMs. The number of clock-distribution wire-millimeters in a typical chip is dwarfed by the number of data wire-millimeters, and if a neural network is well trained and quantized, activations should be random, so the number of transitions per clock should be 0.5 (as opposed to 1 for clock wires), meaning that power can't be dominated by the clock. The flip-flops that prevent clock skew are a small % of area, so I don't think those can tip the scales either. On the other hand, in asynchronous digital logic you need to have a valid-bit calculation on every single piece of logic, which seems like a pretty huge overhead to me.
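
The 0.5-transitions-per-clock point follows from the usual first-order dynamic power model, P = alpha * C * V^2 * f; a sketch with illustrative (not real-design) capacitance and frequency values:

    # Compare activity factor 1.0 (clock nets) with ~0.5 (random data nets).
    C = 1e-9        # switched capacitance in farads (illustrative)
    V = 1.0         # supply voltage in volts
    f = 2e9         # clock frequency in Hz

    p_clock_net = 1.0 * C * V**2 * f    # clock toggles every cycle
    p_data_net  = 0.5 * C * V**2 * f    # random data toggles about half the cycles
    print(p_clock_net, p_data_net)      # 2.0 W vs 1.0 W for the same capacitance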


There's obvious potential savings in not wasting FLOPs recalculating things unnecessarily, but I'm not sure how much of that could be realized by just building a data-flow digital GPU. The only attempt at a data-flow digital processor I'm aware of was AMULET (by ARM designer Steve Furber), which was not very successful.

There's more promise in analog chip designs, such as here:

https://spectrum.ieee.org/low-power-ai-spiking-neural-net

Or otherwise smarter architectures (software only or S/W+H/W) that design out the unnecessary calculations.

It's interesting to note how extraordinarily wasteful transformer-based LLMs are too. The transformer was designed part inspired by linguistics and part based on the parallel hardware (GPUs, etc.) available to run it on. Language mostly has only local sentence-structure dependencies, yet the transformer's self-attention mechanism has every word in a sentence paying attention to every other word (to some learned degree)! Turns out it's better to be dumb and fast than smart, although I expect future architectures will be much more efficient.
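
To make the "every word attends to every other word" cost concrete, a minimal sketch of the score matrix in vanilla self-attention (the n-squared term referred to above):

    import numpy as np

    n, d = 512, 64                          # sequence length, head dimension
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((n, d))
    K = rng.standard_normal((n, d))

    scores = Q @ K.T / np.sqrt(d)           # shape (n, n): n^2 pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    print(scores.shape)                     # (512, 512) -- quadratic in n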


IMHO the breakthrough will come with an analog computer.

The current systems simulate stuff through computation using switches.

It’s like real sand vs sand simulator in the browser. One spins your fans and drains your battery when showing you 1000 particles acting like sand, the other just obeys laws of physics locally per particle and can do millions of particles much more accurately and with very slight increase in temperature.

Of course, analog computations are much less controllable but in this case that’s not a deal breaker.


On the power usage, the difference is that those synapses are almost always in stand-by. The equivalent would be a CMOS circuit with a clock of minutes.

On the complexity, AFAIK a synapse is way more complex than a transistor. Larger too, if you include its share of the neuron's volume. And yes, the count difference is due to the 3D packing.


The brain’s 3d packing is via folding. Maybe something similar would be better than just stacking.


We also lose a lot when building computers due to the fact we have to convert the analog world into digital representations. A neural analog computer would be more efficient I think, and due to the non-deterministic nature of AI would probably suit the task as well.


Non-deterministic means random. AI or natural I is not random. Analog suffers immensely from noise and it is the reason the brain has such a large number of neurons, part to deal with noise and part to deal with losing some neurons along the way.


Non-deterministic doesn't mean random. Random means random. Non-deterministic means that specific inputs don't generate the same outputs. It says nothing about the distribution of the output values. A chaotic system isn't random, its just non-deterministic.


“Nondeterminism” gets used in a variety of ways, both across differing contexts and with conflicting uses in the same context, but chaotic systems are fully deterministic in the most common relevant sense; they are just highly sensitive to inputs, so even very small uncertainty in the inputs renders them largely unpredictable.
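
A stock example of that distinction is the logistic map: the update rule is fully deterministic, yet nearby inputs diverge, which is the "sensitive to inputs" sense described above:

    # Logistic map: deterministic rule, divergent outcomes for nearby inputs.
    def logistic(x, r=3.9, steps=50):
        for _ in range(steps):
            x = r * x * (1 - x)
        return x

    print(logistic(0.200000))   # same rule...
    print(logistic(0.200001))   # ...nearly identical start, very different result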


Analog suffers from noise? What do you think you get when you convert analog to digital? Way more loss of signal than whatever inbuilt noise the analog signal had in the first place.


There are neuromorphic companies like https://rain.ai


Interesting point.

I know of at least one startup working with that concept [1].

I'm sure there are others.

1 - https://www.extropic.ai/


It has some interesting implications: If we manage to replicate a human-like brain in silicon, it could think multiple million times faster than the biological equivalent. So even if it needs a lot of energy to run, it could "out-think" us by orders of magnitude and would have crazy reaction times. Just imagine a sentient being with the human mental equivalent of millions of years in one year. Crazy.

But we know very little about how biological brains actually work and very few connectomes have been fully mapped as of yet. We still cannot fully explain how C. elegans with 302 neurons "thinks".

Also every neuron contains the entire genome, which is hundreds of millions to billions of base pairs depending on the animal and we also don't really know how important it is to the function of the brain.

So the "state" of biological brains is potentially utterly gigantic even for simple animals.


By some accounts it takes a pretty sizable neural network to simulate a single neuron:

https://www.sciencedirect.com/science/article/pii/S089662732...

So we are going to need a lot of computational power to approximate what’s going on in an entire human brain.


Apples to oranges. Gate count indicates nothing when the architectures are nothing alike.

Brain is a spiking network with mutable connectivity, mostly asynchronous. Only the active path is spending energy at a single moment in time, and "compute" is tightly coupled with memory to the point of being indistinguishable. No need to move data anywhere.

In contrast, GPUs/TPUs are clocked and run fully connected networks, they have to iterate over humongous data arrays every time. Memory is decoupled from compute due to the semiconductor process differences between the two. As a result, they waste a huge amount of energy just moving data back and forth.

Fundamental advancements in SNNs are also required, it's not just about the transistors.


>> 100 trillion synapses (very roughly analogous to transistors?)

Not even remotely comparable

* It's unlikely synapses are binary. Candidly, they probably serve more than one purpose.

* Transistor count is a bad proxy for other reasons. A pipeline to do floats is not going to be useful for fetches from memory. "Where" the density lies is important.

* Power: On this front transistors are a joke.

* The brain is clockless, and analog... frequency is an interesting metric

Binary systems are going to be bad at simulating complex processes. LLMs are a simulation of intelligence, like forecasting is a simulation of weather. Lorenz shows us why simulation of weather has limits; there isn't some magical math that will change those rules for ML to make the leap to "AGI".


Transistor power is really not a joke. Synapses would take far far FAR more power at close to similar frequencies. Biological neurons are incredibly inefficient.


Biology is incredibly efficient at what it does well though. Thus only 20 watts of brain energy to coordinate everything we do. We didn't evolve to be mentats.


Neurons and synapses have more than 100 different neurotransmitters and their many receptors, there is reuptake and destructive enzyme activity, and they are connected to many thousands of their peers. Every single neuron is a dizzyingly complex machine, employing countless sub-machines and processes.

You can not reasonably compare this to model parameters or transistors at all.


Nature was “building” this stuff for eons. I feel pretty good about our progress in less than 100 years.


Neurons are not just linear algebra pieces, by the way. This is why there is a body of literature on the complexity of a single neuron. So comparing at the unit level at this point is apples vs. oranges.

But yes, the brain continues to be a surprising machine, and ML accomplishments are amazing for that machine.


I think the biggest difference is millions of years of evolutionary development. That's a lot of time.


Hmm maybe the frequency metric is off. If neurons can derive output from potentially N overlapping / multiplexed inputs .. it's as if the clock was N times higher too.


> is it just that we haven't cracked 3-d stacking of integrated circuits yet?

Yes. If we could stack transistors in the Z dimension as closely as we do in X and Y, we'd easily exceed the brain's density.


I wonder if this trajectory will only lead to reinventing the biological brain. It is hard to imagine the emergence of consciousness, as we know it, on a fundamentally deterministic system.


> The power efficiency difference may be explainable by the much slower frequency of brain computation (200 Hz vs. 2GHz).

Or by a more static design. A GPU can't do a thing without all the weights and shaders. There are benefits to this: you can easily swap one model for another. The human mind, on the other hand, is not reprogrammable. It can learn new tricks, but you cannot extract the firmware from one person and upload it to another.

Just imagine if every logical neuron of an AI were a real thing, with physical connections to other neurons as inputs. No more need for high-throughput memory, no more need for compute units running at gigahertz frequencies.


Why are static 3D cells going to get us there when other ideas have not? Is it necessary to replicate “arbitrary” academic ideas of consciousness (despite our best efforts, our models are always approximations) to make a useful machine?

“Living things” are not static designs off the drafter’s table. They’ll never be intelligent from their own curiosity, only from ours and the rules we embed, no matter how hard we push the puerile hallucinations embedded by Star Trek. It’s still a computer, and human agency does not have to bend to it.


For some sense of how far out/close 1 trillion transistors in one GPU is:

NVIDIA just announced Blackwell which gets to 208bn transistors on a chip by stitching two dies together into a single GPU. https://www.nvidia.com/en-us/data-center/technologies/blackw...

They’re sticking two of them on a board with a Grace CPU in between, then linking 36 of those boards together in racks with NVLink switches that offer “130TB/s of GPU bandwidth in one 72-GPU NVLink domain”.

In terms of marketing, NVidia calls one of those racks a GB200 NVL72 “super GPU”.

So on one level NVIDIA would say they already have a GPU with ‘trillions’ of transistors.


GPUs can still scale up by a LOT. The real limiting parameters are power consumption (which includes cooling), bus speed, and the production capacity of foundries.

Think super-[cross-fire/sli].

Economics will probably forbid that. This is a virtual limit which factors in the previous physical limits... in theory.


We already have a 4 trillion transistor "GPU" in the Cerebras WSE-3 (wafer-scale engine), used in Cerebras' data centers.

https://www.youtube.com/watch?v=f4Dly8I8lMY


How many regular-sized GPU dies (say a 4080 or Nvidia L4) can be cut out of a full-sized wafer? I suppose that's what OP means by reaching the integration density of a 1T-transistor GPU.


I think the whole premise of TFA is flawed, as there are already chips with way more than a trillion transistors (as GP points out).

Arguing about what is the size limit to consider something a GPU or not is a bit like bikeshedding.

As to why wouldn't a supercomputer be considered for this? Because it's not a single chip.


Here's some discussion about yields for the H100 per wafer. I'd assume a 4080 is smaller? Regardless, the calculator is supplied.

https://news.ycombinator.com/item?id=38588876


At close to the reticle limit of around 810 mm², you could fit around 65 chips.


It's somewhere around 20 if you go as large as possible IIRC.


Yeah that doesn't really count. It's the equivalent to like 20 GPUs and costs 200x as much.


Well, yeah, that's what a few trillion transistors look like with today's tech. No doubt it will get smaller/cheaper in the future (although EUV is expensive, so maybe prices won't keep dropping as much as in the past). The point is that even with today's tech a (4) trillion transistor chip can already be built.


It's meant to compete with nvidia's DGX systems with 8 GPUs per node.


Yes exactly. It's not comparable to one GPU.


This article has two authors.

One author is chairman of TSMC.

The other author is Chief Scientist of TSMC.

This is important to note because they clearly know some stuff and we should listen.


Two credited authors.


When the limits of digital reach the boundary of physics, analogue is going to make a comeback. The human brain feels nearer to analogue than digital. I will be surprised to reach AGI without nearing the Order-of-Magnitude of brain processing.

We need that ONE paper on analogue to end this quest of trillions (and counting) of transistors.


> I will be surprised to reach AGI without nearing the Order-of-Magnitude of brain processing.

I have some theories that this isn't necessary. 1.) Just because the brain is a general-purpose machine great at doing lots of things, doesn't mean it's great at each of those things. Like when two people are playing catch, and one of them sees the first fragments of a parabola and estimates where the ball is going to land- a computer can calculate that way more efficiently than a mind, despite the fact that both are quick enough to get the job done. 2.) While the brain is great at, say, putting names to faces... a good CV machine can do the job almost as well, and can annotate a video stream in real-time.

Combining 1.) the fact that some problems are much simpler to solve with classical algorithms instead of neural networking, and 2.) that many brain tasks can be farmed out to a coprocessor/service, my hypothesis is that the number of neurons/resources required to do the "secret sauce" part of agi could be greatly reduced.
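
Point 1 in practice: the ball-catching estimate is a couple of lines of closed-form physics rather than a network. A sketch with made-up inputs, assuming flat ground and no air drag:

    import math

    # Projectile range from launch speed and angle: R = v^2 * sin(2*theta) / g.
    def landing_distance(speed, angle_deg, g=9.81):
        angle = math.radians(angle_deg)
        return speed**2 * math.sin(2 * angle) / g

    print(landing_distance(speed=15.0, angle_deg=40.0))   # metres to touchdown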


I'm also in the camp that believes we won't reach AGI without significantly more compute. I think that consciousness is an emergent property, just like I think life itself is an emergent property. Both need a certain set of elements/systems to work. The secret sauce may look simple when we recreate the right conditions, but it's not going to be possible without the right hardware, so to speak.


But, take for instance, sight. Sighted people use a large percentage of their brain to process visual input. People born congenitally blind are no smarter or dumber for their brains not having to process that input- so clearly that's not the secret sauce.

I'm not convinced consciousness is emergent; I don't really have an opinion on _that_. But I'm >50% convinced that consciousness itself doesn't require a neural network as large as a human brain's.


Analogue circuits can be a real pain in the butt, imagine if integer overflows destroyed the whole ALU, haha.

Transistors are already much smaller than neurons. And of course the brain doesn’t have a clock. And neurons have more complex behavior than single transistors… The whole system is just very different. So, this doesn’t seem like a strategy to get past a boundary, it is more like a suggestion that we give up on the current path and go in a radically different direction. It… isn’t impossible but it seems like a wild change for the field.

If we want something post-cmos, potentially radically more efficient, but still familiar in the sense that it produces digital logic, quantum dot cellular automata with Bennett clocking seems more promising IMO.


I doubt it. Digital just scales better.

Our brain has a pretty bounded need of scaling, but once we create some computer equivalent, it would be very counterproductive to make it useless for larger problems for a small gain on smaller ones.


> Digital just scales better.

Yes!

> Our brain has a pretty bounded need of scaling

No!

Over aeons our brains scaled from several neurons to 100 billion neurons, each with 1000 synapses. They were able to do it because our brains are digital. They lean on their digital nature even more than computer chips do.

Action potentials are so digital it hurts. They aren't just quantized in level, but in the entire shape of the waveform across several milliseconds. Just as in computer chips, this suppresses perturbations. As long as higher level computation only depends on presence/absence of action potentials and timing, it inherits this robustness and allows scale. Rather than errors accumulating and preventing integration beyond a certain threshold, error resilience scales alongside computation. Every neuron "refreshes the signal," allowing arbitrary integration complexity at any scale, even in the face of messy biology problems along the way. Just like every transistor (or at least logic gate) "refreshes the signal" so that you can stack billions on a chip and quadrillions in sequential computation, even though each transistor is imperfect.

Digital computation is the way. Always has been, always will be.
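
A toy illustration of the "refreshes the signal" point above: per-stage thresholding keeps a bit intact through many noisy stages, whereas a purely analog chain lets the noise accumulate. The stage count and noise level are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    signal_analog = 1.0
    signal_digital = 1.0
    for _ in range(1000):                      # 1000 noisy stages
        noise = rng.normal(0, 0.05)
        signal_analog += noise                 # noise accumulates stage by stage
        signal_digital = 1.0 if signal_digital + noise > 0.5 else 0.0  # restored

    print(signal_analog)    # typically drifted well away from 1.0
    print(signal_digital)   # still exactly 1.0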


Computers are going to increase in size as they consume more and more power, requiring increasingly elaborate cooling systems.

Like how the transistor made the big and hot vacuum tubes obsolete, maybe we’ll see some analog breakthrough do the same thing to transistors, at least for AI.

I doubt there is a world where we use analog for general purpose computing, but it seems perfect for messy, probabilistic processes like thinking.


What's amazing is that the human brain does it all on the equivalent of like 20 watts of power. That's basically a couple of LED light bulbs.


Is there a comparison of the power efficiency of a human brain doing 50-digit multiplication vs. a multiplier circuit doing it?


I think the problem here would be figuring out how much of the brain's power draw to attribute to the multiplication. A brain is more akin to a motherboard than a single CPU, with all kinds of I/O, internal regulation, and other ancillary stuff going on all the time.


Is the issue then that we haven't discovered the magical algorithm run by our brain? If we discover it, then digital circuits will handsomely beat the brain.


We can surely build more efficient and capable hardware than our current evolved wetware, since all of the details of how to build it are generally externalized. If the chips had to fab themselves, it would be a different story.

The software is a different story. Sure, the brain does all sorts of things that aren't necessary for $TASK, but we aren't necessarily going to be able to correctly identify which are which. Is your inner experience of your arm motion needed to fully parse the meaning in "raise a glass to toast the bride and groom", or respond meaningfully to someone who says that? Or perhaps it doesn't really matter - language is already a decent tool for bridging disjoint creature realities, maybe it'll stretch to synthetic consciousness too.


All of computation is realised by very few arithmetic operations. Then test energy efficiency of wetware and hardware on those operations. Then any difference can be attributed to algorithms.


I work in analog.

1) Noise is an issue as the system gets complex. You can't get away with counting to 1 anymore; all those levels in between matter. 2) It's hard to make an analog computer reconfigurable. 3) Analog computers exist commercially, believe it or not, but for niche applications and essentially as coprocessors.


Quantization of parameters in neural networks is roughly analogous to introducing noise into analog signals. We’ve got good evidence that these architectures are robust to quantization - which implies they could be implemented over noisy analog signals.

Not sure who’s working on that but I can’t believe it’s not being examined.
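
In the spirit of that analogy, a toy version of the experiment: quantize or perturb a random linear map and measure the relative output error. Purely illustrative, not evidence about any real model:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 256)) * 0.05   # stand-in "weights"
    x = rng.standard_normal(256)

    W_q = np.round(W / 0.01) * 0.01              # crude uniform quantization
    W_n = W + rng.normal(0, 0.003, W.shape)      # additive analog-style noise

    y = W @ x
    print(np.linalg.norm(W_q @ x - y) / np.linalg.norm(y))   # small relative error
    print(np.linalg.norm(W_n @ x - y) / np.linalg.norm(y))   # comparably small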


Won't that mean that we just change the quest from higher and higher density of digital elements to higher and higher density of analog elements?

Like people weren't trying to make computers out of bigger and bigger tubes before the transistor, they were trying to make them out of smaller and smaller ones.



I may have misinterpreted this comment by thinking you meant that as we squeeze more and more transistors into a small amount of space, the resulting waveforms would start to resemble analog more so than digital.


What? Action potentials are extremely digital, in that small perturbations are suppressed so as to encode information in a higher level conceptual state (the "digit" / "bit" is the presence or absence of an AP) to obtain robustness and integration scale.


If we are just after AI now, we should drop the GPU concept and call it what it is: matrix multipliers. With that framing, we can move to in-memory compute so the data doesn't have to move around so much. Memory chips could have lines of MAC units at some intervals and something to compute the nonlinear functions after summing. Fixed sizes could be implemented in hardware, and software would spread larger computations over a number of tiles. If this were standardized we might see it end up as a modest premium on memory chips. Nvidia, step aside: it's going to be Micron, Hynix and friends revolutionizing AI.
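
A software sketch of how the fixed-size-tile idea might partition the work, with the tile size and the whole layout purely assumed for illustration:

    import numpy as np

    TILE = 64   # hypothetical hardware tile size

    def tiled_matvec(W, x):
        # Matrix-vector product done block-by-block, as per-bank MAC tiles might,
        # with a nonlinearity applied after the partial sums are accumulated.
        y = np.zeros(W.shape[0])
        for i in range(0, W.shape[0], TILE):
            for j in range(0, W.shape[1], TILE):
                y[i:i+TILE] += W[i:i+TILE, j:j+TILE] @ x[j:j+TILE]
        return np.maximum(y, 0.0)               # nonlinear function after summing

    W = np.random.default_rng(0).standard_normal((256, 256))
    x = np.random.default_rng(1).standard_normal(256)
    print(np.allclose(tiled_matvec(W, x), np.maximum(W @ x, 0.0)))   # True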


The Cerebras WSE-3 contains 4 trillion transistors, claims 8 exaflops, and has 20 PB/s of bandwidth. It has 62 times the cores of an H100: 900,000. I wonder if the WSE-3 can compete on price/performance though. Interesting times!


Is anyone actually using those WSEs in anger yet? They're on their third generation now, but as far as I can tell the discussion of each generation consists of "Cerebras announces new giant chip" and then radio silence until they announce the next giant chip.


Problem is Software. You can put out a XYZ trillion monster chip that beats anything hardware wise, but it is going nowhere if you don't have the tooling and massive community (like Nvidia has) to actually do some real A.I. stuff.


Unlikely. They cost so much that nobody is going to do research on them - at best it's porting existing models. And they're so different to GPUs that the porting effort is going to be enormous.

They also suffer from the global optimisation problem for layout of calculations so compile time is going to be insane.

Their WSE technology is also already obsolete - Tesla's chip does it in a much more logical and cost effective way.


They sold some. Not strictly speaking the same as using any but there's a decent chance some code is running on the machines.


The Cerebras-2 is at the Pittsburgh Supercomputing Center. Not sure if they ordered a 3.


> 62 times the cores of an H100.. 900,000.

More than that, arguably. CUDA cores are more like SIMD lanes than CPU cores in Cerebras's usage of 'core'. Since Cerebras has 4-wide tensor ops, there are arguably 3.6M CUDA-equivalent cores.


> 8 exaflops per sec

Your number is off by 64x.

It can do 125 petaflops at FP16

https://www.tomshardware.com/tech-industry/artificial-intell...


9 trillion flops per core? That's... mind-boggling. Is that real?

And, 9 trillion flops per core in 4.4 million transistors per core. That sounds a bit too good to be true.


No, the real performance is 125 petaflops at FP16. That is about 138 gigaflops per core.

You would need 64 of these to get 8 exaflops.

https://www.tomshardware.com/tech-industry/artificial-intell...


How's the single core-to-core bandwidth?


Yeah, I also think that better hardware generates a visible lead in the quality of the models that are released. For example: companies like OpenAI have had access to large quantities of H100s for a few months now and Sora is being presented, something I would not have believed a year ago. I would also believe that the Claude 3 models were trained on H100s. DBRX was trained on 12T tokens, a big difference compared to the 300B for the original GPT-3, and the new NovelAI image generation was trained on H100s and compared to the previous model is like night and day. It seems to create a generational jump.


> companies like OpenAI have had access to large quantities of H100 for a few months now and Sora is being presented

From what I could tell from Nvidia's recent presentation, Nvidia works directly with OpenAI to test their next gen hardware. IIRC they had some slides showing the throughput comparisons with Hopper and Blackwell, suggesting they used OpenAI's workload for testing.

H100's have been generally available (not a long waitlist) for only several months, but all the big players had them already 1 year ago.

I agree with you, but I think you might be 1 generation behind.

> OpenAI used H100’s predecessor — NVIDIA A100 GPUs — to train and run ChatGPT, an AI system optimized for dialogue, which has been used by hundreds of millions of people worldwide in record time. OpenAI will be using H100 on its Azure supercomputer to power its continuing AI research.

March 21, 2023 https://nvidianews.nvidia.com/news/nvidia-hopper-gpus-expand...


Very interesting, I guess it does make sense that GPT-4 was also trained on the Hopper architecture.


Hardware, or at least purely computational hardware, will never get the same accolades as the software that actually makes it do something.

Interface hardware, being perceptible to the senses, gets credit over software.

E.g. when people experience a vivid, sharp high resolution display, they attribute all its good properties to the hardware, even if there is some software involved in improving the visuals, like making fonts look better and whatnot.

If a mouse works nicely, people attribute it to the hardware, not the drivers.

If you work in hardware, and crave the appreciation, make something that people look at, hear, or hold in their hands, not something that crunches away in a closet.


I hear a lot about the energy efficiency of animal brains in comparison to e.g. GPUs. However, as far as I can tell most of the numbers reported are for adult brains, which effectively have been sparsified over time. Does anyone know of how the picture changes if we consider baby animal brains, which as I understand it have much denser connectivity and higher energy consumption than adult brains?


and how we'll find a way to soak up all those transistors for a perhaps actually worse user experience / societal outcome.



