Cerebras Systems unveils a 1.2T transistor chip for AI (venturebeat.com)
227 points by modeless on Aug 19, 2019 | 263 comments



There are far more transistors in this chip than neurons in the human brain. In 100 of these chips, there are more transistors than there are synapses in the human brain.

I don't mean to suggest that transistors are equivalent to neurons or synapses; clearly they are very different things (though it is not clear that neurons are much more computationally powerful than transistors, despite assertions from many people that this is the case). But I think it is still useful to compare the complexity of structure of chips vs. the human brain. We are finally approaching the same order of magnitude of complexity in structure.

Also note that this is not manufactured on TSMC's highest density process. Assuming TSMC's 3nm process development is successful, that will probably be 6x denser than the 16nm process used here.
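
For a rough sense of the orders of magnitude involved, here's a back-of-the-envelope sketch (the brain figures are commonly cited estimates, not measurements):

    # Back-of-the-envelope only; the brain figures are rough estimates.
    chip_transistors = 1.2e12     # this Cerebras wafer-scale chip
    brain_neurons    = 8.6e10     # ~86 billion neurons (estimate)
    brain_synapses   = 1.0e14     # ~100 trillion synapses (estimate)

    print(round(chip_transistors / brain_neurons))    # ~14 transistors per neuron
    print(round(brain_synapses / chip_transistors))   # ~83 of these chips to match the synapse count
    print(f"{chip_transistors * 6:.1e}")              # ~7.2e12 transistors per wafer if 3nm is ~6x denser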


A few buckets of sand also contain more elements than there are synapses in the human brain. They clearly lack any structure, and so are computationally stuck at a big fat '0' no matter what you do with the sand (unless you want to turn it into a giant abacus or into integrated circuits).

Where this chip sits on the scale between a few buckets of sand and an actual working human brain is not so much a function of its structure but of what it does, and the brain's finer structures are so complex that even modelling a few neurons is going to crush that complexity downwards by so many orders of magnitude that we simply have to hope that we are not accidentally throwing out the useful bits in the simulation.

That's a roundabout way of saying that I think that:

> We are finally approaching the same order of magnitude of complexity in structure.

Is not necessarily true in a way that is relevant. The parts count or basic interconnection may have nothing to do with how the brain is connected internally, nor with how it functions at the lowest levels.


> A few buckets of sand also contain more elements than there are synapses in the human brain.

Actually you'd need about a thousand buckets.

https://www.quora.com/How-many-grains-of-sand-could-fit-in-a...
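
For what it's worth, a quick sketch shows how strongly the answer depends on the grain size you assume (the grain sizes and bucket volume below are my assumptions), which is probably why estimates like this vary by orders of magnitude:

    # Rough estimate; grain diameter and bucket size are assumptions,
    # and the answer is very sensitive to the grain size you pick.
    synapses  = 1.0e14   # ~100 trillion synapses (figure used later in the thread)
    bucket_m3 = 0.01     # a 10-litre bucket

    for grain_mm in (0.05, 0.1, 0.5):          # very fine to medium sand
        grain_m3 = (grain_mm * 1e-3) ** 3      # treat each grain as a tiny cube
        grains_per_bucket = bucket_m3 / grain_m3
        print(grain_mm, round(synapses / grains_per_bucket))  # buckets needed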

---

Ultimately I think these numbers do matter. As chips approach the scale of human brains, while running something like 1,000,000,000x faster, you remove the questions of engineering and scale, and are left with just research.


We don't know which numbers are the ones we need to care about. How much do chemical gradients matter? Inhibitors? Hormones? What other factors are at play?

I read so many people who have simplified views of both DNNs and real neurons, and they get overly excited about the possibility of somehow getting to AGI just by having enough transistors. There was similar thinking in the 1950s when proto-AIs could play chess.

Cognition is extraordinary. The theory of mind encompasses many important features that we have no idea how to approach.

To say all that’s left is research doesn’t really say much at all. It’s like saying for teleportation all that’s left is research as we have lots of bandwidth, or that for month-long battery life all that’s left is research as we have the phones ready to take them.

The gap between our knowledge of the brain and what's being applied to machine learning is a big one.


Brains run on computation. Computation is fungible. The highest bandwidth signal in the brain is through neuron spiking, and there isn't much space for other mechanisms to hide. Take a simple 10 virtual neurons:1 synapse ratio as a rough upper bound of how much computation the brain can be doing, and by necessity the rest is architecture.
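
One way to put a loose number on that bound (the synapse count, firing rates, and my reading of the 10:1 factor are all rough assumptions):

    # Loose upper-bound sketch; all figures are rough assumptions.
    synapses = 1.0e14          # ~100 trillion synapses
    fudge    = 10              # the "10 virtual neurons per synapse" factor above

    for hz in (1, 10, 100):    # plausible average firing rates
        print(hz, f"{synapses * hz * fudge:.0e}")   # ~1e15 .. 1e17 synaptic "operations" per second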

This is not at all like teleportation. It is absurd to claim numbers much larger than my own. 100:1? Where would the computation be? It is completely unnecessary to claim that the brain must have vast amounts of hidden capacity to do what it does, rather than the secret being in the ‘software’, and it goes against most of what we know about the brain, its ancestry, and computation.

This is not like teleportation. The brain isn't magic. I don't have to claim to know how the brain works to point this out.


For the in-brain computation, this may be true. However, there may be non-local quantum effects emergent from the functioning of this massively parallel biological neural network apparatus, and information could be received and transmitted outside of the five senses. Whether this information or functionality is necessary for the basic functioning of the brain, we don't know yet. I'm guessing it is, but I'm also guessing that we can supplant it with a little more computation.


And that butterfly just caused a tornado that killed the original poster.

Quantum effects, while nonlocal, are small in magnitude; not even microtubules are sensitive enough for this at the temperatures the brain runs at.

It is nigh certain that the mind does not depend on effects of this magnitude. There's way too much chaos inside even a single neuron, much less a whole network.


Is anything magic? It doesn't really matter when a thing is so inscrutable that it is indistinguishable from magic, especially if you believe that your personal subjective experience is 100% the result of brain function.


Maybe it's best to wait until after we've hammered away at the problem on warehouses of human-scale hardware for 50 years before calling it inscrutable? The paper that put neural networks on GPUs is ten years old. I get that the problem looks hard, but so do lots of things.


Please don’t take this as a personal attack. Your comment and certainty sound like religious fanaticism. The belief that consciousness is merely computation is unproven, and a matter of faith. It always seems a little short sighted when people claim to know something about consciousness and the mind based on a belief that the brain amounts to a biological computer. As an example of an alternative possibility to brain as a biological computer: in a holographic universe the brain itself is an emergent system. Or, Buddhists say that the material world as we perceive it is interdependent with mind, which is more fundamental than form. Some interesting phenomena that present problems with mind as programming running on a biological computer are the placebo/*cebo effects and Wim Hof :). Mind over matter, man.


Qualia (consciousness, feelings, senses) might be impossible to create through transistors. But it does not matter; we can build something functionally equivalent in a computable way. This is called a philosophical zombie: https://en.m.wikipedia.org/wiki/Philosophical_zombie


A computing machine (not taking environmental factors like cosmic rays into account (though, why not?)) is deterministic, unless it interfaces with a truly random source of entropy. My physicist friend has said that there may be no randomness, only complexity.

Is all deterministic or do conscious entities have free will, or...?

The decisions made by your philosophical zombie will either be predetermined by the programming, or seeded by randomness. The decisions made by such a machine cannot be proved to be functionally equivalent to the decisions made by a conscious entity, and belief in such is a matter of faith.

Belief in randomness itself is a matter of faith. It's impossible to prove that any event or measurement is truly random, and yes, there may be hidden variables; there's no way to test.

There is no rational reason to believe that conscious behavior, consciousness, feelings, and all the activity of the mind, are functionally equivalent to a program and/or randomness.

Sounds like you are a believer in the religion of materialism.


> There is no rational reason to believe that conscious behavior, consciousness, feelings, and all the activity of the mind, are functionally equivalent to a program and/or randomness.

On the contrary, the number of tasks that a human can do but a machine can't is shrinking. Recently, that started to include art.


Is everything deterministic, or do conscious entities have free will? Free will vs. determinism is a fun debate for beginners in philosophy. But it is not really a debate per se; the answer is obvious once the question is asked seriously, in a well-defined manner.

Firstly, you will agree with all scientists on earth that the world, that matter, is causal and totally predictable in each of its properties through calculation. https://en.m.wikipedia.org/wiki/Causality_(physics) Something contra-causal has never been observed in all of human history. The conception that a human (made of matter, after all) is totally deterministic is consistent and explains human behavior. It is epistemologically weak to ask for an alternative (free will) that needs another premise.

Let's define free will: "Some conceive free will to be the capacity to make choices in which the outcome has not been determined by past events." This means the free choice does not come from the past; it comes from nowhere. To make such a thing possible, a brain would need to create a primary cause (sorry, the English Wikipedia page does not yet exist: https://fr.m.wikipedia.org/wiki/Cause_premi%C3%A8re). A primary cause is a cause that comes from nowhere and creates consequences and causes that are determined by that primary cause. Only the primary cause has "free will power". So are there primary causes in the universe? Happily not, otherwise science would be broken and the world chaotic and unpredictable. There is one true candidate for a primary cause: the big bang. To think that a brain is able to break the laws of physics to create a primary cause is maybe not impossible, but it is highly unlikely, totally unscientific, and mostly religious.

Many things can be said about this premise: When does it start? Does a bacterium have this power? An ant? A fish? A cat? Or only humans? Because language allows free will? Or because language allows sufficient complexity to hide the deep causes behind our choices and create an illusion of choice? So free will has no explanatory power, determinism is the most empirically observed thing there is, and only humans are supposed to have free will (because it is trivial to model mammals deterministically), which adds even more ad-hoc-ness. Another question: what is the frequency of free will in a human? Are 80% of your thoughts free? Or pi%? More ad-hoc-ness; the higher the frequency of free will you believe in, the epistemologically weaker your belief is.

But even after all this, let's say that free will exists: what does it change? Can it allow things that determinism cannot? Firstly, free will is NOT free. Imagine you ask me to solve a problem. Let's say that knowledge about A is necessary to find a solution. If I don't know A and it is not deducible from my prior knowledge, then I can't solve the problem, free will or not. My past is insufficient, and a primary cause cannot give me knowledge (sadly ;)). Human choices and thoughts are bounded by knowledge. They are also bounded by fluid intelligence.

Humans have two goals in their lives: maximize their own happiness and maximize others' happiness. It is the only meaningful thing in life (qualia). So it is important to understand that for each situation in life there is an optimal choice, and maybe competing choices that are no better. The goal of a human is to find the most optimal choice, which requires a lot of knowledge and a lot of rationality training, e.g. learning cognitive biases, logical fallacies, skepticism, the scientific method, learning more words, etc. The more erudition and rationality you gain, the fewer choices appear as interesting as the most optimal one you know of. The ideal omniscient person has no choice to make; she knows which choice is optimal. A choice is only a choice when you don't know the consequences well enough and so you make a bet (which can, e.g., maximize risk/return or minimize it). Once you know the optimal choice, what does free will mean? It means the freedom to make an irrational choice, worse than the optimal one. Such a useless concept (when you know the optimal one).

Free will would only apply to a list of choices when you cannot rank any of those choices; you don't know if any choice from the list is better than any other. In such a (rare) situation, free will would allow true randomness of choice. What's the point? And more than that, it's refuted. Humanity has built pseudo-random number generators that are by far good enough, and if you wanted to make a random choice from such a list, you would be better off using a PRNG! Because if we empirically measured your supposedly random choices, they would not be random at all. A common demonstration of this is the game rock paper scissors, where people have great difficulty not repeating patterns.
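
As a trivial illustration of that last point (the option list is just an example; any set of equally ranked options would do):

    import random

    options = ["rock", "paper", "scissors"]   # choices you genuinely cannot rank
    rng = random.Random()                     # a PRNG is already far less predictable
    print(rng.choice(options))                # than a human trying to "choose freely"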

So I have shown 1) that free will is one of the weakest possible beliefs on earth, epistemologically; 2) that it has no explanatory power (it does not explain even one thing that determinism cannot); and 3) that it necessarily reduces to true randomness, that it would be useless (cf. "what's the point"), and that it is empirically shown to be wrong. Sadly, when you have held a strong belief for a long time, with emotional attachment, reading a sound refutation does not allow you to change that belief. You probably do not have the cognitive freedom to now think "free will doesn't exist, and if it did it would be useless" (see the irony about freedom?)[1] because human brains are buggy. Firstly, when not trained, we have an inability to see (in others and in ourselves) logical fallacies. https://en.m.wikipedia.org/wiki/List_of_fallacies

And we cannot see our own cognitive biases (https://en.m.wikipedia.org/wiki/List_of_cognitive_biases), which is in itself a bias, called the bias blind spot.

[1] This applies less, or not at all, if you are trained in rationality. Lesswrong.com, Rationalwiki.org, and Wikipedia are good places to start.

"Sounds like you are believer in the religion of materialism." You got me! Well to be precise, materialism is now called physicalism and it is synonymous to be a believer of the scientific method. I'll take that as a compliment. But it is anthitetic to" religion", Popper would have a heart attack reading your sentence ^^ You know what, you are a physicalist too even if you ignore it. All human progress is driven by science, it would be time to recognize that.


Using fine sand I make that out to be about 6 buckets, give or take. Besides, this isn't about whether it is 6, 10 or even 100 (or 1000) buckets; it is about how the number itself does not make a huge difference. Before, if you wanted 1.2T transistors you needed 50 or so regular large dies, and presumably there have already been dies with more than the 21 billion that I'm referring to here (the NVidia 5K-core chip).

If that was all it took then we'd have had 'human brains' in a box long ago, after all, 50 GPUs is something that isn't all that rare in plenty of institutions.


There are 100 trillion synapses in the brain.

I'm not claiming Cerebras is human scale; I think it's a factor of 100-1000 off personally. I'm just saying I don't think the comparison is meaningless.


Time will tell. I personally put the time between now and AGI well into the decades, possibly more than a few centuries. This is not a mere matter of engineering; it is one of fundamental understanding, of which we have likely achieved only very little.

Imagine seeing the wiring diagram of an alien computer with parts on the order of 6 orders of magnitude smaller than the ones that you are familiar with without knowing the contents of its memory and then you are expected to make a working copy. Good luck.


I have written a book about the true computational power of pyramidal neurons. Spoiler: they are A LOT more powerful than transistors. The book is free @ http://www.corticalcircuitry.com/


I'm a few pages in and have found it fascinating, humorous, compelling and understandable even with zero background in the field. Thanks for sharing.


Thank you for the kind words!


I made the mistake of starting to read this in bed last night and got totally engrossed. Looking forward to finishing it off over the next few days. Many thanks for writing this and sharing it.


I have started reading the first two chapters of your book and I really like it. It's well written and easy to understand. If it continues to be this wonderful for the remaining chapters I will buy it as well. Thanks for writing it.


I haven't seen a perceptron circuit done in fewer than 4 transistors plus a few diodes. However, these chips are almost certainly digital and will use a lot more to perform the floating point math.

Single transistors can’t even compute our simplified model of a neuron let alone the complexities of the real thing.
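
For context, the simplified model being referred to is essentially a perceptron: a weighted sum followed by a threshold. A minimal sketch, with arbitrary example numbers:

    # The "simplified model of a neuron" above: a perceptron.
    # Weights, inputs, and bias here are arbitrary example values.
    def perceptron(inputs, weights, bias):
        activation = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1 if activation > 0 else 0       # hard threshold on a weighted sum

    print(perceptron([1, 0, 1], [0.5, -0.4, 0.3], -0.6))   # -> 1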


If we're going to talk about net computational power per unit (neuron/transistor), we also need to note that single transistors do stuff a whole lot faster than biological neurons, too. If we had neurons that were 1,000x faster, could we use fewer of them for the same task?


Well, we can turn that argument around: if an artificial brain could be faster than a real one, does that mean that right now we have slow artificial brains that are just as good as the real ones? The answer being an obvious 'no' may help answer your question.


Just because we don't know a program that creates a decent artificial brain doesn't mean that such a program doesn't exist for the hardware we have.

We're almost certainly going to be able to make speed/size trade-offs in AGI, because we're used to making them in all other computing.

Even biological systems make these types of trade-offs. We're beings optimized for the environment we evolved in, not optimized for "smarts". There's many "choices" in energy balance, reaction time, etc, that have been made.

Transistors are really powerful and fast computing devices; really our use of them in digital logic in some ways greatly under-utilizes them for ease in design and predictability. Saying that it takes 4 transistors to make a really simple neuron analog in a circuit undersells that those 4 transistors can operate many, many orders of magnitude faster.


Faster than what? There are a hundred trillion atoms in a neuron operating in a coordinated fashion to create its global behavior in the context of the brain. How many interactions between those components occur in a billionth of a second? Many.


It would be very hard to tell if an AGI running 1000x slower than a brain were intelligent. Human babies don't talk for about a year, but nobody has tried firing up a putative slow AI and waiting for 1000 years.


Then let's run 1000 of them in parallel, right?

You're on to the exact point I'm trying to make. The whole idea that 'just' complexity or 'just' speed is enough to make something intelligent that is not yet intelligent is off-base to a degree that resorting to metaphors to try to explain why also breaks down.

I liken intelligence to life: once it exists it is obvious and self-perpetuating, but before it exists it is non-obvious and no amount of fantasizing about what it would take to make it will get you incrementally closer until one day you've got it. But that day will start out just like the thousands of days before it. It's a quantum leap, a difference in quality, not an incremental one. Yes, enough orders of magnitude of change can make a quantitative change into a qualitative one. But so far the proof that this holds for intelligence is eluding us. Maybe we're just not intelligent enough, in the same way that you can't 100% describe a system from within the system.


But it doesn't parallelize. Watching 1000 babies for their first 0.365 days isn't going to reveal their potential.

More practically, deep learning experiments were tried in the 80s, but the results weren't encouraging. If they'd left the experiments running for a sufficiently long time, they might have gotten great results. But they gave it a few days, saw the loss rate plateau and hit ^C, and that was that.

Most likely, the early deep learning experiments had some parameters wrong. Wisdom about choosing parameters only came once people could run multiple experiments in a week. So it's likely the same with some hypothetical AGI. Early experiments will have some parameters wrong, and it'll either go into seizures or catatonia. We won't be able to get the parameters right until we can run experiments faster than human brains. Say, simulating 5 years of life in a week, or 250x real time.


Wow, that is a thought I didn't have before. Thanks!

Still, we wouldn't even know what real time is, or what the "x" is, rather, until after we have achieved AGI. Also a mind bender.


Right, we have no idea how much computation it takes. Discoveries usually start with a less efficient algorithm, and then the performance gradually improves once we know what it needs to do. We might need 1000x the compute power to discover something as it'll eventually need once we deploy it. So I'm a big fan of ridiculously fast hardware like this article describes.


Yes, we are indeed in violent agreement.


If it's possible to implement AGI on a computer, a 4004 with enough RAM attached can do it... very slowly. So the speed/size trade-off is implicit unless we take a pseudo-spiritualist view that computing machines just can't be intelligent.

There's two big options out there: either it requires much, much more computation than we can readily employ even to do it relatively "slowly", or we don't know how.

The former has a little bearing on the balance between a transistor's computational capabilities and a neuron's, but even so is largely orthogonal.

As to parallelism, it doesn't work that way. If you get nine women pregnant, you don't get a baby per month. If we've made a working but far-too-slow-for-us-to-realize-it AGI, making more of them doesn't help us understand the problem.


> we don't know how

Exactly.


I think that part of the idea of adding more speed and data to solve the AI problem started with Norvig's presentations on how more data resolved the search problem better than using more clever algorithms.

People are using the 'more data' approach because it makes a (minor) dent at the problem. It's the only tool they have right now.

It is my opinion that it is not enough to make us smarter in understanding the quantum leap necessary.


> If we had neurons that were 1,000x faster, could we use fewer of them for the same task?

Absolutely. In fact, we do: in batched processing (pervasive in deep learning today), CNNs and other weight sharing schemes (e.g. transformers) the same "neurons" get re-used many times.
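
A minimal sketch of what that re-use looks like in practice (the shapes below are arbitrary):

    import numpy as np

    # One small layer of "neurons": a single weight matrix...
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 32))       # 64 inputs -> 32 "neurons"

    # ...re-used across a whole batch of inputs in one shot (batching),
    batch = rng.standard_normal((128, 64))
    out = batch @ W                         # the same 32 neurons applied 128 times

    # ...and re-used again at every position of a sequence (weight sharing).
    sequence = rng.standard_normal((10, 128, 64))
    out_seq = sequence @ W                  # same weights at all 10 positions
    print(out.shape, out_seq.shape)         # (128, 32) (10, 128, 32)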


Those aren't neurons, though. Biology is different, synaptic connections have to physically fit, etc. Individual neurons serve multiple purposes, but I don't think we know if their speed is a limiting factor in that.


You're mixing terminology.

A neuron is not a synapse.

- There are ~100 billion neurons in the human brain

- There are 100-1000 trillion synapses in the human brain

There's strong evidence to suggest that each synapse is more akin to a digital perceptron than a neuron is. Synaptic cleft distance, transmitter types, dendritic structure, reuptake, etc are all factors that can allow for some level of long-term storage and mediation of subsequent neural activation.

[Edit] Maybe not "mixing" but seemingly comparing apples and oranges.


It takes around 15-20 transistors to make a simple neuron simulation:

https://www.quora.com/How-many-transistors-can-be-used-to-re...

But: "Synapses are usually separately modeled in transistors (they are not part of the neuron circuits described above) and dramatically add to the transistor count."

"[A neuron] has on average 7000 synaptic connections to other neurons."

So we're not there yet, but it's definitely an improvement!
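
Extrapolating from those figures (the per-synapse transistor count below is my own guess, so treat the total as an order of magnitude at best):

    # Rough extrapolation from the numbers quoted above.
    neurons              = 8.6e10    # ~86 billion neurons (estimate)
    transistors_per_cell = 20        # "15-20 transistors" for a simple neuron circuit
    synapses_per_neuron  = 7000
    transistors_per_syn  = 10        # assumption; the source only says they "dramatically add"

    total = neurons * (transistors_per_cell + synapses_per_neuron * transistors_per_syn)
    print(f"{total:.1e}")            # ~6e15 transistors, vs 1.2e12 on this chip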


Wafer scale integration, the stupid idea that just won't die :-)

Okay. I'm not quite that cynical but I was avidly following Trilogy Systems (Gene Amdahl started it to make supercomputers using a single wafer). Conceptually awesome, in practice not so much.

The thing that broke down in the '80s was that different parts of the system evolve at different rates. As a result, your wafer computer was always going to be sub-optimal at something, whether it was memory access or an I/O channel standard that was new. Changing that part meant all new wafers, and fab companies knew that every time you change the masks, you have to re-qualify everything. Very time consuming.

I thought the AMD "chiplet" solution to making processors that could evolve outside the interconnect was a good engineering solution to this problem.

Dave Ditzel, of Transmeta fame, was pushing at one point a 'stackable' chip. Sort of 'chip on chip' like some cell phone SoCs have for memory, but generalized to allow more stacking than just 2 chips. The problem becomes getting the heat out of such a system as only the bottom chip is in contact with the substrate and the top chip with the case. Conceptually though, another angle where you could replace parts of the design without new masks for the other parts.

I really liked the SeaMicro clusters (tried to buy one at the previous company but it was too far off axis to pass the toy vs tool test). Perhaps they will solve the problems of WSI and turn it into something amazing.


Given these chips are purely meant for machine learning, the issue that "your wafer computer was always going to be sub-optimal at something" is less of an issue now than in traditional scientific programming setups, especially compared to the Trilogy / Transmeta days.

You have silicon in place to deal with physical defects at the hardware level.

You have backprop / machine learning in place to deal with physical deficiencies at the software level.

The programmer operates mostly at the objective / machine learning model level and can tweak the task setup as needed softly influenced by both the hardware architecture (explicit) and potential deficiencies (hopefully implicit).

The most extreme examples I've seen in papers or my own research: parts of a model left accidentally uninitialized (i.e. the weights were random) not impacting performance enough to be seen as an obvious bug (oops), tuning the size and specifics of a machine learning model to optimize hardware performance (i.e. the use of the adaptive softmax[1] and other "trade X for Y for better hardware efficiency and worry about task performance separately"), and even device placement optimization that outperforms human guided approaches[2].

Whilst the last was proven across equipment at a Google datacenter with various intermixed devices (CPUs, GPUs, devices separated by more or less latency, models requiring more or less data transferred across various bandwidth channels, ...) it's immediately obvious how this could extend to optimizing the performance of a given model on a sub-optimal wafer (or large variety of sub-optimal wafers) without human input.

Whilst reinforcement learning traditionally requires many samples that's perfectly fine when you're running it in an environment with a likely known distribution and can perform many millions of experiments per second testing various approaches.

For me, working around wafer-level hardware deficiencies with machine learning makes as much or more sense than MapReduce did for working around COTS hardware deficiencies. Stop worrying about absolute quality, which is nearly unattainable anyway, and worry about the likely environment and task you're throwing against it.

[1]: https://arxiv.org/abs/1609.04309

[2]: https://arxiv.org/abs/1706.04972


I was being facetious:

I was working right next to Andy Grove and other legendary CPU engineers at Intel's bldg SC12....

I was young and making comments about "why can't we just do this, or that"

---

it was my golden era.

Running the Developer Relations Group (DRG) game lab with my best friend Morgan.

We bought some of the first 42" plasma displays ever made.... and played quake tournaments on them...

We had a T3 directly to our lab.

We had the first ever AGP slots, the first test of the unreal engine...

We had SIX fucking UO accounts logged in side-by-side and ran an EMPIRE in UO.

We would stay/play/work until 4am. It was fantastic.

---

We got the UO admins ghosting us, wondering how we were so good at the game (recall, everyone else had 56K modems at best... we were on a T3 at fucking Intel)...

We used to get yelled at for playing the music too loud....

---

Our job was to determine if the Celeron was going to be a viable project via playing games and figuring out if the SIMD instructions were viable... to ensure there was a capability to make a sub-$1,000 PC. Lots of ppl at Intel thought it was impossible...

GAMES MADE THAT HAPPEN.

Intel would then pay a gaming/other company a million dollars to say "our game/shit runs best on Intel Celeron Processors" etc... pushing SIMD (hence why they were afraid of Transmeta, and AMD -- since AMD had already won a lawsuit that required Intel to give AMD CPU designs from the past)....

This was when they were claiming that 14nm was going to be impossible...

What are they at now?


> Dave Ditzel, of Transmeta fame, was pushing at one point a 'stackable' chip. Sort of 'chip on chip' like some cell phone SoCs have for memory, but generalized to allow more stacking than just 2 chips. The problem becomes getting the heat out of such a system as only the bottom chip is in contact with the substrate and the top chip with the case. Conceptually though, another angle where you could replace parts of the design without new masks for the other parts.

I'm imagining alternating between cpus and slabs of metal with heatpipes or some sort of monstrous liquid cooling loop running through them.


Actually diamond is a really good heat conductor. Dave's big idea was that the actual "thickness" of the chip that was needed to implement the transistors was actually quite thin (think nanometers thin) and that if you put one of these on top of another one, you could use effects like electron tunneling to create links between the two, and even bisected transistors (where the gate was on one of the two and the channel was on the other).

So here is an article from 4 years ago on diamond as a substrate: https://www.electronicdesign.com/power/applications-abound-s... which talks about how great it is for spreading heat out. And one thought was that you would make a sandwich of interleaved chip slices and diamond slices, still thin enough to enable communication between the two semiconductor layers without having to solder balls between them.

In that scenario, the package would have an outer ring that would clamp into contact with the diamond sheets to pull heat out of everything to the case.

Of course the mass market adopted making chips really thin so that you can make a sleek phone or laptop. Not really conducive to stacks in the package. Perhaps that was the thing that killed it off.


Low-Cost 3D Chip Stacking with ThruChip Wireless by Dave Ditzel

https://www.youtube.com/watch?v=S-hBSddgGY0

Fascinating idea, no idea about the practicalities.


Wow, thanks for finding that!


Haven't heard Transmeta mentioned in a LONG time.

I recall when I was at Intel in 1996 and I used to work a few feet from Andy Grove... and I would ask naive questions like

"how come we cant stack mutiple CPUs on top of eachother"

and make naive statements like:

"When google figures out how to sell their services to customers (GCP) we are fucked" (this was made in the 2000's when I was on a hike with Intels then head of tech marketing, not 1996) ((During that hike he was telling me abt a secret project where they were able to make a proc that 48 cores)) (((I didnt believe it and I was like "what the fuck are they going to do with it)))?? -- welp this is how the future happens. and here we are.

and ask financially stupid questions like:

"what are you working on?" response "trying to figure out how to make our ERP financial system handel numbers in the billions of dollars"

I made a bunch of other stupid comments... like "Apple is going to start using Intel procs" and was yelled at by my Apple fan boi: "THAT'S NEVER GOING TO FUCKING HAPPEN"

But im just a moron.

---

But transmeta... there was a palpable fear of them at intel at that time....


A lot of the transmeta crowd, at least on the hardware side, went on to work at a very wide variety of companies. Transmeta didn't work, and probably was never going to work, as a company and a product, but it made a hell of a good place to mature certain hardware and software engineers, like a VC-funded postdoc program. I worked with a number of them at different companies.


I was at Transmeta. It was good for my career!


So what specifically are you doing now??


I was a QA Engineer at Transmeta, got an MBA, worked in tech recruiting for several years, and now I'm a software engineer.


click on their profile


New improvements (Zeno Semi) do talk about 1T SRAM, and a DRAM chiplet solution will have the same limits in memory size as wafer-scale, so maybe the SRAM vs DRAM gap will be close enough?

And as for I/O, maybe it's possible (thermally) to assemble this on an I/O interposer?


When I speculated that the brain might be partly general-purpose analog, this wafer-scale project was one of the ones I found while digging for research along those lines:

http://web1.kip.uni-heidelberg.de/Veroeffentlichungen/downlo...

Pretty neat. I don't know how practical.


For cooling, maybe a phase-change liquid such as Novec would work; then you would just need each chip to be exposed to the Novec, and your packaging could be much smaller without heatsinks.


An amazing stride in computing power. I'm not convinced that the issue is really about more hardware. While more hardware will definitely be useful once we understand AI, we still don't have a fundamental understanding of how AGI works. There seem to be more open questions than solved ones. I think what we have now is best described as:

"Given a well stated problem with a known set of hypotheses, a metric that indicates progress towards the correct answer, and enough data to statistically model this hypothesis space, we can efficiently search for the local optimum hypothesis."

I'm not sure doing that faster is going to really "create" AGI out of thin air, but it may be necessary once we understand how it can be done (it may be an exponential time algo that requires massive hardware to execute).


Exponential-time algos aren't really practical on any hardware except quantum computers, and even then only for algorithms that benefit from quantum speedup, such as the quantum Fourier transform[1]. The quantum Fourier transform goes from O(n·2^n) on a classical computer to O(n^2) on a quantum computer.

[1] https://en.wikipedia.org/wiki/Quantum_Fourier_transform
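
To make those complexities concrete (constants dropped, so purely illustrative):

    # Operation counts for a transform on N = 2^n points, per the
    # complexities quoted above (constant factors ignored).
    for n in (10, 20, 40):
        classical = n * 2 ** n      # classical cost, O(n * 2^n)
        quantum   = n ** 2          # QFT gate count on n qubits, O(n^2)
        print(n, f"{classical:.1e}", quantum)   # e.g. n=40: ~4.4e13 vs 1600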


If AGI is a halting oracle, then hardware performance is irrelevant.


What's there to understand? We know how intelligence "happened". We just need to build a few million human-sized brains, attach the universe, and simulate a few billion years of evolution. Whatever comes out of that could just be called "AGI" by definition.


I mean, say, photosynthesis also happened by evolution. There was still a lot to understand about photosynthesis, and we do understand it now, although it took decades of research.


Sure, but we understand enough so that we can recreate photosynthesis in a lab. There are many drugs which do work, but we don't know exactly how they work.

The point is, it's likely not a necessary precondition to understand AGI in order to create it.


We don't know that's how intelligence happened. That's just speculation based on materialist assumptions. Intelligence is most likely immaterial, i.e. abstract concepts, free will, mathematics, consciousness, etc. In which case, it's beyond anything in this physical universe.


> it's beyond anything in this physical universe

Your way of thinking leads to dualism, which has been proven to be a bad approach to the problem of consciousness. Dualism doesn't explain anything, just moves the problem outside of the 'material' realm into fantasy lala-land.


Hey, if materialism cannot explain reality, then why stick with a bad hypothesis? Sticking your fingers in your ears and calling any alternative 'lala-land' sounds pretty anti-intellectual.


I am not a materialist, I am a physicalist. And la-la land is just a fitting metaphor for thinking that conscious experience is explained by a theory that can't be proven or disproven. If you think your consciousness is received by the brain-antenna from the macrocosmic sphere of absolute consciousness (or God, or whatever you call it), then remember your mother and father, and the world that supported you and fed you experiences. They are responsible for your consciousness. Your experience comes from your parents and the world, and you are missing the obvious while embracing a nice fantasy. And if you don't make babies you won't extend your consciousness past your death. Your parents did, and here you are, all conscious and spiritual.


How do you prove/disprove physicalism?


> Intelligence is most likely immaterial, i.e. abstract concepts, free will, mathematics, consciousness, etc. In which case, it's beyond anything in this physical universe.

For the purposes of my argument, intelligence is a set of behaviors that would be described as "intelligent" by our contemporaries. I have no use for an inscrutable philosophical definition of intelligence, consciousness, free will, or any of that stuff.

I'm convinced that these behaviors arose as a result of natural physical processes and that if we were to reproduce them, we would "most likely" receive similar results. We can already observe this process by simulating simpler life forms[1].

Of course that's speculation, but it has better foundations than "it's beyond anything in this physical universe". That's the scientific equivalent of "giving up".

[1] http://openworm.org/


Why is that the same as 'giving up'? I call it a better hypothesis. It's like saying there is a halting problem and there are problems that are fundamentally unsolvable by computers. If true, it's pointless trying to create an algorithm to solve the halting problem. Instead, perhaps the mind is a halting oracle, and we can progress even further in the sciences by that realization.


> Why is that the same as 'giving up'? I call it a better hypothesis.

It's not a hypothesis, because it doesn't explain anything. It's also not falsifiable, so it is not useful for the purposes of science.

> It's like saying there is a halting problem and there are problems that are fundamentally unsolvable by computers

The halting problem is a rigorously defined mathematical problem which has a mathematical proof that demonstrates that it is indeed unsolvable. That's entirely different from saying that "there might be such a thing as a halting problem and it's probably unsolvable".

Furthermore, the halting problem is abstract from physical reality. It is true regardless of anything observed in physical reality. In physical reality, there is no such thing as a program that will not terminate, because there is no such thing as a computer that runs forever. The "physical halting problem" is solvable. The answer is always: The program will halt at some point because of thermodynamics.

Similarly, it is my position that the physical phenomenon of intelligence can be entirely explained by natural processes involving no metaphysics whatsoever.

> Instead, perhaps the mind is a halting oracle, and we can progress even further in the sciences by that realization.

Perhaps it is, but there is no such realization. You haven't even defined what you mean by intelligence and why that would even require a non-materialist explanation.


Well then, how is materialism falsifiable? If it isn't, how is it a scientific hypothesis?


> Well then, how is materialism falsifiable? If it isn't, how is it a scientific hypothesis?

It isn't a scientific hypothesis.

You use the word "materialism" to segregate yourself philosophically, but I haven't made a philosophical argument.

Let's suppose your "hypothesis" is true, but given that it is beyond physical observation, we can never hope to know that it is true or even know the probability of it being true. Then what would be the benefit in assuming that it is true?

We could conceivably save some time on trying to figure out AGI and instead do something "more productive". However, that's true of almost any endeavor. Anything you do in life has a chance of failure and an opportunity cost. Just imagine what could have become of us, had we not bothered to write HN comments!


Ok, then if materialism isn't a scientific hypothesis, why is it scientific?


I don't know what "materialism is scientific" would mean exactly. I never made that claim, so don't ask me to defend it.

Remember, you are the one who wants to fight this "materialism vs. dualism" battle, not me. I'm not saying your position is wrong. I'm saying your position seems inconsequential at best, but harmful at worst.

To illustrate this, let's take a hypothetical "dualist" explanation for disease:

"Disease is caused by spirits that attach to our bodies, but we cannot observe these spirits by any physical means and we can never interact with them in any distinguishable way."

In other words, we cannot perform experiments, we cannot learn anything, so these spirits might as well not exist at all. There is no practical difference.

Now that's all fine and well until you realize that disease is actually caused by physical processes that you can understand and intervene with. Had you been convinced that it was spirits all along, you wouldn't have bothered trying to figure out the physical processes.

Today this example may sound ridiculous, but for most of history similar convictions held people back from turning quackery into real medicine.


And on the materialist side we have similar quackery with phrenology, bogus medicines, Darwinian evolution, and the like. There is quackery everywhere. The fact that one can concoct a false explanation within either materialism or dualism means such an example doesn't help determine which paradigm is better.

The important question is whether one paradigm allows us to explain reality better than the other, and dualism does this very well. There is no materialistic explanation for consciousness, free will, abstract thought or mathematics. Instead, those who follow materialism strictly end up denying such things, and we end up with incoherent and inaccurate theories about the world.


> And on the materialist side we have similar quackery with phrenology, bogus medicines...

Yes, but the key distinction here is that phrenology or bogus medicine are testable. If you claim phrenology predicts something but then statistics show that this isn't the case, phrenology is proven wrong. It's more difficult with medicine, but it's at least possible.

Furthermore, we know that many of our "materialist" theories in physics are wrong. We know it because they disagree with experiment. However, they still have enough predictive power to be very useful.

> ...Darwinian evolution, and the like.

I'm not sure what you mean here. The principles behind Darwinian evolution are easily reproduced. Re-creating billions of years of evolution in a lab would be difficult, of course.

> The important question is whether one paradigm allows us to explain reality better than the other, and dualism does this very well.

I don't think it does it well at all. An explanation that you can not test isn't useful. In fact, a wrong explanation that you can test is more useful.

There's an infinite number of "dualist" explanations, all of which we cannot test. So which one to choose? Why stop at dualism, why not make it trialism? Why not infinitism?

Why not just say that there's an infinite number of intangible interactions, that every mind in the universe is connected with every other mind through a mesh spanning an infinite-dimensional hyperplane? You can't prove that this isn't the case, why not believe that one instead?

> There is no materialistic explanation for consciousness, free will, abstract thought or mathematics.

Well, so what? All of the "dualist" explanations are equally useless, so I might as well do without any explanation whatsoever.

> Instead, those who follow materialism strictly end up denying such things, and we end up with incoherent and inaccurate theories about the world.

Inaccurate and incoherent theories that are testable can nevertheless be very useful. Coherent theories that are untestable are useless, unless maybe you can turn them into a religious cult.


Why do you assume dualism is not testable?

Also, Darwinian evolution is well known to be false. Just read any modern bioinformatics book. There are a whole host of other mechanisms besides Malthusian pressure, random mutation and natural selection that are used to explain evolution nowadays. As Eugene Koonin says, the modern synthesis is really a sort of postmodern evolution, where there is no ultimate explanation for how it works. Koonin even promotes a form of neo-Lamarckianism. Darwin's unique contributions to the theory of evolution have been experimentally discredited. You can easily disprove Darwin yourself with all the genomic data that is online these days.


> Why do you assume dualism is not testable?

I assume that your conception of dualism is not testable because, in your own words, it is "beyond anything in this physical universe".

If it's beyond the physical universe, it cannot be observed and therefore it can not be tested. Otherwise, it would be part of physics and the physical universe just like gravity, if only hitherto unknown.

> Also, Darwinian evolution is well known to be false.

So is the general theory of relativity. Yet, it's "right enough" to allow us to make sufficiently accurate predictions about the physical world.

Furthermore, one theory being wrong doesn't make other theories "more right".

> There are a whole host of other mechanisms besides Malthusian pressure, random mutation and natural selection that are used to explain evolution nowadays.

As you say yourself, that's besides random mutation and natural selection, not instead. It would be a miracle if somehow Darwin could have gotten all of the details right with the tools available to him.

Also, in science everything is an approximation to some degree, there are always factors you disregard so that you can actually perform a prediction in a finite amount of time. There's variance and uncertainty in every measurement.

> As Eugene Koonin says, the modern synthesis is really a sort of postmodern evolution, where there is no ultimate explanation for how it works.

There's no "ultimate explanation" for anything. It's turtles all the way down.


It doesn't follow that for something to be observed it must be part of physics. If the physical universe is a medium, like our computers, they can transmit information from other entities without the entities themselves being embedded in the computers. It sounds like your argument begs the question by first assuming everything that interacts with us must be physical.

Darwin's mechanisms do not explain anything in evolution. All of his mechanisms select against increased complexity and diversity, so are useless to explain the origin of species as he originally claimed. Darwinian evolution is dead and died a long time ago. Modern evolution is very much non-Darwinian.


> It doesn't follow that for something to be observed it must be part of physics.

It's the other way around. If it can be observed, it interacts with matter. If it interacts with matter, it is within the domain of physics, by definition. I don't understand why you have a problem with this.

Physicists are well aware that we do not understand all the interactions and we do keep discovering more and more interesting phenomena, such as quantum entanglement.

> All of his mechanisms select against increased complexity and diversity, so are useless to explain the origin of species as he originally claimed.

I don't know where this criticism comes from, but it sounds like a straw man. Natural selection may select against "complexity and diversity", but random mutation puts "complexity and diversity" back in the game.


I think you are equivocating between the laws that govern material interaction and things that interact with matter. They are not the same thing. The laws of physics are developed by breaking matter down to its most uniform and granular components, and then characterizing their interaction. An immaterial soul would not be captured by such an analysis, but its interaction with matter could still be empirically identified.

As for evolution, I see where you are coming from. Randomness, like flipping a coin, is always complex and different. However, if there are only a few specific, long coin sequences you are trying to construct, then flipping a coin does not get you there. There are just too many possibilities within a couple hundred coin flips to check within the lifespan of the universe. And, if you have one of these sequences, then randomly flipping some of the coins will destroy the sequence. Just think about what happens if you flip random bits in a computer program. For the most part, unless you get really, really lucky, it will destroy the computer program. So, while a few random mutations may be very lucky and flip the right nucleotides to create new functionality, the vast majority of mutations are destructive, and will kill off the species before there is the chance to evolve new functionality.


> I think you are equivocating between the laws that govern material interaction and things that interact with matter. They are not the same thing. The laws of physics are developed by breaking matter down to its most uniform and granular components, and then characterizing their interaction. An immaterial soul would not be captured by such an analysis, but its interaction with matter could still be empirically identified.

If an "immaterial soul" interacted with matter in an empirically identifiable way, it is part of physics. The forces that interact with matter are themselves not made out of matter. To be precise, matter is only that which has a mass, but there are also particles that don't have mass. Their interactions are nevertheless part of physics and we can measure them at least indirectly. They're not abstract constructs, they're very much part of the physical universe.

In that sense, perhaps the term "materialism" is misleading, because ultimately physics is about forces, not matter. Remember, I'm not using that term for myself.

> However, if there are only a few specific, long coin sequence you are trying to construct, then flipping a coin does not get you there. There are just too many possibilities within a couple hundred coin flips to check within the lifespan of the universe.

There isn't just a single coin though. There's about 10^46 molecules in the ocean[1]. That's a lot of interactions over the span of billions of years.

> And, if you have one of these sequences, then randomly flipping some of the coins will destroy the sequence. Just think about what happens if you flip random bits in a computer program. For the most part, unless you get really, really lucky, it will destroy the computer program.

It depends on the bits flipped. Bits flip all the time[2] and people rarely notice, because they're not necessarily important bits. It also depends on whether the program will abort upon error detection. Without memory protection by the operating system, most programs wouldn't terminate, they'd keep trucking along, perhaps producing some garbage here and there. In fact, the difficult part about memory corruption bugs is that the program often won't terminate until well after the corruption has taken place.

Also, computer programs aren't like organisms exposed to nature. In nature, it's possible that a mutation kills you, but it's far more likely that some physical process or another organism kills you.

> So, while a few random mutations may be very lucky and flip the right nucleotides to create new functionality, the vast majority of mutations are destructive, and will kill off the species before there is the chance to evolve new functionality.

The vast majority of mutations are relatively inconsequential, at least for short-term survival. Our DNA mutates all the time, but also we have evolved error correction, which you probably also will not accept as evolving "by random mutation and selection".

[1]https://www.quora.com/How-many-water-molecules-are-in-the-Ea...

[2]https://en.wikipedia.org/wiki/Cosmic_ray#Effect_on_electroni...


So, it doesn't really seem we are disagreeing, it's just a matter of terminology. You seem to want to call every interacting thing 'physics', which you are free to do, but then I'm not sure what the value of the term is. And, you agree we already empirically measure many immaterial things. So you seem to agree that in theory an immaterial soul is an empirically testable hypothesis, in which case I'm not sure what your objection is.

As for the number of interactions, we have trouble reasoning about large numbers. Billions of years and molecules sound like unimaginably large numbers and of similar magnitude to trillions or decillion, even though the latter are many orders of magnitude greater. DNA sequences can be hundreds of billions of base pairs long. So, if we could only depend on random mutation and natural selection, we'd need 4^10^11 attempts to hit a particular sequence, which is more trials than even a multiverse of universes can offer.


> So, it doesn't really seem we are disagreeing, it's just a matter of terminology. You seem to want to call every interacting thing 'physics', which you are free to do, but then I'm not sure what the value of the term is.

Everything that interacts with matter is in the domain of physics. So if your concept of a "soul" (or whatever makes you a "dualist") can at least in principle be observed, we can count it in. It's then not "beyond this universe".

It would also raise a lot of questions: Are souls "individuals"? If so, does every organism afford a soul? If so, where do new souls come from when the amount of organisms increases? Where do they go if they decrease? Is there really only one soul spanning all organisms? How can we test for any of these things? What are the consequences?

It's perfectly fine to explore such questions, but I am personally not convinced that something like a soul - within or outside the realms of physics - is at all necessary to explain life or intelligence as the phenomena we can already observe. That's what we disagree on.

> DNA sequences can be hundreds of billions of base pairs long.

Perhaps, but the simplest lifeforms alive today have on the order of hundreds of thousands of base pairs.

> So, if we could only depend on random mutation and natural selection, we'd need 4^10^11 attempts to hit a particular sequence, which is more trials than even a multiverse of universes can offer.

Yes, but DNA didn't just form spontaneously, fully assembled. It formed from simpler precursors, which formed from simpler precursors still, all the way down to proteins which formed from the simplest of molecules. Those precursors don't survive because they can not compete with their successors. They could be forming in the oceans right now, but they won't progress because they'll just get eaten.

There have been experiments done reproducing it up to the "protein formation" step. Anything more complex than that is likely going to take too much time to result in a new lifeform - especially one that could survive ex-vitro.


Even staying within materialism, it isn't clear that 'everything is physics'. For example, our computers operate according to the laws of physics, but the physical laws tell us nothing about how they operate. To understand how computers operate, we need to know a lot of extra information besides the physical laws. In other words, physical laws tell us nothing about physical conditions. This is why there are many other scientific disciplines besides physics. So, since this notion of 'everything is physics' doesn't even work within materialism, it is hard to see why it would exclude immaterial entities.

And as you point out, the notion that Darwinian mechanisms can account for evolution of complexity and diversity is pure speculation. Which is why modern evolution theory does not use Darwin's theories. It uses mechanisms that we can see operating in the lab and in bioinformatics, such as bacterial horizontal gene transfer and empirically calculated substitution matrices. And the further that bioinformatic algorithms diverge from Darwin's ideas, the better they perform.


Whacking your head with a hammer seems to take chunks of it away, though, which is a bit of an argument against it being immaterial. ;)


Eating a hamburger helps me grow, but that doesn't mean I am a hamburger.

Or, another analogy, if you smash my computer, I won't be able to type this response to you, but that doesn't mean I am my computer.

If the brain is an antenna for intelligence, then damaging the antenna will damage the signal, but doesn't damage the source.


No, but you're made up of the pieces that come from the hamburger.

I get that it may be comforting to think that you are, fundamentally, some incorporeal, magical entity both special and immune to the slings and arrows of outrageous fortune... but there's no evidence for this.


It seems like all the evidence is for this, and on the other hand there is no evidence for the materiality of the mind. The only reason people believe in mind == brain is due to their materialistic fundamentalism, just like the creationists force their theories onto science.


> if the brain is an antenna for intelligence

And what causes the intelligence that is simply being channeled by such a brain? Is it turtles all the way down?

Let me counter your position with another: the brain is that which protects the body and the genes. Essentially that means finding food and safety and making babies. But in order to do that we have evolved society, language, culture and technology. It's still simply a fight for life against entropy. The brain learns what actions lead to the best rewards and what actions and situations are dangerous. And if it doesn't, then death acts like a filter and improves the next generation of intelligent agents to make them more attuned to survival. And it all stems from self-replication in an environment shared with many other agents and limited resources.

You see, my short explanation covers the origin, purpose and evolution of intelligence. The 'brain is an antenna' would work only if you consider the constraints of the environment as the source and the brain as the 'receiver' of signals.


Your short explanation is full of unfounded speculation, whereas my even shorter explanation relies entirely on direct evidence everyone has access to. The only reason you feel your explanation gets a pass is because you speculate in the name of materialism, which in itself is an incoherent philosophy.


I'm pretty sure I've never even heard of such evidence, let alone seen any. Philosophy is irrelevant here; what do you propose does the thinking if not the brain? Where is the direct evidence you claim?


Consciousness, free will, abstract thought, mathematics. All things that cannot be reduced to matter. All our scientific theories are filtered through all or most of the above, so the above list is much more directly evidenced than anything the sciences say.


What do you propose does the thinking if not the brain?

The only one of those things with a concrete definition is mathematics, which we can do with computers in practice or in principle (I am a mathematician). Computers are made of matter.


Computers do not do mathematics. Rather, they compute a set of rules we give them, which may or may not be consistent. It is up to the programmer to give them rules that are consistent and correspond to some abstract mathematical concept. However, the computational rules themselves are not mathematics.

I am not sure what does the thinking, but whatever it is, it cannot be the brain if thinking consists of uncomputable and non physical faculties.


That's a rather strange and nebulous "definition" of mathematics, and quite contrary to how it actually works; the rules define the abstract concepts that build mathematics, and the rules are followed by people too whenever they do maths, because they are the maths. For example, "numbers" are an abstract mathematical concept, totally unphysical, defined purely by rules, and used by people and computers. Actual mathematics can certainly be done with computers, and is, every day, by mathematicians (among others).

There's nothing demonstrably uncomputable or unphysical about thinking. On the contrary, the brain is an enormously complex network of neurons and synapses, clearly intricate enough to physically perform all known mind functions, and all in principle simulatable on a powerful enough computer system. Your magical antenna idea has no basis in reality, it is massively outweighed by real-world evidence, and you have failed to present any actual evidence to support it.


Well, I guess that's that :)

I cited a number of first hand pieces of evidence that are more directly evident to everyone than the speculation you provide, and you accuse me of not providing any evidence. I guess there is nothing further to be said.


> Consciousness, free will, abstract thought, mathematics

- I defined consciousness in a concrete way. It's adaptation to the environment based on reinforcement learning, designed by evolution for survival

- free will - it's just as real as 1000 angels dancing on a pinhead. Nothing is beyond the physical. If it has an effect in this world, it's physical. If it doesn't, then it's just fantasy. Philosophers have tried to settle the 'mind-body problem' for hundreds of years and finally conceded that dualism is a misguided path. What you consider free will is just randomness (stochastic neural activity) filtered through experience.

- abstract thought - a form of data compression, useful for survival. We use abstractions in order to compress experience in a way that can be applied to novel situations. It would be too difficult to learn the best action for each situation, especially since many situations are novel. So we model the world, compute future outcomes before acting, then act. If it were not so we would never get to learn to drive a car because it would take too many crashes to learn driving the hard way. But we learn to drive without dying 1000 times, and we do many things with few mistakes because we can model in an abstract way the consequences of our current situation and actions.

- mathematics - a useful model we rely on, but it's not absolute. It could be formulated in different ways; the current formulation is not the only possible one, nor is it irreducible to matter. It all started when people had more sheep to count than fingers on their hands. The rest is a gradual buildup of model creation.

You are attached to a kind of transcendental thinking that is just too burdensome under Occam's razor. You presuppose much more than necessary. Human experience can be explained by the continuous loop of perception, judgement and action, followed by effects and learning from the outcomes.

Perception is a form of representation of sensorial information in an efficient and useful way. Judgement is the evaluation of the current situation and possible actions (based on instinct and past experience). Acting is just learned reflex controlled by judgement. They are all actually implemented in neural networks. They process information in a loop with the environment.

You don't need any transcendental presupposition to understand consciousness, free will, abstract thought and mathematics. Sorry to be so blunt, but we're evolving past medieval thinking into the bright future of AI, and many things that seemed magical and transcendental have been proven to be just learning (error minimisation).

I was like you once, a zealot of spiritual thinking. After many decades and life experiences I now have a much better way to grasp the situation. I don't rely on magic or divinity or anything that surpasses the physical in the way I see the world. You probably will come around at some point and realise how little explanatory power the old theories had, and that the new way of thinking is actually just as poetic as the old one. Nothing was lost, you don't need to defend the old ways. If you decide to learn more about AI, RL and game theory you will be able to philosophically appreciate the wonders of life even more than now. Thousands of years of spiritual tradition stand in contradiction to billions of years of evidence from evolution and the amazing progress of the last decades in understanding the way things work.


How can you falsify your physicalist explanations?


The brain functions under electromagnetic shielding, so are you proposing some new physical force?


I'm proposing the mind is non physical and interacts with the physical brain, analogous to a signal and antenna.


[Citation Needed]


That's precisely the point, right?


The above poster was making magical assertions as if they were well established without presenting evidence.


The evidence is literally the very things by which we perceive all other scientific evidence: consciousness, abstract thought, mathematics. All three are clearly non physical. People have to make up elaborate, incoherent explanations to try and explain how they are physical. The fact it is so hard, even impossible, to do shows the items are not physical.


Consciousness is clearly nonphysical? No offense, but that's simply nonsense. Why is consciousness modified by drugs? Why can consciousness be damaged by physical trauma? When you can interact with something in the physical world, that is a pretty strong indication that you're dealing with a physical phenomenon.

In fact, it's not even clear what a "non-physical" phenomenon is if you drill deep enough. The existence of abstractions doesn't change granular reality. For example, a neural network implemented on a computer is highly abstracted, but still implemented on physical silicon.

I suggest you look up the philosophical history of dualism - you're a bit behind.


If I turn off your computer you cannot respond to me. Does that mean you are your computer?


If I smash my computer to atoms, I've destroyed it entirely - not severed a mysterious connection to its incorporeal self.

Your assertion is essentially: "I am conscious, therefore consciousness is non-physical". The conclusion does not follow from the observation.


Returning to your counter argument, do you see the connection with my computer analogy?


Yes - your computer analogy conflated me (a human person, existing in the physical world) with a nonphysical intelligence, of which we have no evidence and which isn't even characterized in a meaningful way - it's just a placeholder for "mystical stuff I feel but don't understand".

If you want to say you believe what you do on faith, or that it's your own spiritual belief and therefore none of my business, I won't begrudge you that and I won't bother you about it. You seem to be asserting, however, that there exists empirical evidence of non-corporeal souls (though you didn't use this word). I disagree on that point, vigorously.


Not quite. Your counter argument is "doing things to the brain affects our mind, which shows our mind is physical." The analogy I offer shows that the fact doing something to X affects Y does not mean that X is Y.

So, the evidence you offer that the mind is physical does not actually show the mind is physical.


You've either failed to understand, or you're being intentionally obtuse.

> Not quite

Yes quite.

> The analogy I offer shows

It does not, for reasons I explained.

You're observing physical phenomena, and concluding there must be magic hiding behind them.


Physical and nonphysical are beside the point. The point is X influences Y does not mean Y is X.

My claim is that Y is not X.

You argue that X influences Y, therefore Y is X.

I provide a counter example that shows you cannot infer Y is X just because X influences Y. You need another premise to demonstrate from your example that Y is X.


I see, you want to (poorly and incompletely) reduce my points to a syllogism, and point out that I'm not formally proving the physical nature of the mind (which is more appropriately addressed by the last few hundred years of science than a toy logic problem).

Meanwhile, you have not presented any evidence or argument (or even actual definition) of your magical antenna hypothesis.

"You can't falsify my unfalsifyable hypothesis, therefore it must be true" is not reason, especially when your hypothesis is in no way needed to explain the oberved phenomena (and is, in fact, inconsistent with all empirical observation).

You don't get to play stupid word games and declare that therefore magic is real.

Or rather, you do, but rational people will feel free to ignore you, as I am about to do.


I offered you a list of evidence a couple times.


I'd love to see some metrics on whether this idea has any merit, because obviously any problem that can be done on one massive wafer can be done on multiple smaller chips. The question is whether the compromises needed to get something working on a massive wafer kill performance to the point where just splitting the problem up efficiently would have been better. It's also important to remember that workloads aren't just going to happen to be exactly the size of this wafer: they're either huge in scale or they're not. If they're not, you don't need this wafer; if they are, you probably need to partition your problem onto multiple copies of this wafer anyway. Take their comparison to an Nvidia GPU. It might be 57 times bigger, but I can rent 1000 Nvidia GPUs from AWS at the drop of a hat (more or less).

So yes, maybe they've done some interesting stuff, but they need some decent benchmarks to show off before we can really distinguish between whether this is the Boeing 747 Dreamliner or a Spruce Goose.


You get much higher bandwidth (at lower power cost) between chiplets than if you used a multi-chip design or even an interposer.

The drawback is that you suffer a really interesting thermal problem and have to figure out what to do with the wafer space that doesn't fit in your square -- probably creating lower-scale designs that you sell.

The second drawback is that you can't really match your available computation to memory in the same way you can with a more conventional GPU. So you have to be able to split your model across chips and train model-parallel. The advantage is that model-parallel lets you throw a heck of a lot more computation at the problem and can help you scale better than using only data parallelism.

Model-parallel training is typically harder than data-parallel because you need high bandwidth between the computation units. But that's exactly what Cerebras's design is intended to provide.
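
To make the distinction concrete, here is a minimal sketch (plain NumPy, toy sizes, not Cerebras's actual API) of why model parallelism wants a fast interconnect: the weights stay put, and only the activations move between compute units.

    import numpy as np

    # Toy model-parallel split: pretend each weight matrix lives on a different
    # compute unit; only the activation vector x crosses the interconnect.
    rng = np.random.default_rng(0)
    layers = [rng.normal(size=(512, 512)) for _ in range(4)]  # one layer per "unit"

    def forward(x):
        for W in layers:
            x = np.maximum(x @ W, 0.0)  # local matmul + ReLU on that unit
        return x                        # x is all that ever moves between units

    print(forward(rng.normal(size=(1, 512))).shape)  # (1, 512)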

You also have a yield management issue, where you have to build in the capability to route around dead chips, but that's not too nasty a technical detail. But if your "chip"-level yield (note that their chip is still a replicated array of subunits) is too low, it kills your overall yield. So they're going to be conservative with their manufacturing to keep yield high.

It's not obviously broken, but it's certainly true we need benchmarks -- but not just benchmarks, time for people to come up with models that are optimized for training/inference on the cerebras platform, which will take even longer.


Why even make the final giant chip rectangular? I get that the exposure reticles are rectangular, but since this is tiling a bunch of connected chiplets, why not use the full area of the wafer?


You need to make sure your I/O lines are imprinted near the edge of the wafer.

It's much easier to do that by putting the I/O lines at the ends of each "chip", vs. a circle that cuts through the middle of a "chip".


The problem is not whether it is bigger than Nvidia's (which is kind of a strange metric). It is when Amdahl's law kicks in.

https://en.wikipedia.org/wiki/Amdahl%27s_law
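
For reference, the formula is simple to play with; a quick sketch (the parallel fraction p and core counts below are illustrative numbers, not measurements):

    # Amdahl's law: overall speedup is capped by the serial fraction of the work.
    def amdahl_speedup(p: float, n: int) -> float:
        """p = parallelizable fraction of the workload, n = number of compute units."""
        return 1.0 / ((1.0 - p) + p / n)

    # Even with 400,000 cores, a 1% serial fraction caps the speedup below 100x.
    print(amdahl_speedup(0.99, 400_000))   # ~100
    print(amdahl_speedup(0.999, 400_000))  # ~998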


Sorry for being pedantic , but there is no Boeing 747 Dreamliner, perhaps you meant 787?


This is very neat, but I really wish the press would say "neural net" instead of "AI". "AI" just means a computer program that has some ability to reason about data similarly to a human, neural nets are a subset of that

I guess "AI" gets you the clicks though


> "AI" just means a computer program that has some ability to reason about data similarly to a human

Not even similarly to a human, especially if you talk to the average business person. Anything software does that is in some way seen as "intelligent" is seen as AI by a layperson. I recently encountered a situation where someone I worked with was demoing automated triggering of actions based on simple conditions (e.g. metric A > metric B), and the business person being demoed to said something to the effect of "oh, so it's AI!".

Anything that is artificially intelligent in some form is seen as AI to someone to the point where the term is quite meaningless.

Machine Learning is a better term because its somewhat more specific and we're not even close to AGI yet, so if it were up to me, I'd just retire the term AI altogether as not being useful.


> Anything software does that is in some way seen as "intelligent" is seen as AI by a layperson.

I mean... isn't that the technically-correct definition? It's artificial. It's intelligent. So it's artificial intelligence.

Academia can define "AI" as a term however it likes, but in practice people are always just going to interpret it as the brute juxtaposition of the two adjectives that compose it.


Sure, I'm not arguing against it, just pointing out what we think as AI isn't what a non-tech person sees as AI.

> in practice people are always just going to interpret it as the brute juxtaposition of the two adjectives that compose it.

Indeed.


Agreed! Something like this could be really useful in accelerating today's networks, and the AI word just gets in the way.


> neural net

Which is basically matrix multiplication.


Call it "just function approximation" if you want, but calling it just matrix multiplication removes the only important part, that the matrix's value are found through tiny adjustments made to generalize the empirical distribution of a dataset, and end up approximating it's function


Well, plus nonlinearities and backprop/gradient descent, to turn those matrix multiplications into a universal function approximator.
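
A minimal sketch of that combination (NumPy, arbitrary layer sizes): the matrix multiplies do the heavy lifting, and the nonlinearity between them is what keeps the whole thing from collapsing into a single linear map.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(784, 128)), np.zeros(128)
    W2, b2 = rng.normal(size=(128, 10)), np.zeros(10)

    def forward(x):
        h = np.maximum(x @ W1 + b1, 0.0)  # matrix multiply + ReLU nonlinearity
        return h @ W2 + b2                # another matrix multiply

    print(forward(rng.normal(size=(32, 784))).shape)  # (32, 10)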


Everything is matrix multiplication at some level!


What about matrix addition?


If you have a hammer ...


In the sense of Hamiltonians as linear operators on quantum states (such as the state of the universe)...


This is one of the first of what will probably be a number of announcements out of the Hot Chips conference, happening now through Tuesday on the Stanford campus.

https://www.hotchips.org


> The 46,225 square millimeters of silicon in the Cerebras WSE house 400,000 AI-optimized, no-cache, no-overhead, compute cores

> But Cerebras has designed its chip to be redundant, so one impurity won’t disable the whole chip.

This sounds kinda clever to a semi-layperson. Has this been attempted before? Edit: at this scale. Not CPUs with single-digit core counts being binned, but chips with 100k+ cores and some kind of automated core-deactivation on failure.


> This sounds kinda clever to a semi-layperson. Has this been attempted before?

Yes, by Trilogy Systems, and it went bust spectacularly. It raised $200M-plus of capital (back in the early 1980s!) and turned out to be the biggest financial failure in Silicon Valley's history.

https://en.m.wikipedia.org/wiki/Trilogy_Systems


First: thanks, this was exactly the kind of historic knowledge I hoped would show up in this thread!

Gene Amdahl ran this; geeze, no surprise they got funded.

Do you happen to know how many "compute units" this chip was designed to handle?

https://en.m.wikipedia.org/wiki/Trilogy_Systems

> These techniques included wafer scale integration (WSI), with the goal of producing a computer chip that was 2.5 inch on one side. At the time, computer chips of only 0.25 inch on a side could be reliably manufactured. This giant chip was to be connected to the rest of the system using a package with 1200 pins, an enormous number at the time. Previously, mainframe computers were built from hundreds of computer chips due to the size of standard computer chips. These computer systems were hampered through chip-to-chip communication which both slowed down performance as well consumed much power.

> As with other WSI projects, Trilogy's chip design relied on redundancy, that is replication of functional units, to overcome the manufacturing defects that precluded such large chips. If one functional unit was not fabricated properly, it would be switched out through on-chip wiring and another correctly functioning copy would be used. By keeping most communication on-chip, the dual benefits of higher performance and lower power consumption were supposed to be achieved. Lower power consumption meant less expensive cooling systems, which would aid in lower system costs.

Edit: '"Triple Modular Redundancy" was employed systematically. Every logic gate and every flip-flop were triplicated with binary two-out-of-three voting at each flip-flop.' This seems like it should it should complicate things quite a bit more dramatically.. they were doing redundancy at a gate-level, rather than a a CPU-core level.


Remember the three-core AMD chips? They were quad-core chips where one core was non-functional or didn't meet spec. AMD simply disabled the bad core and sold it as a three-core part. Though it didn't do well, as who wants to buy a "broken" quad core?


This is how most large-ish chips are sold. AMD no longer sells 3 core CPUs but they sell 6 and 4 core models which are partially disabled 8 core dies. The 6 core even seems to be their best selling model because it has a very good price:performance ratio. The Radeon 5700 is a cut down 5700XT, RTX 2060 is a cut down 2070, etc.


>> But Cerebras has designed its chip to be redundant, so one impurity won’t disable the whole chip.

> This sounds kinda clever to a semi-layperson. Has this been attempted before?

I think every large chip is like that nowadays. It's just a matter of degree.


Every time you see a CPU with 8 cores and one next to it with 6 cores at a lower price, they are generally the same chip but the 6 core one had 2 faulty cores which were disabled at the fab. The reason for the price difference is that chips with all good cores happen infrequently.

Designing wafer-level integration is not new either. It was in fact very fashionable in the 80s (although not very successful): https://en.wikipedia.org/wiki/Wafer-scale_integration


Yep, it’s standard practice. The old AMD tri-core chips for example were all physically four cores with a (usually) broken one disabled. Modern AMD chips use multiple chiplets on a single board to extend yields.


... and some of the new Epyc chips use only two cores per die out of eight. Customers still come out ahead since those chips have far more cache than a chip with 2 fully enabled 8 core dies.


Yes, every chip of moderate size does this. Selling floorswept chips as lower performance SKUs is very common.


Sure, that practice is pretty well known.

But at this scale?


It just has to detect damage and route around it. It is just an orchestration scheme in miniature.

Relevant XKCD: https://xkcd.com/1737/


whitepaper: https://www.cerebras.net/wp-content/uploads/2019/08/Cerebras...

I don't expect good yields from a chip that takes up the whole wafer. They must disable cores and pieces of SRAM that are damaged. How is this programmed?


> How is this programmed?

Full disclosure: I am a Cerebras employee.

There is extensive support for TensorFlow. A wide range of models expressed in TensorFlow will be accelerated transparently.


He was asking about the implications for yields. Do you route around bad dies/cores, and what are the implications for programming and performance?

For everyone else: normally a wafer is divided into dies, each of which (loosely) is a chip. Yield is the percentage of good parts, and it's very unlikely that an entire wafer is good. Gene Amdahl estimated that 99.99% yield is needed for successful wafer scale integration:

https://en.wikipedia.org/wiki/Wafer-scale_integration


> For example, the typical 300mm wafer from TSMC may contain “a modest hundred number of flaws,” said Feldman. Cerebras gave its Swarm interconnect redundant links to route around defective tiles and allocated “a little over 1% [of the tiles] as spares.”

https://www.eetimes.com/document.asp?doc_id=1335043&page_num...
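
Back-of-the-envelope, those numbers hang together under a simple Poisson defect model (treating each of the ~400,000 cores as a "tile"; all figures here are illustrative, not Cerebras's):

    import math

    n_tiles = 400_000
    defects_per_wafer = 100                 # "a modest hundred number of flaws"
    defects_per_tile = defects_per_wafer / n_tiles

    p_good = math.exp(-defects_per_tile)    # Poisson probability a tile has no defect
    expected_dead = n_tiles * (1 - p_good)

    print(p_good)         # ~0.99975
    print(expected_dead)  # ~100 dead tiles, well under the ~4,000 (1%) spares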


Looking at the whitepaper, I'm a little surprised how little RAM there is for such an enormous chip. Is the overall paradigm here that you still have relatively small minibatches during training, but each minibatch is now vastly faster?


IIRC they use batch size = 1 and each core only knows about one layer. Which is to say this thing has to be trained very differently from normal SGD (but requires very little memory). There is also the issue that they rely on sparseness, which you get with ReLU activations; if, for example, language models move to GELU activations they will be somewhat screwed.


It's because it's SRAM, not DRAM. Think how much L3 cache your processor has. A few MB probably. That's what this chip's memory is equivalent to.


We have up to 160 GB SRAM on our WSI. The rest of the transistors can be a few million cores or reconfigurable Morphle Logic (an open hardware kind of FPGA)

Our startup has been working on a full Wafer Scale Integration since 2008. We are searching for cofounders. Merik at metamorphresearch dot org


“full utilization at any batch size, including batch size 1”

https://www.cerebras.net/


That doesn't really mean anything. It (and any other chip) had better be able to run at least batch size 1, and lots of people claim to have great utilization... It doesn't tell me if the limited memory is part of a deliberate tradeoff akin to a throughput/latency tradeoff, or some intrinsic problem with the speedups coming from other design decisions like the sparsity multipliers, or what.


Most of the chip is already SRAM, I'm not really sure what else you would expect?

18 GiB × 6 transistors/bit ≈ .93 trillion transistors
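
Quick check of that arithmetic (assuming the standard 6-transistor SRAM cell):

    # 18 GiB of SRAM at 6 transistors per bit
    print(18 * 2**30 * 8 * 6 / 1e12)  # ≈ 0.93 trillion transistors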


Well, it could be... not SRAM? It's not the only kind of RAM, and the choice to use SRAM is certainly not an obvious one. It could make sense as part of a specific paradigm, but that is not explained, and hence why I am asking. It may be perfectly obvious to you, but it's not to me.


You basically have the option between SRAM, HBM (DRAM), and something new. You can imagine the risks with using new memory tech on a chip like this.

The issue with HBM is that it's much slower, much more power hungry (per access, not per byte), and not local (so there are routing problems). You can't scale that to this much compute.


But HBM and other RAMs are, presumably, vastly cheaper otherwise. (You can keep explaining that, but unless you work for Cerebras and haven't thought to mention that, talking about how SRAM is faster is not actually an answer to my question about what paradigm is intended by Cerebras.)


They say they support efficient execution of smaller batches. They cover this somewhat in their HotChips talk, eg. “One instance of NN, don't have to increase batch size to get cluster scale perf” from the AnandTech coverage.

If this doesn't answer your question, I'm stuck as to what you're asking about. They use SRAM because it's the only tried and true option that works. Lots of SRAM means efficient execution of small batch sizes. If your problem fits, good, this chip works for you, and probably easily outperforms a cluster of 50 GPUs. If your problem doesn't, presumably you should just use something else.


Do you support MATLAB or GNU Octave? I'm looking for the level of abstraction below TensorFlow because I find pure matrix math to be more approachable. Admittedly, I'm not super experienced with TF, so maybe it can encapsulate them.

Also, do you have a runtime to run the chip as a single 400,000-core CPU with some kind of memory-mapped I/O, so that a single 32- or 64-bit address space writes through to the RAM, routed through virtual memory? I'm hoping to build a powerful Erlang/Elixir or Go machine so I can experiment with other learning algorithms in realtime, outside the constraints of SIMD-optimized approaches like neural nets. Another option would be 400,000 virtual machines in a cluster, each running a lightweight unix/linux (maybe Debian or something like that). Here is some background on what I'm hoping for:

https://news.ycombinator.com/item?id=20601699

See my other comments for more. I've been looking for a parallel machine like this since I learned about FPGAs in the late 90s, but so far have not had much success finding any.


So why are you not publishing benchmarks against nvidia?


Cerebras is an MLPerf member, so they will publish MLPerf numbers some day and then we will talk.


They probably ran the benchmark (I'd guess many times, and not only against Nvidia). But it is still not in the white paper.

I was an SE at a hardware company and it is the first thing that you do as a product manager.


What is an SE?


Software engineer.


How do you achieve this? Tensorflow does not support openCL.


I'm sure they wrote a new backend for tensorflow that targets their API. Since the hardware is only for ML, it wouldn't make sense for them to bother trying to implement OpenCL.


If you divide that giant piece of silicon into 400k processors, and then only use the ones that actually work...

I wonder if they figure that out every time the CPU boots, or at the factory. At this scale, maybe it makes sense to do it all in parallel at boot. Or, even dynamically during runtime.

There may be edge case cores that sort of work, and then won't work at different temps, or after aging?


They'll aim to catch the logic failures and memory failures during wafer test at the factory. This testing is done at room temperature and at hot. There are margins built in to allow for ageing. If they want to ship a decent product they'll also need to repeat the memory testing every boot, and ideally during runtime but maybe the latter isn't a big deal for something like this.

EDIT: to add a bit more and possibly address the original question (which I think keveman may have misunderstood), there will usually be some hardware dedicated to controlling the chip's redundancy. Part of that is often an OTP fuse-type thing that can be programmed during wafer test to indicate parts of the chip that don't work. Something (software or hardware) will read that during boot and not use those parts of the chip.
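
In software terms, that boot-time step amounts to something like the sketch below (purely illustrative: the fuse map, grid size and positions are made up, not Cerebras's actual scheme):

    # Read a (hypothetical) fuse map of defective positions and build a table of
    # usable cores; the rest of the stack only ever sees logical core IDs.
    GRID = 632                                 # ~400k positions; made-up dimension
    fuse_map = {(3, 7), (120, 44), (555, 19)}  # physical (row, col) marked bad at wafer test

    logical_to_physical = [
        (r, c)
        for r in range(GRID)
        for c in range(GRID)
        if (r, c) not in fuse_map
    ]
    print(len(logical_to_physical))  # grid size minus the fused-out positions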


Sure, that makes sense.

With this many cores it seems like the probability that a core dies during a multi-hour job (or in case it's used for inference, during a very long-lived realtime job) is pretty high, so the software in all layers would need to handle this kind of exception. They probably don't, today, since we haven't seen a 400k core chip before.


Wow! 56 (!) times more cores than the biggest Nvidia Volta chip, a single 46 square millimeters chip (no chiplets like recent AMD chips), and an incredible whopping 18 GB of SRAM (that's like 18 GB of CPU cache, basically)!

I don't know if you guys are used to that scale, but I find it monstrous!


It's wafer scale integration. All the major AI learning chips are probably heading this way.

That is, CPUs and GPUs are made on silicon wafers, typically 30cm in diameter, with many chips built on a single wafer and then cut from the wafer and packaged into products. The idea of wafer-scale integration is that instead of cutting the wafer, you just build all the communication between the computation elements in the wafer into the wafer and therefore get a "network on wafer", a single very massive "chip".

The reason to do this is that the lowest-energy way to communicate data is on-chip, and in major AI learning setups, by far the most of the power is spent on data movement. By making the largest possible chip, you minimize the data movement power requirements and spare more power for the computation.


Wafer scale integration does seem to be the logical conclusion of the integrated circuit. When Jack Kilby invented the IC it was indeed a stroke of genius.


Not necessarily. The problem is power density. Imagine jamming 56 CPUs that close to each other. Hard to fit the fans in!

Apparently this is going to be water cooled.


It's 1.5 kW of thermal output, about 5 times the biggest Xeon.


It is monstrous. But the way I read it, it is actually similar to chiplets. Only chiplets are cut from the wafer, individually tested and then combined for the final chip. Here all parts stay on the wafer and faulty ones are routed around.


Since they are wiring the chips on the wafer and routing around defects, I imagine a chiplet design on a wafer-sized interposer would help them deal with yield issues. I wonder what their competition will do. It's certainly possible for Nvidia or AMD to bundle GPU dies on top of large interposers and TSMC has already shown large ones, though nothing on this scale.


I was a little confused when I read 46 square millimeters in your comment. I am guessing you meant 46k square millimeters, as the article stated 46,225 square millimeters, which, yes, is monstrous. Very cool! As a caregiver I often discuss with my clients the cool things that have come into existence in their lifetime, and I wonder, when I get old, what things my children will reflect back on and say, "Grandpa, you were alive when X was invented? How neat." Personally I am hoping it is fusion power.


The young people I work with are always shocked when I tell them there was no internet (or at least, nothing terribly useful at the consumer level) when I went to college.


Ah, stupid me :). I had no idea of the scale of things and in my native locale the comma is the decimal point...


I'll put my vote in for commodity room-temperature superconductors.


I am just wondering why there is no similar product built as an FPGA array. As far as I know, that's the cheapest way to see if there is product/market fit for a semiconductor product. High-speed transceivers as well as memory controllers are included in FPGAs. This single-wafer approach looks very interesting to me. I was an intern at Infineon some time ago, working on device distribution characterization across a 200 mm wafer. The chips in the middle performed 2-3x better than those at the border. So how does Cerebras's chip manage this issue? Are the middle parts throttled, or are low-performing areas near the wafer's border disabled? How much does it cost?.. I can imagine it being shipped on a thermal pad with liquid nitrogen cooling below. There must be some wires bonded for the interface to the host. Very interesting technical project. I am very curious who the clients are for such a huge specialized chip.


An FPGA would make it a much more generic product, so you could sell into more markets. But you would lose a factor of 200 (a factor of 1000 with traditional FPGA design) to make the transistors reconfigurable.

If you leave the wafer intact, you get 70,000 mm2. Cerebras cut off almost half of the wafer.

At 7nm you would get 2.7 trillion transistors with 300 mm wafers, more with 450 mm wafers. You disable those reticle-sized areas with impurities or damage at runtime.

You can cool it with immersive liquid cooling. Instead of wire bonding you can stack chips on top or use free space optics [1].

[1] https://www.youtube.com/watch?v=7hWWyuesmhs


There's a photo on the homepage: https://www.cerebras.net/ It's just about 8.5" across. You could just barely print a 1:1 photo of it on a letter-sized sheet of paper, with less than a half-milimeter margin on each side.


I wonder if they bond it to a copper slab. At that scale, the tiniest amount of PCB flex would probably shatter the die/wafer...


They do not. They have "developed [a] custom connector to connect wafer to PCB": https://imgur.com/a/sXxGbiD


I think "cold plate" is the slab.


Yes, it is probably a copper slab (they never mentioned), but there is no electrical connection there, as the rest of the slides make clear: https://imgur.com/a/Rbd7e4D

It just provides a thermal connection for water cooling. The electrical connection is made through the PCB (and probably through a thick copper plate on the opposite side of the PCB)


Is there any comparison of training speed of a neural net for one of these chips and a typical one? I'd be interested to see how long it takes to train an imagenet classifier on one of these compared to other hardware.


I remember watching the (excellent!) Tesla Autonomy Day presentation, and an analyst was asking about 'chiplets' but was met with a bit of dismissal from the Tesla team.

Maybe THIS is what the analyst had in mind! Pretty cool stuff, although I question how interconnect / inter-processing-unit communication would work.

Notably, no benchmarks in the press release...


"The Cerebras software stack is designed to meet users where they are, integrating with open source ML frameworks like TensorFlow and PyTorch"

What's the instruction-set? They don't say.

I assume you need to program that monster contraption in some DSL or Verilog-ish macro-assembler. Python is probably not what works well ...


Your CoreML model can run on Apple Neural Engine, but Apple doesn't expose that hardware's instruction set. This probably works similarly.



> In another break with industry practice, the chip won’t be sold on its own, but will be packaged into a computer “appliance” that Cerebras has designed. One reason is the need for a complex system of water-cooling, a kind of irrigation network to counteract the extreme heat generated by a chip running at 15 kilowatts of power.

15 kW, yikes.


So, Azul Systems 2.0? clever(ish) hardware with good(ish) results, for too much $$$ for anyone to actually buy?


The main difference is that what Azul built had more limits - once you run your workload well enough, there is little incentive to have more compute power.

When it comes to ML, the more compute power you throw at it, the better.


I know very little about this sub-field, but from a layman's perspective, and given competition between Intel and NVidia in the field of deep learning, I would not be surprised if Intel tried to acquire or takeover this company for their single-chip designs.


Would this really be more efficient in terms of cost/performance? It seems the specialized nature of the chip pushes the price high enough that you could build equivalent systems with traditional hardware for the same or less, and it would all be known quantities rather than working with something brand new and not as well understood.


Specialized AI chips don't seem like a very good business idea to me.

The way we do things (in AI) today - we may be doing things completely different tomorrow. It's not like there's a standard everyone has agreed on.

There is a very real risk these specialized, expensive devices will go the way of the Bitcoin ASIC miner (which saturated secondary markets at a fraction of its original cost).

Source: I do ML consulting and build AI hardware.


Isn't BLAS a standard everyone has agreed on?

Making a matrix multiplication accelerator seems a pretty safe bet to me. I am less sure about sparsity optimization, but I guess it still works for dense matrices even in the worst case.


The way we do things in AI today is multiplication of two large matrices. Just like we did it 30 years ago: http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf


Sure, but Cerebras isn't just multiplying two large matrices; they are multiplying two large, very sparse matrices, relying on ReLU activation to maintain sparsity in all of the layers. We already have BERT/XLNet/other transformer models moving away from ReLU to GELU, which does not result in sparse matrices. "Traditional" activations (tanh, sigmoid, softmax) are not sparse either.
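
The sparsity point is easy to see numerically; a small sketch (NumPy, with the common tanh approximation of GELU) showing that ReLU produces exact zeros a sparsity-aware datapath can skip, while GELU leaves small but nonzero values everywhere:

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def gelu(x):  # common tanh approximation
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

    x = np.random.default_rng(0).normal(size=100_000)
    print((relu(x) == 0).mean())  # ~0.5: half the activations are exact zeros
    print((gelu(x) == 0).mean())  # ~0.0: essentially nothing to skip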


Good point. I think it's a safe bet to focus on dense dot product in hardware for the foreseeable future. However, to their defense:

1. It's not clear that supporting sparse operands in hw would result in significant overhead.

2. DL models are still pretty sparse (I bet even those models with GELU still have lots of very small values that could be safely rounded to zero).

3. Sparsity might have some benefits (e.g. https://arxiv.org/abs/1903.11257).


While the theoretical innovations have mostly been incremental, there has been a lot of progress in the development of "light" deep learning frameworks - so the tasks that previously required massive GPUs can now run on your phone. And this trend will continue.


Last I checked all those light frameworks still have to do good old matrix multiplications. What's changed?


I see your point. Fundamentally, the same multiplications.

However, if we look at TF Lite, for example - its internal operators were tuned for mobile devices, its new model file format is much more compact, and does not need to be parsed before usage. My point is - the hardware requirements aren't growing; instead, the frameworks are getting optimized to use less power.


I wish this was the case. 5 years ago I could train the most advanced, largest DL models in reasonable time (few weeks) on my 4 GPU workstation. Today something like GPT-2 would probably take years to train on 4 GPUs, despite the fact that GPUs I have now are 10 times faster than GPUs I had 5 years ago.


This seems targeted for training, not inference. It definitely seems to me compute need is growing for training. (Is TF Lite even relevant at all for training?)


The energy cost of the interconnects between many chips is much higher than between the equivalent circuits on a wafer scale integration. The performance of the interconnect between such circuits is much higher. The cost of packaging is 1/78th with wafer scale.


How do they go and package something this big? Is this supposed to be used or is it just a headline?


There's some information on this towards the end of the slides at https://www.anandtech.com/show/14758/hot-chips-31-live-blogs....

Sadly there's very little concrete info on the packaging methodology (hardly a surprise), which to me is the only truly novel thing about this chip. But it must cost an absolute fortune.


Impressive but is this a good idea?

Either you need a crazy thermal solution, or it must be way less thermally dense than smaller chips can be. And is it really that much of an advantage to stay on chip compared to going through a PCB, if distances are crazy?


They are using vertical water pipes for cooling to solve the thermal density problem. At 15 kw they're doing OK there.


I'm surprised this hasn't been done before; a monolithic wafer can achieve denser interconnects, and is arguably simpler than a multi-chip module.

On the other hand, multi-chip modules can combine the working chips in low-yield wafers, whereas a monolithic wafer would likely contain many failed blocks, uselessly taking up space / distancing the working chips.

Cooling isn't really a problem, as the TDP scales with area as in other chips. Water cooling or heat pipes can transport the heat away to a large heat sink. 3D / die-stacked chips have a harder cooling problem, potentially requiring something like an intra-chip circulatory system.


> I'm surprised this hasn't been done before

It has, a couple of people have linked to https://en.wikipedia.org/wiki/Wafer-scale_integration


It's 15kW, you need a custom liquid cooling solution that they package the chip with.


It does seem to enable interconnect with impressive latency and bandwidth. It probably is worth it considering both Google and Nvidia systems are interconnect-limited.


In their whitepaper they claim "with all model parameters in on-chip memory, all of the time", yet that entire 15 kW monster has only 18 GB of memory.

Given the memory vs compute numbers that you see in Nvidia cards, this seems strangely low.


Check the C-suite track record... a pattern of making quick-selling companies on a "wow effect" which then quickly turn defunct and valueless after the sale. There have been big red flags about Cerebras's claims for quite some time. Some say it is Graphcore on steroids.

Not so much on the tech side; stuff like that has been tried before, without good results (the more you fill the reticle, the poorer the exposure). But the business side of their claims doesn't make sense.

First, and the biggest one, is the economics. It is completely impossible that a run as small as 100, or even 1000, wafers will be more economical than a mass-market product, even if you deduct packaging costs.

On top of that, any process modification or "tweak" for a low-volume run will destroy any economy of scale. And as I understand it, they pretty much brag about doing so.

Lastly, some tech notes. Maybe they have the issue solved, maybe not: the bigger the chip, the more memory-starved it is, for a simple reason of geometry.

With a "chip" the size of a wafer, it is going to be extremely memory-starved unless it has more I/O than computing devices.

Then, the thermal ceiling for CMOS is around 100 W per cm², and it is a very hard limit. I don't see why they brag about beating it when they truly didn't: 20 W per cm² is quite low for HPC.

I suspect they are indeed quite limited by thermals, if they had to backpedal on their original claims.


what's wrong with Graphcore?


That's 18GB of static RAM accessible in one clock cycle... the memory on a GPU isn't in the same class of fast. Given the bandwidth and latency of this thing, you'd likely have to use a cluster of machines doing all sorts of pre-, post- and I/O processing just to keep this thing busy.


18GB is huge! An NVIDIA V100 has 6MB of L2 memory. HBM is off-chip, and vastly (~100x) slower.


That's true, but it doesn't match their claim of keeping all of the model on the chip.

An 18 GB cache is huge for sure, but that's not what they claim.


18GB of very fast memory will still be just as hard to keep fed with data as that 6MB cache


The idea is that the whole model resides in the fast memory, so you don't need to ‘keep it fed’.


44K/core is very little memory


Indeed, but cores are only responsible for small fragments of the network, so don't need huge amounts of memory.


Unless you need to multiply large matrices, where you need access to very large rows and columns...like in...ML applications


That's what the absurdly fast interconnect is for. You send the data to where the weights are.


Absurdly fast != Single cycle

It will be physically impossible to access that much memory single cycle at anything approaching reasonable speeds. I suppose you could do it at 5Hz :)


A core receives data over the interconnect. It uses its fast memory and local compute to do its part of the matrix multiplication. It streams the results back out when it's done. The interconnect doesn't give you single-cycle access to the whole memory pool, but it doesn't need to.


I think it is telling that in one sentence there is a claim that it is faster than Nvidia, and in another, a claim that it does TensorFlow. I do not think this architecture could do both of these at once. It could not do TensorFlow fast enough (not enough local fast memory) to compete even with a moderate array of GPUs.


Hey! You seem knowledgeable, mind emailing me at tapabrata_ghosh [at] vathys (dot) ai ?


18 GByte of memory with 1 clock cycle latency? That's impressive.


I assume that each byte of that memory has a 1 clock cycle latency to only one of the 400,000 cores on the wafer. That's about 45KByte of memory per core; 1 clock cycle latency to a block that small is quite reasonable.


I didn't see any indication, but I'd give ~0% chance that the full address space can be accessed in one clock.


Depends on how fast a clock cycle is, I suppose. It's easy to make something one clock cycle if the clock ticks verrrry slowly ;)


ASICs do not support CUDA. There is a forked TensorFlow with OpenCL support from AMD, but I doubt people will use it for this ASIC. So how can TensorFlow/CNTK/PyTorch use such hardware?


The exact same way TensorFlow supports TPU: by writing another backend. TPU doesn't support CUDA either.


So instead of having common optimized kernels for AMD, Intel, all ARM firms, all FPGAs and all ASICs, each vendor is reinventing the wheel in its own backend? Not so surprising ^^


since the article didn't post a picture of the bare chip:

https://i.imgur.com/cMo4w0C.jpg


But... can I play Crysis - oops sorry - can I train ResNet on this thing?


I wonder what the price and power consumption will be. Does it need a specialized server with a huge power supply module?


A 300 mm wafer at the 16 nm node would cost more than $6,500 apiece. The power consumption would be more than 20 kW if all transistors were in use simultaneously. We are designing a special reconfigurable AC-DC and DC-DC power router (inverter) to supply this huge amount of power. Our WSI design will be cooled by liquid immersion.


There goes Intel. Behind the curve again...


can anyone comment on the differences between cerebras and other chip startups trying to rethink the semiconductor architecture for AI? what are the main technical forks?


While this is certainly a very impressive achievement, I am personally interested in small and light AI.

World's tiniest AI chip? That would get me excited!


What's "AI chip"? You can build a full adder, and call it "an AI chip".


By that I mean any chip, from a generic one (GPU) to something specialized (vision, NLP, etc.). Any chip that makes training or running TF/Caffe models faster.


Faster than what? A tiny chip will be slower than a large chip.


faster while having the same form factor, energy usage, cost.


Even under those constraints you can build something that's either fast or general. Pick one.


uuuuuuugh



