
In-mem (generally) means no (re)loading of data from a storage device.



Sure, but I don't think that makes sense here. When I run an LLM on CPU, I load it into memory and run it; when I run on GPU, I load the model into the GPU's memory and run it; and while I don't have anywhere near that much money to burn, I imagine that with an FPGA I would likewise load the model into its memory and run it from there. So the fact that they're saying "in-memory" in contrast to, e.g., a GPU makes me think they're talking about something different here.


It's a different kind of memory chip that also does some computation. See https://en.m.wikipedia.org/wiki/In-memory_processing


While this has been proposed repeatedly for many decades, I doubt that it will ever become useful.

Combining memory with computation seems good in theory, but it is difficult to do in practice.

The fabrication technologies for DRAM and for computational devices are very different. If you implement computational units on a DRAM chip, they will perform much worse than units implemented with a dedicated fabrication process, for instance in performance per watt and per occupied area, leading to higher costs than using separate memory and computational devices.

The higher cost might be acceptable in certain cases if a much higher performance is obtained. However, unlike a CPU/GPU/FPGA, which you can easily reprogram to implement a completely different algorithm, a device with in-memory computation would inevitably be much less flexible. Either it implements extremely simple operations, like adding to or multiplying memory, which would not improve performance much because of communication overheads, or it implements more complex operations, perhaps some ML/AI algorithm that is popular at the moment, which would be hard to adapt once better algorithms are discovered.


I suspect that attempts to remove the DRAM controller and embed it into the memory chips directly will succeed in meaningfully reducing the power per retrieval and increasing the bandwidth by enough that they'll postpone these more esoteric architectures, even though it's pretty clear that bulk data processing like LLMs (and maybe even graphics) is better suited to this architecture, since it's cheaper to fan out the code than it is to shuffle all these bits back and forth.


In-memory doesn’t mean in-DRAM.

https://arxiv.org/pdf/2406.08413


Am I misreading something?

> At their core, NVM arrays are arranged in two dimensions and programmed to discrete conductances (Fig. 5). Each crosspoint in the array has two terminals connected to a word line and a bit line. Digital inputs are converted to voltages, which then activate the word lines. The multiplication operation is performed between the conductance gij and the voltage Vi by applying Ohm’s law at each cell, while currents Ij accumulate along each column according to Kirchhoff’s current law

Sounds like the compute element is embedded within the DRAM, but instead of a digital computation it's done in analog space (which feels a bit wrong, since the DAC+ADC combo would eat quite a bit of power, but maybe it's easier to manufacture, or there are other reasons to do it in analog).
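
As a toy numerical model of what that quoted passage describes (just an illustration of the Ohm's-law / Kirchhoff's-law idea, not any vendor's actual design; the sizes and bit widths are made up):

    import numpy as np

    # Weights live in the array as conductances g_ij; DACs turn digital
    # inputs into word-line voltages; each cell contributes I = g_ij * V_i
    # (Ohm's law); currents sum along each bit line (Kirchhoff's current
    # law); ADCs digitize the column currents.
    rng = np.random.default_rng(0)
    G = rng.uniform(0.0, 1.0, size=(256, 256))   # programmed conductances
    x = rng.integers(0, 16, size=256)            # 4-bit digital inputs

    V = x / 15.0 * 0.5                           # DAC: codes -> word-line voltages
    I = G.T @ V                                  # analog multiply-accumulate per column

    adc_bits = 8
    lsb = I.max() / (2**adc_bits - 1)            # crude full-scale ADC model
    y = np.round(I / lsb)                        # ADC: bit-line currents -> codes

The whole matrix-vector product happens "in place" in the array; only the DAC/ADC and the input/output codes are digital.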

Or you're saying it would be better with flash storage because it could be used for even larger models. I think that's right, but my overall point holds - removing the DRAM controller could free up significant DRAM bandwidth (like 20x IIRC) and reduce power (by 100x IIRC). There's value in that regardless; it would be a free speedup and would significantly benefit existing LLMs that rely on RAM. An analog compute circuit embedded within flash would be usable basically only for today's LLM architecture, would not be very flexible, and would require a huge change in how this stuff works to take advantage of it. It might still make sense if the architecture remains largely unchanged and other approaches can't be as competitive, but it locks you into a design more than something more digitally programmable that can also do other things.


> Am I misreading something?

Yes, you are. NVM stands for "non-volatile memory", which is literally the opposite of DRAM.

Analog computation can be done using any memory cell technology (transistor, capacitor, memristor, etc.), but the result will always go through an ADC to be stored in a digital buffer.

Flash does not provide any advantage in terms of model size; the size of the crossbar is constrained by other factors (e.g. current leakage), and typically it's in the ballpark of 1k x 1k matmuls. You simply put more of them on a chip and try to parallelize as much as possible.
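
To make the "put more of them on a chip" point concrete, here's a rough sketch of tiling a larger matmul across 1k x 1k crossbars (purely illustrative; crossbar_mvm stands in for one analog tile plus its ADC):

    import numpy as np

    TILE = 1024  # ballpark crossbar size mentioned above

    def crossbar_mvm(G_tile, v_tile):
        # stand-in for one analog tile (Ohm + Kirchhoff + ADC in hardware)
        return G_tile.T @ v_tile

    def tiled_mvm(G, v):
        # split a large weight matrix across many tiles and accumulate
        # the partial results digitally
        rows, cols = G.shape
        y = np.zeros(cols)
        for r in range(0, rows, TILE):
            for c in range(0, cols, TILE):
                y[c:c+TILE] += crossbar_mvm(G[r:r+TILE, c:c+TILE], v[r:r+TILE])
        return y

    rng = np.random.default_rng(1)
    G = rng.standard_normal((4096, 2048))
    v = rng.standard_normal(4096)
    assert np.allclose(tiled_mvm(G, v), G.T @ v)

All the tiles can run in parallel; only the digital accumulation of the partial sums needs coordination.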

But I largely agree with your conclusion.


Using analog means it will be faster (digital is slow, waiting for the carry on each bit), but I am curious how they do the ADC. RAM processes are generally so different that it makes sense not to introduce logic gates into the memory.


Digital is slow, but I would think converting the signal to/from digital might be slow too. Maybe it's taking the analog signal from the RAM itself and storing the analog signal back with a little bit of cleanup, without ever going into the digital domain?


Oh, absolutely. Never switching to digital would be the way. And not hard for low bit counts like 4. I am very interested in the methodology if they do this with 64 bits.
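
As a back-of-the-envelope illustration of why low bit counts are the easy case (the 1 V full scale and ~1 mV noise floor are assumed round numbers, not measurements of any real chip):

    # Volts per LSB if an N-bit value had to survive purely in the
    # analog domain over a 1 V swing.
    FULL_SCALE_V = 1.0
    NOISE_FLOOR_V = 1e-3   # assumed, order-of-magnitude only

    for bits in (4, 8, 16, 64):
        lsb = FULL_SCALE_V / 2**bits
        side = "above" if lsb > NOISE_FLOOR_V else "below"
        print(f"{bits:2d} bits: {lsb:.3e} V per LSB ({side} the assumed noise floor)")

4 bits gives ~62 mV per step, which is easy to keep clean; 64 bits would need steps around 5e-20 V, hopelessly below any physical noise floor, which is why high-precision analog pipelines end up going back to digital.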


SRAM does not have enough capacity to be useful for in-memory computation.

The existing CPUs, GPUs and FPGAs are full of SRAM that is intimately mixed with the computational parts of the chips, and it would be hard to find a structure that improves on that.

All the talk about in-memory computing is strictly about DRAM, because only DRAM could increase the amount of memory from the hundreds of MB currently contained inside the biggest CPUs or GPUs to the hundreds of GB that might be needed by the biggest ML/AI applications.

All the other memory technologies mentioned in the paper you linked are many years or even decades away from being usable as simple memory devices. In order to be used for in-memory computing, one must first solve the problem of making them work as memories. For now, it is not even clear whether this simpler problem can be solved.


Let's see: Mythic uses flash, d-Matrix uses SRAM. Encharge is the only one that uses capacitor-based crossbars, but those are custom-built from scratch and very different from any existing DRAM technology.

Which companies are using DRAM for in-memory computing?


Mythic does not do in-memory computing, despite their claims.

Flash cannot be used for in-memory computing, because writing it is too slow.

According to what they say, they have an inference device that uses analog computing. They have a flash memory, but it stores only the weights of the model, which are constant during the computation, so the flash is not a working memory; it is used only to reconfigure the device when a new model is loaded.

Analog computing for inference is actually something that is much more promising than in-memory computing, so Mythic might be able to develop useful devices.

d-Matrix appears to do true in-memory computing, but the price of their devices for an amount of memory matching a current GPU will be astronomical.

Perhaps there will be organizations willing to pay huge amounts of money for a very high performance, like those which are buying Cerebras nowadays, but such an expensive technology will always be a niche too small to be relevant for most users.


You don't need to write anything back to flash to use it to compute something: the output of a floating-gate transistor is written to some digital buffer nearby (usually SRAM). Yes, it's only used for inference, but I'm not sure how that disqualifies it from being in-memory computing. In-memory computing simply means there's a memory device/circuit (transistor, capacitor, memristor, etc.) that holds a value and is used to compute another value based on some input received by the cell, as opposed to a traditional ALU, which receives two inputs from a separate memory circuit (registers) to compute the output.
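
A toy sketch of that distinction (purely illustrative, not anyone's actual hardware): the stored value stays put in the cell and the input comes to it, versus an ALU that fetches both operands from registers:

    class InMemoryCell:
        # Holds a stored weight; computes against an input arriving at the cell.
        def __init__(self, weight):
            self.weight = weight           # the value never leaves the cell
        def apply(self, x):
            return self.weight * x         # e.g. a floating-gate transistor's response

    def alu_multiply(reg_a, reg_b):
        # traditional model: both operands are read out of a separate
        # register file and brought to the ALU
        return reg_a * reg_b

    cells = [InMemoryCell(w) for w in (0.5, -1.0, 2.0)]
    y = sum(c.apply(x) for c, x in zip(cells, (1.0, 2.0, 3.0)))   # weights never move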


This is not in-memory computing, because from the point of view of the inference algorithm the flash memory is not a memory.

You can remove all the flash memory and replace all its bits with suitable connections to ground or the supply voltage, corresponding to the weights of the model.

Then the device, without any flash memory, will continue to function exactly as before, computing the inference algorithm without changes. If you can remove the memory without affecting the computation, it should be obvious that this is not in-memory computing.

The memory is needed only if you want to be able to change the model, by loading another set of weights.

The flash memory is a configuration memory, exactly like the configuration memories of logic devices such as FPGAs or CPLDs. In FPGAs or CPLDs you do the same thing: you load the configuration memory with a new set of values, and the FPGA/CPLD then implements a new logic device until the next reload of the configuration memory.

Exactly as in this device, the configuration memory of the FPGAs/CPLDs, which may be a flash memory too, is not counted as a working memory. The FPGAs/CPLDs contain memories and registers, but those are distinct from the configuration memory and, unlike the configuration memory, they cannot be implemented with flash.

In this inference device with analog computing there must also be a working memory, which contains mutable state, but that must be implemented with capacitors that store analog voltages.

You might talk about in-memory computing only with reference to the analog memory with capacitors, but even that description is likely to be misleading: the inference device probably has some kind of dataflow structure, where the memory capacitors implement something like analog shift registers rather than memory cells in which information is stored for later retrieval.


+1. Personal opinion: accelerators are useful today but have kept us in a local minimum that is certainly not ideal. There are interesting approaches such as near-linear low-rank approximation of attention gradients [1]. Would we rather have that, or somewhat better constant factors?

[1] https://arxiv.org/html/2408.13233v1


Not in the context of discussing hardware architectures.

(Context in the abstract is "First, we present the accelerators based on FPGAs, then we present the accelerators targeting GPUs and finally accelerators ported on ASICs and In-memory architectures" and the section title in the paper body is "V. In-Memory Hardware Accelerators")



