This looks pretty cool! What's the catch? E.g. why isn't this already implemented in accelerators? Is it really just a forgotten algorithm, or does it have implications for the cost of building the accelerator, or something else?
It's not just a software algorithm. It's a hardware architecture optimization. To benefit, you have to build hardware that matches the dimensions of the algorithm. That's an expensive commitment.
> you have to build hardware that matches the dimensions of the algorithm
Yes, the benefits are realized in custom hardware designs as opposed to software. However, the hardware architectures still work for multiplying matrices of arbitrary dimensions: larger matrices are split into smaller tiles, and the tile products are summed to form the final, larger matrix product (i.e. GEMM).
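To make the tiling point concrete, here is a minimal Python/NumPy sketch (function and variable names are mine, purely illustrative, not from the repo):

    import numpy as np

    def tiled_matmul(A, B, tile=4):
        # Multiply A (m x k) by B (k x n) by accumulating products of small tiles,
        # the way a fixed-size hardware GEMM unit gets reused for larger matrices.
        m, k = A.shape
        k2, n = B.shape
        assert k == k2
        C = np.zeros((m, n), dtype=A.dtype)
        for i in range(0, m, tile):
            for j in range(0, n, tile):
                for p in range(0, k, tile):
                    # each small product below is what the fixed-size accelerator computes
                    C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
        return C

    A = np.arange(36).reshape(6, 6)
    B = np.ones((6, 6), dtype=A.dtype)
    assert np.array_equal(tiled_matmul(A, B, tile=3), A @ B)

The same decomposition is what lets a fixed-size systolic array handle arbitrarily large matrices.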
Not so much in FPGAs... although I'm not sure top-end FPGAs would beat Nvidia GPUs even with this algorithm, and even if cost were not a consideration.
IMHO, for fixed-point MM accelerators, there is no catch; I think it's an overlooked algorithm. It's based on an algorithm by Winograd, who coincidentally also proposed another, unrelated algorithm that later became very popular for CNN acceleration, which may have taken some visibility away from this one. But that is speculative.
On the other hand, if you tried it with floating point, you'd lose significant digits. Since the approach is to sum (a[i] + b[i+1])(a[i+1] + b[i]) and subtract the sums of a[i]a[i+1] and b[i]b[i+1] in the end to get a[i]b[i] + a[i+1]b[i+1], you may be taking the difference of two large values to get a small value, losing precision.
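For the curious, here is a small Python sketch of that identity (my own illustration, not the repo's code), including how the cancellation can bite in floating point:

    import numpy as np

    def fip_dot(a, b):
        # Winograd-style "fast inner product" for even-length vectors:
        # roughly half the multiplications, at the cost of extra additions.
        assert len(a) == len(b) and len(a) % 2 == 0
        cross = sum((a[i] + b[i+1]) * (a[i+1] + b[i]) for i in range(0, len(a), 2))
        # these two sums depend only on a (resp. b), so in a full matrix multiply
        # they can be computed once per row/column and reused across the matrix
        sa = sum(a[i] * a[i+1] for i in range(0, len(a), 2))
        sb = sum(b[i] * b[i+1] for i in range(0, len(b), 2))
        return cross - sa - sb

    a = [3, -1, 4, 2]
    b = [2, 5, -3, 1]
    assert fip_dot(a, b) == np.dot(a, b)  # exact in integer / fixed-point arithmetic

    # In floating point, an O(1) answer can be recovered as a difference of huge sums:
    rng = np.random.default_rng(0)
    x = rng.normal(scale=1e8, size=1000)
    y = rng.normal(scale=1e-8, size=1000)
    print(np.dot(x, y), fip_dot(x, y))  # the second value typically carries a much larger error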
LLM hype and this submission in particular keep making me think of a lecturer I had for Topics in Large Dimensional Data Processing, circa 2016: as I recall he was enthusiastically adamant that the most important thing, breakthroughs etc., in years/decades to come was going to be faster matrix operations. Anyway, I'm pretty sure I recognise FIP (not FFIP of course) from that course.
I wish I could remember his name. I believe he left academia after my year and went to work in industry; I'd just be curious to see what he's up to now. I'm not saying it was a particularly novel or prescient comment/attitude (we may not have had quite such ML hype then, but 'big data' was certainly all the rage at the time), it's just something that's stuck in my mind. One of those areas I always meant to study more; realistically I probably never had the mathematical chops for it, and the ones I did have certainly atrophied.
Maybe I’m joking, but: our society is just a vehicle for economics at this point, our economy is built around science, our science has mostly been turned into observations about engineering, some time ago we changed all of engineering into differential equations, and differential equations can be solved by discretizing them and doing linear algebra, and most of linear algebra can be done with matrix multiplications (triangular solves and orthonormalizations if you are fancy). All you need is matmul.
> our science has mostly been turned into observations about engineering
You may be joking but that in particular seems pretty astute.
Superficially it seems accurate, and reasonably ascribable to economic forces: there are fewer concentrations of capital in individuals (aristocrats) spending it on a hobby interest or an academic pursuit of their own; today's equivalents mostly prefer philanthropy (Musk is, I suppose, whatever else you might think of him, a notable exception, preferring to explore space, AI, etc., not really, it seems, for personal monetary gain). But I wonder if that is fair to modern scientists, or is it just that all the low-hanging stuff has been done?
For the life sciences you need grad students / postdocs to do the grunt work of pipetting, dissecting, plating, etc. And whatever the equivalent is in chemistry (titration/GC/mass transfer, I guess)?
But those tools created by engineers are pretty darn important, and allow plenty of experiments/observations to be performed that were previously out of reach.
I don't think so, this may be the case in your country of course. I may well have recorded it in my notes if I dig them out, but this was a fourth year course and they certainly degraded over the years.
There are a lot of matrix multiplication algorithms out there with a lot of pluses and minuses. It's always a balance of accuracy, runtime, and scaling. This one probably has bad accuracy in floating point.
For everyone discussing the reduced accuracy/numerical stability of the algorithms in floating-point, this is true. But note that the application of the algorithms in the work is explored for fixed-point MM/quantized integer NN inference, not floating-point MM/inference. Hence, there is no reduction in accuracy for that application of it compared to using conventional fixed-point MM.
"Conventional fixed-point MM" is a large suite of algorithms. It is correct that this is a 2x reduction in MULs compared to naive fixed-point matrix multiply, but there is a large body of literature out there with other circuits. This is a cool trick to add to the group.
Inference world is gradually switching from INT formats to FP formats. FP8 is already supported in modern hardware, and FP4 support is coming. In my experiments I get better perplexity in language models with FP4 than with INT4.
How is FP fundamentally different than integers? I've done FPGA programming and it just seems like the programmer has to decide where/when to do the shifting based on the expected range of the data. I'm not sure how this is "natively supported" in hardware.
If you have designed FPUs you should know that FP computation involves a lot more additional operations than just shifting (e.g. rounding, subnormals, and special value handling). That’s why, for example, CPUs use different hardware blocks for INT vs FP computation.
But that’s not the point. The point is, this particular method to speed up matmul is not suitable for FP.
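A tiny Python illustration of the FP-vs-fixed-point difference (not about this paper, just the rounding point):

    # Floating-point addition is not associative because every operation rounds:
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0 -- the 1.0 is absorbed when added to -1e16 first
    # A fixed-point datapath with enough integer bits has no such rounding step,
    # which is one reason INT and FP units are built as separate hardware blocks.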
I'm no expert but I suspect this is wrong. To me, this is like saying you don't need to worry about integer overflow because your operations are only working on fixed integers. Really? You don't care if you multiply or add two large numbers and they spill over?
The more appropriate answer, I suspect, is that the numerical precision and stability, even after those sacrifices, are more than adequate for normal usage.
If I'm wrong about this, I would certainly like to know.
In hardware, you control your integer widths completely, so if you add two 32-bit ints to a 33-bit int, there is no chance of overflow. The same goes for multiplications, etc.
Yeah with shifts you can guarantee no overflow, but you have to decide under what circumstances is avoiding overflow/clipping worth the loss of precision.
Fixed point typically requires a lot more programming, but sometimes it's worth it if you know what the data ranges are.
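The bookkeeping the parent comments describe is simple enough to sketch (illustrative Python, not tied to any particular toolchain):

    import math

    def product_width(wa, wb):
        # a full-precision product of a wa-bit and a wb-bit signed value fits in wa + wb bits
        return wa + wb

    def accumulator_width(w, n_terms):
        # summing n_terms values of w bits grows the width by ceil(log2(n_terms)) bits
        return w + math.ceil(math.log2(n_terms))

    # e.g. an 8-bit x 8-bit MAC accumulating a 1024-term dot product:
    print(accumulator_width(product_width(8, 8), 1024))  # 26 -> a 26-bit accumulator never overflows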
I don't know why this answer is getting downvoted. This is absolutely correct.
W. Miller has a paper showing that, under conditions of numerical stability, O(n^3) multiplications are necessary [0]. Any algorithm that achieves sub-cubic runtime for matrix multiplication, like Strassen's or Coppersmith's, must sacrifice some amount of precision or stability.
The paper cited is about hardware, where there is no accuracy tradeoff because you control the numerical precision completely and use fixed point. In a software implementation, neither is true. There is no chance that you will get the exact same values out of this method that you do out of other FP matmuls.
This repository contains the source code for ML hardware architectures that require nearly half the number of multiplier units to achieve the same performance, by executing alternative inner-product algorithms that trade nearly half the multiplications for cheap low-bitwidth additions, while still producing identical output as the conventional inner product.
I’ve only glanced at it so someone correct me if I’m wrong, but IIUC this is not a replacement for matrix multiplication but rather an approximation that only gives decent-ish results for the types of linear systems you see in AI/ML. But for that use case it is totally fine?
The concept here goes a bit further and tries to replicate backprop with an external network, arguing that that's probably what the brain actually does.
I'm not seeing the connection. This work is about low-level optimization of matrix multiplication. The repo you linked seems to be about replacing back-propagated gradients with a cheaper estimate. What's the similarity you see between these two?
Correct, I think I mistook it as "use a small neural net to approximate matrix multiplication"; instead it seems to be "use cheaper replacements for matrix multiplication without much accuracy loss".
This feels like a "no free lunch" situation. I would imagine that any time saving in approximating the gradients this way would be lost to needing to train for more iterations due to the loss in gradient accuracy. Is that not the case?
However, if this is really the biological analogue of credit assignment, it might scale better than training LLMs from scratch every time. Even if, say, it could only approximate gradients to a certain degree for a new network, normal backprop could then fine-tune for a few epochs or so, dramatically reducing overall training costs.
This is very cool and a really interesting read! For those in the comments confused about how this is better, the paper is talking about synthesizing matrix multiplication pipelines in hardware, like an FPGA or ASIC. On a CPU or GPU you won't notice, because adds and multiplications generally take the same amount of time, but multiplication units take up many more transistors, so if you can reduce the circuit complexity you can increase the speed and parallel throughput and reduce power and routing complexity. This approach could be particularly useful for efficient sparse matrix multiplication accelerators.
Another cool way to eliminate multiplication in matrix multiplication is to use different semirings [1]. The Tropical Semiring [2], for example, substitutes addition for multiplication and min (or max) for addition. It's still matrix multiplication, but with substituted binary operations. The research in this relatively new field of Tropical Algebra [3] is quite active and rich right now, being used for all kinds of optimization problems and in research on optimizing neural networks [4]. This approach also lends itself to hardware synthesis, since most FPGA configurable logic blocks can add/min/max in one clock cycle, whereas efficient multiplication requires fixed dedicated on-chip hardware multipliers.
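A toy min-plus product in Python, to show how little is needed (my own sketch, not from any of the cited work):

    import numpy as np

    def tropical_matmul(A, B):
        # Min-plus ("tropical") matrix product: C[i, j] = min_k (A[i, k] + B[k, j]).
        # Multiplication is replaced by +, addition by min: no multipliers required.
        return np.min(A[:, :, None] + B[None, :, :], axis=1)

    # classic use: one step of the all-pairs shortest-path recurrence
    D = np.array([[0., 3., 10.],
                  [3., 0., 1.],
                  [10., 1., 0.]])
    print(tropical_matmul(D, D))  # path lengths using at most two edges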
Another way to efficiently remove multiplications with a different but related semiring is to use a Log Semiring [5]. If you have to multiply chains of probabilities (like Markov chains) then the numbers quickly become very small and floating point loses its accuracy to represent the numbers. By scaling the numbers first by taking the log, multiplication becomes addition and addition becomes x + log1p(exp(y - x)).
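A minimal sketch of that in Python (numpy's logaddexp is exactly the x + log1p(exp(y - x)) trick, implemented stably):

    import numpy as np

    p = np.full(200, 0.01)        # 200 transition probabilities of 1%
    print(np.prod(p))             # underflows to 0.0 in float64
    logp = np.log(p)
    print(np.sum(logp))           # "multiplication" is now addition: about -921.03
    # "addition" of probabilities becomes logaddexp:
    print(np.logaddexp(np.log(0.5), np.log(0.25)))  # == log(0.75)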
The paper in [4] is absolutely fascinating - I'm very much a neophyte here, but I believe it shows that practically any ReLU network can be represented as a tropical ratio of two tropical polynomials, and thus can be analyzed with geometric principles including visualizations of surfaces. It's been cited in more recent work: https://scholar.google.com/scholar?cites=1003719112553620451... - does anyone know if there's been any significant advancements here?
> By scaling the numbers first by taking the log, multiplication becomes addition and addition becomes x + log1p(exp(y - x)).
Addition/subtraction in a logarithmic number system is way more expensive than what you would spend on multiplication, especially if you care about correctly rounded results, as the (hardware) LUTs required are rather big.
Finite field log/antilog lookup tables are used for efficient-ish multiplication, similar to addition/subtraction tables used for logarithmic number systems.
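For anyone curious what that looks like, a sketch for GF(2^8) with the AES reduction polynomial 0x11b and generator 3 (a common choice; other fields and generators work the same way):

    # Build log/antilog tables once, then multiply with two lookups and an integer add.
    EXP = [0] * 510
    LOG = [0] * 256
    x = 1
    for i in range(255):
        EXP[i] = x
        LOG[x] = i
        # multiply x by the generator 3: x*3 = (x << 1) ^ x, reduced mod 0x11b
        x = (x << 1) ^ x
        if x & 0x100:
            x ^= 0x11b
    for i in range(255, 510):
        EXP[i] = EXP[i - 255]   # wrap so LOG[a] + LOG[b] never indexes out of range

    def gf_mul(a, b):
        if a == 0 or b == 0:
            return 0
        return EXP[LOG[a] + LOG[b]]

    assert gf_mul(2, 0x80) == 0x1b   # 0x80 * x overflows and is reduced by 0x11b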
I'm surprised this actually works, usually detecting whether to use multiplication or addition is slower than simply using multiplication. Especially if it's massive amounts of work being done in parallel.
> I conjecture that for every j > 0 in R, a number n exists so that any two n x n matrices can be multiplied together in O(n^(2+j)) steps.
Is this stated correctly? Because it seems almost meaningless as stated. You start with "for every j, there exists an n such that...". That would mean that for the rest of the statement, n and j are constant. So you are just saying that you can multiply constant sized matrices in constant time. Technically true, but I feel like you are trying to claim something stronger.
Assuming your claim is not equivalent to "matrix multiplication is in O(n^2)", then it is false assuming conventional mathematics (i.e. uncountable sets exist), because j in R is uncountable, and algorithms are countable...
You do not need a different algorithm for each real.
Just take the algorithm that proves the statement for a rational p; it proves the statement for all reals bigger than p. (Hence having as many algorithms as there are rationals, which is countably many, is enough.)
It does seem to be harder to make progress over time though. Maybe it’ll bottom out at j=1/e. I won’t call my guess even a conjecture though, it just happens to be a convenient constant near where the current value is. It would be a funny prank for math to pull on us.
Yeah, we're agreed that j cannot be less than 0.
But your conjecture was about _every j > 0_.
Do you have any specific line of reasoning which suggests that j can be arbitrarily close to 0 (0 is the greatest lower bound)? Why do you not think there's some other specific limit k \in (0, 0.3728596] beyond which j cannot be improved?
This readme does a really poor job of explaining what the improvement is or how they drop half the multiplications. What is the Big O run time on this? Is this shifting the known best bounds?
And the diagrams are chaotic and don't really explain anything about why this approach is fast or good. The result is that I'm reluctant to even click-through to the PDF.
If you want to improve the project's credibility, please consider being honest and open about what is actually going on, and give some clear explanations and illustrations, rather than things that may as well be designed to hype people too busy to tell you that you are cranks. It's hard to tell if this is incredibly groundbreaking or just a nothingburger. Sadly, I almost feel like that must be an intentional decision motivated by the poor merits of the work and a desire to exploit AI hype. The alternative, which I prefer to believe is the case, is that the author simply needs to revise and better contextualize.
The claim is they’re dropping half the multiplications, so it isn’t doing anything for Big O.
> If you want to improve the project credibility please consider being honest and open about what is actually going on and giving some clear explanations and illustrations,
The math explaining how to halve the number of multiplications in the paper (https://arxiv.org/abs/2311.12224) isn’t hard to understand.
You only have to read formulas 2 (traditional matrix multiplication) and 3 to 6.
I think it’s clear it does do what’s being advertised, halving the number of multiplications at the cost of lots of extra additions/subtractions.
They then go on to vectorize that algorithm better. That, as is usual for vectorization, starts to look messy quickly.
Thanks, good summary. Regarding numerical stability, the application is for fixed-point arithmetic, and therefore numerical stability is not an issue (the result is identical compared to using the conventional inner-product)
The readme doesn't explain much, but the introduction to the paper itself is quite approachable.
As for whether it's groundbreaking or not ... it's a neat readily-accessible constant-factor improvement for area-constrained fixed-point accelerators. That doesn't change everything overnight, but neither is it nothing. It's nice work.
> This readme does a really poor job of explaining what the improvement is or how they drop half the multiplications. What is the Big O run time on this? Is this shifting the known best bounds?
Without wishing to sound elitist, I don't understand the point of this comment at all. If you don't understand Big O notation enough to know that "half the multiplications" doesn't change it, then why are you even asking about it?
I made two really bad misunderstandings which I completely own. I can't seem to edit it, so please let this be my refutation and explanation of my own mistake.
First misunderstanding: I assumed this was a new large matrix multiplication algorithm building on the hype from last week or so where we saw this paper: https://arxiv.org/abs/2210.10173
It is not an algorithm, but a hardware design - a systolic array using roughly half of the silicon area of a baseline design.
Second misunderstanding: assuming that we were talking about an algorithm, I then further assumed that it reduced the algorithm's multiplications by half for some important n. I assumed it did this by accelerating some critical sub-procedure of the baseline algorithm into a more efficient big-O class without really changing the multiplicative factor, which is a common way to reduce the number of operations of an algorithm for some fixed n. As a consequence, I thought the author must be being sloppy by not telling us the full big-O details of the improvement, and just picking some n where it so happened that half of the multiplications vanish. That also seemed unlikely to be a consequence of an improvement in the matrix multiplication bound, given how incredibly slow progress on that bound has been. So I thought the author might even be a crank.
But it turned out the sloppy thinking was on me. I was being the crank.
Reading the paper's introduction made it very very clear that we were dealing with a systolic array that reduces silicon area per compute.
Even worse, that information is there in the first sentence of the readme as well.
Would a clearer sentence have helped me? Something like:
"We introduce a VHDL hardware design for a systolic array that nearly halves the silicon area of a baseline array, by replacing half of the multiplication-accumulate units (MACs) with simple adder units, exploiting Winograd's 1967 fast inner product formula (FIP)."
I'm honestly not sure, given how bad my mistake was to begin with. Not even the diagrams tipped me off; in hindsight they are very obviously hardware block diagrams, but I thought they were just needlessly complicated algorithmic diagrams! How silly!
I still believe that the readme could be simplified for a general audience of goobers like me. But first and foremost I have to admit that I was being a goober!
Does that help you understand my mistake here? I do understand Big O and why cutting operations by half is typically a constant factor improvement. But apparently I don't understand it well enough to prevent me from retconning a narrative with some very stinky assumptions and then projecting them on to the poor innocent hardware designer. Not very proud about that.
This is an analogy. a^2 - b^2 = aa - bb. This can be factored to (a+b)(a-b). In the first expression there are two multiplies, in the factored version there is only one.
However, from a numerical analysis / accuracy standpoint, this kind of rewriting can change the accuracy of the result: whenever a formula ends up subtracting nearly equal quantities (here the a - b factor when a and b are close and already carry rounding error; in the inner-product trick discussed here, the large sums subtracted at the end), significant digits can be lost. This is especially true if you repeatedly and sequentially do a lot of these operations. Loss of precision can be a problem in numerical modeling (like climate simulation) -- long-term predictions diverge.
Given that there is a drive to use greatly reduced precision in ML engines, loss of precision might have an effect on how a model performs. Then again, it might not. I haven't read a lot of papers on ML, but I don't recall seeing ones that try to quantify how sensitive a model is to error propagation. (I am making a distinction between tests where the precision is reduced to see where it breaks down v.s. calculating / understanding what the error level actually is in a model)
With LLMs they start showing signs of brain damage once the errors get too high. In my experience it reduces their ability to reason, they stop counting correctly, and they start homogenizing categories, like calling a lemur a monkey. Compare this with quantizing weights, which instead of brain damage leads to ignorance, forcing them to hallucinate more.