Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: Matrix Multiplication with Half the Multiplications (github.com/trevorpogue)
310 points by emacs28 10 months ago | hide | past | favorite | 77 comments



This looks pretty cool! What's the catch? e.g. why isn't this already implemented in accelerators, is it really just a forgotten algorithm, or this has some implications on the cost of building the accelerator or else?


It's not just a software algorithm. It's a hardware architecture optimization. To benefit, you have to build hardware that matches the dimensions of the algorithm. That's an expensive commitment.


> you have to build hardware that matches the dimensions of the algorithm

Yes the benefits are realized in custom hardware designs as opposed to software, however, the hardware architectures work for multiplying matrices of arbitrary dimensions by splitting up larger matrices into smaller tiles, then summing up the tile products to form the final larger matrix products (i.e. GEMM)


Not so much in FPGA ... although I'm not sure top end FPGAs would beat Nvidia TPUs even with this algorithm, and even if cost were not a consideration.


IMHO, for fixed-point MM accelerators, there is no catch, I think it's an overlooked algorithm. It's based on an algorithm by Winograd who coincidentally also proposed another unrelated algorithm that later became very popular for CNN acceleration which would take some visibility away from this other algorithm by Winograd... But that is speculative


On the other hand, if you tried it with floating point, you'd lose significant digits. Since the approach is to sum (a[i] + b[i+1])(a[i+1] + b[i]) and subtract the sums of a[i]a[i+1] and b[i]b[i+1] in the end to get a[i]b[i] + a[i+1]b[i+1], you may be taking the difference of two large values to get a small value, losing precision.


[dead]


On a tangent, go is so elegant.


LLM hype and this submission in particular keep making me think of a lecturer I had for Topics in Large Dimensional Data Processing, circa 2016: as I recall he was enthusiastically adamant that the most important thing, breakthroughs etc., in years/decades to come was going to be faster matrix operations. Anyway, I'm pretty sure I recognise FIP (not FFIP of course) from that course.

I wish I could remember his name, I believe he left academia after my year and went to work in industry, I'd just be curious to see what he's up to now. I'm not saying it was a particularly novel or prescient comment/attitude, we may not have had quite such ML hype but certainly 'big data' was all the rage at the time, it's just something that's stuck in my mind. One of those areas I always meant to study more, just realistically probably never had the mathematical chops for and certainly those I did have atrophied.


Maybe I’m joking, but: our society is just a vehicle for economics at this point, our economy is built around science, our science has mostly been turned into observations about engineering, some time ago we changed all of engineering into differential equations, and differential equations can be solved by discretizing them and doing linear algebra, and most of linear algebra can be done with matrix multiplications (triangular solves and orthonormalizations if you are fancy). All you need is matmul.


> our science has mostly been turned into observations about engineering

You may be joking but that in particular seems pretty astute.

Superficially it seems accurate, and reasonably ascribable to economic forces, fewer concentrations of capital in people (aristocrats) spending it on a hobby interest or academic pursuit of their own - today's equivalents mostly prefer philanthropy (Musk is, I suppose, for whatever else you might think of him, a notable exception - preferring to explore space, AI, etc. not really it seems for personal monetary gain). But I wonder if that is fair, to modern scientists, or is it just that 'all the low-hanging stuff's been done'?


For life sciences need grad students / postdocs to do the grunt work of pipetting, dissected, plating etc. And whatever the equivalent is in chemistry (titration/GC/mass transfer I guess)?

But those tools created by engineers are pretty darn important, and allow plenty of experiments/observations to be performed that were previously out of reach.


So, what you're saying is ...

... that the Matrix creates the world around us.

Thanks.


Unless he was a guest lecturer, if the course was for credit, wouldn't his name appear on your official transcript?


I don't think so, this may be the case in your country of course. I may well have recorded it in my notes if I dig them out, but this was a fourth year course and they certainly degraded over the years.


There are a lot of matrix multiplication algorithms out there with a lot of pluses and minuses. It's always a balance of accuracy, runtime, and scaling. This one probably has bad accuracy in floating point.


For everyone discussing the reduced accuracy/numerical stability of the algorithms in floating-point, this is true. But note that the application of the algorithms in the work is explored for fixed-point MM/quantized integer NN inference, not floating-point MM/inference. Hence, there is no reduction in accuracy for that application of it compared to using conventional fixed-point MM.


"Conventional fixed-point MM" is a large suite of algorithms. It is correct that this is a 2x reduction in MULs compared to naive fixed-point matrix multiply, but there is a large body of literature out there with other circuits. This is a cool trick to add to the group.


Inference world is gradually switching from INT formats to FP formats. FP8 is already supported in modern hardware, and FP4 support is coming. In my experiments I get better perplexity in language models with FP4 than with INT4.


How is FP fundamentally different than integers? I've done FPGA programming and it just seems like the programmer has to decide where/when to do the shifting based on the expected range of the data. I'm not sure how this is "natively supported" in hardware.


If you have designed FPUs you should know that FP computation involves a lot more additional operations than just shifting (e.g. rounding, subnormals, and special value handling). That’s why, for example, CPUs use different hardware blocks for INT vs FP computation.

But that’s not the point. The point is, this particular method to speed up matmul is not suitable for FP.


I'm no expert but I suspect this is wrong. To me, this is like saying you don't need to worry about integer overflow because your operations are only working on fixed integers. Really? You don't care if you multiply or add two large numbers and they spill over?

The more appropriate answer, I suspect, is that the numerical precision and stability sacrifices are more than adequate for normal usage.

If I'm wrong about this, I would certainly like to know.


In hardware, you control your integer widths completely, so if you add two 32-bit ints to a 33-bit int, there is no chance of overflow. The same goes for multiplications, etc.


Yeah with shifts you can guarantee no overflow, but you have to decide under what circumstances is avoiding overflow/clipping worth the loss of precision.

Fixed point typically requires alot more programming, but sometimes its worth it if you know what the data ranges are.


I don't know why this answer is getting downvoted. This is absolutely correct.

W. Miller has a paper discussing, under conditions of numerical stability, O(n^3) multiplications is necessary [0]. Any algorithm that gets sub cubic runtime for matrix multiplication, like Strassen's or Coppersmith's, must sacrifice some amount of precision or stability.

[0] https://epubs.siam.org/doi/10.1137/0204009



The document said it outputs the exact same values as the conventional method. There is no accuracy trade off here.


The paper cited is about hardware, where there is no accuracy tradeoff because you control the numerical precision completely and use fixed point. In a software implementation, neither is true. There is no chance that you will get the exact same values out of this method that you do out of other FP matmuls.


For floating point? Are you sure?


Opening statement of README

    This repository contains the source code for ML hardware architectures that 
    require nearly half the number of multiplier units to achieve the same 
    performance, by executing alternative inner-product algorithms that trade 
    nearly half the multiplications for cheap low-bitwidth additions, while still 
    producing identical output as the conventional inner product.


I just looked at the paper: the answer is no, floating point is not supported.


It's not quite forgotten. It kind of lives on in the pseudo-dot product Wegman-Carter authenticators like UMAC. See Section 3 of [1] for context.

[1] https://cr.yp.to/antiforgery/pema-20071022.pdf


I’ve only glanced at it so someone correct me if I’m wrong, but IIUC this is not a replacement for matrix multiplication but rather an approximation that only gives decent-ish results for the types of linear systems you see in AI/ML. But for that use case it is totally fine?


It produces identical/bit-equivalent results as conventional/naive matrix multiplication for integer/fixed-point data types


Perhaps it's less of a hidden gem and more of a spotlight moment.


Man I remembered something similar I had tried working on in 2018, but gave up after all my PhD applications got rejected.

https://github.com/ixaxaar/pytorch-dni

The concept here goes a bit further and tries to replicate backprop with an external network, arguing that that's probably what the brain actually does.


I'm not seeing the connection. This work is about low-level optimization of matrix multiplication. The repo you linked seems to be about replacing back-propagated gradients with a cheaper estimate. What's the similarity you see between these two?


Correct, I think I mistook it as "use a small neural net to approximate matrix multiplication" instead it seems as "use cheaper replacements of matrix mul without much acc loss".

Wellll that means I can give dni another try :D


This feels like a "no free lunch" situation. I would imagine that any time saving in approximating the gradients this way would be lost to needing to train for more iterations due to the loss in gradient accuracy. Is that not the case?


I think that's the reason for its dead end.

However if this is really the biological analogue of credit assignment, this might scale better than training llms from scratch every time. Even if say it could approx gradients to a certain degree given a new network, normal backprop could further tune for a few epochs or so dramatically reducing overall training costs.


Unrelated to the technical discussion but I was wondering what you made that architecture gif with? Looks neat!


I think that image is from the paper and was not created by me. Looks cool indeed!


This is very cool and a real interesting read! For those in the comments confused about how this is better, the paper is talking about synthesizing matrix multiplication pipelines in hardware, like an FPGA or ASIC. On a CPU or GPU you won't notice because adds and multiplications take the same amount of time generally, but multiplication units takes up many more transistors, so if you can reduce the circuit complexity you can increase the speed and parallel throughput and reduce power and routing complexity. This approach could be particularly useful for efficient sparse matrix multiplication accelerators.

Another cool way to eliminate multiplication in matrix multiplication is to use different semirings [1]. The Tropical Semiring [2] for example substitutes addition for multiplication and min (or max) for addition. It's still matrix multiplication but with substituted binary operations. The research in this relatively new field of Tropical Algebra [3] is quite active and rich right now, being used for all kinds of optimization problems and in research for optimizing neural networks [4] . This approach also lends itself to hardware synthesis since most FPGA configurable logical blocks can add/min/max in one clock cycle, whereas efficient multiplication requires fixed dedicated on-chip hardware multipliers.

Another way to efficiently remove multiplications with a different but related semiring is to use a Log Semiring [5]. If you have to multiply chains of probabilities (like Markov chains) then the numbers quickly become very small and floating point loses its accuracy to represent the numbers. By scaling the numbers first by taking the log, multiplication becomes addition and addition becomes x + log1p(exp(y - x)).

[1] https://en.wikipedia.org/wiki/Semiring

[2] https://en.wikipedia.org/wiki/Tropical_semiring

[3] https://en.wikipedia.org/wiki/Tropical_geometry

[4] https://proceedings.mlr.press/v80/zhang18i/zhang18i.pdf

[5] https://en.wikipedia.org/wiki/Log_semiring


The paper in [4] is absolutely fascinating - I'm very much a neophyte here, but I believe it shows that practically any ReLU network can be represented as a tropical ratio of two tropical polynomials, and thus can be analyzed with geometric principles including visualizations of surfaces. It's been cited in more recent work: https://scholar.google.com/scholar?cites=1003719112553620451... - does anyone know if there's been any significant advancements here?


Whoah, this is what Unified Algebra is all about!

http://www.cs.toronto.edu/~hehner/UA.pdf


So what's the difference between this unified algebra, and universal algebra[0]?

[0] https://en.wikipedia.org/wiki/Universal_algebra


This is what HN is about :)

I understood a fraction but instantly wanted to dive into the topic just after reading such a knowledgeable comment.


> By scaling the numbers first by taking the log, multiplication becomes addition and addition becomes x + log1p(exp(y - x)).

Addition/subtraction in a logarithmic number system is way more expensive than what you would spend on multiplication, especially if you care about correctly rounded results, as the (hardware) LUTs required are rather big.


> By scaling the numbers first by taking the log, multiplication becomes addition and addition becomes x + log1p(exp(y - x)).

Isn't this the same approach as in GF(2^x), which has been in use for decades? The only limitation that comes to mind is the field size.


Finite field log/antilog lookup tables are used for efficient-ish multiplication, similar to addition/subtraction tables used for logarithmic number systems.


somewhat related is the number theoretic transform

https://ieeexplore.ieee.org/abstract/document/1451721


I'm surprised this actually works, usually detecting whether to use multiplication or addition is slower than simply using multiplication. Especially if it's massive amounts of work being done in parallel.


Wonder how well it compares to openblas and cublas.


I find it fascinating that this is using a process invented in 1968 and hasn't been used for this purpose until now!


Hey, nobody knew what to do with GF(2^x) up until mid last century either... Oh wait, CS was not really a thing almost up until mid last century...


If you're interested in the mathematical theory behind sub-cubic algorithms for matrix multiplications, you can start from here: https://en.wikipedia.org/wiki/Matrix_multiplication_algorith...

I conjecture that for every j > 0 in R, a number n exists so that any two n x n matrices can be multiplied together in O(n^(2+j)) steps.

(Now proven for for 2+j = w = 2.3728596, or j > 0.3728596)


> I conjecture that for every j > 0 in R, a number n exists so that any two n x n matrices can be multiplied together in O(n^(2+j)) steps.

Is this stated correctly? Because it seems almost meaningless as stated. You start with "for every j, there exists an n such that...". That would mean that for the rest of the statement, n and j are constant. So you are just saying that you can multiply constant sized matrices in constant time. Technically true, but I feel like you are trying to claim something stronger.


It should simply say:

for any j>0 there exists an algorithm multiplying nxn matrices in time O(n^{2+j}).


You are correct, I apologize for the confusion! :)


Assuming your claim is not equivalent to "matrix multiplication is in O(n^2)", then it is false assuming conventional mathematics (i.e. uncountable sets exist), because j in R is uncountable, and algorithms are countable...


You do not need a different algorithm for each real. Just take the algorithm that proves the statement for rational p. It proves the statement for all reals bigger than p. (hence having as many algorithms as there are rational (countably many) is enough)


Then that’s basically saying there is a O(n^2) algorithm which I dealt with in my first sentence.


Are algorithms on a Real RAM [1] countable? That would mean that there are constants you can't express.

[1] https://en.wikipedia.org/wiki/Real_RAM


It does seem to be harder to make progress over time though. Maybe it’ll bottom out at j=1/e. I won’t call my guess even a conjecture though, it just happens to be a convenient constant near where the current value is. It would be a funny prank for math to pull on us.


Predicting that this holds for any j > 0 seems rather bold. Would you care to share your intuition why you think that's the case?


Two matrices with size NxN each can be multiplied naively with the schoolbook algorithm in O(N^3).

It's clear that the algorithm needs at least O(N^2) because to access each element of the matrices once, you need a double for loop, which is O(N^2).

  for i in rows
      for j in cols
          # do something with element matrix1 [i, j], matrix2[i, j],...

so it has to be j >= 0


His question was: what is the reasoning behind there existing an algorithm running in time n^2+epsilon for really small epsilon.


Yeah, we're agreed that j cannot be less than 0. But your conjecture was about _every j > 0_. Do you have any specific line of reasoning which suggests that j can be arbitrarily close to 0 (0 is the greatest lower bound)? Why do you not think there's some other specific limit k \in (0, 0.3728596] beyond which j cannot be improved?


This readme does a really poor job of explaining what the improvement is or how they drop half the multiplications. What is the Big O run time on this? Is this shifting the known best bounds?

And the diagrams are chaotic and don't really explain anything about why this approach is fast or good. The result is that I'm reluctant to even click-through to the PDF.

If you want to improve the project credibility please consider being honest and open about what is actually going on and giving some clear explanations and illustrations, rather than things that may as well be designed to hype people too busy to tell you that you are cranks. It's hard to tell if this is incredibly groundbreaking or just but nothingburger. Sadly I almost feel like that must be an intentional decision motivated by poor merits of work and a desire to exploit AI height. The alternative - which I prefer to believe is the case - is that the author simply needs to revise and better contextualize.


> What is the Big O run time on this?

The claim is they’re dropping half the multiplications, so it isn’t doing anything for Big O.

> If you want to improve the project credibility please consider being honest and open about what is actually going on and giving some clear explanations and illustrations,

The math explaining how to halve the number of multiplications in the paper (https://arxiv.org/abs/2311.12224) isn’t hard to understand.

You only have to read formulas 2 (traditional matrix multiplication) and 3 to 6.

I think it’s clear it does do what’s being advertised, halving the number of multiplications at the cost of lots of extra additions/subtractions.

They then go on to better vectorize that algorithm. That, as is usual for that, gets looking messy soon.

My main concern would be numerical stability.


Thanks, good summary. Regarding numerical stability, the application is for fixed-point arithmetic, and therefore numerical stability is not an issue (the result is identical compared to using the conventional inner-product)


The readme doesn't explain much, but the introduction to the paper itself is quite approachable.

As for whether it's groundbreaking or not ... it's a neat readily-accessible constant-factor improvement for area-constrained fixed-point accelerators. That doesn't change everything overnight, but neither is it nothing. It's nice work.


> This readme does a really poor job of explaining what the improvement is or how they drop half the multiplications. What is the Big O run time on this? Is this shifting the known best bounds?

Without wishing to sound elitist I, I don't understand the point of this comment at all. If you don't understand Big O notation enough to know that "half the multiplications" doesn't change it then why are you even asking about it?


I made two really bad misunderstandings which I completely own. I can't seem to edit it, so please let this be my refutation and explanation of my own mistake.

First misunderstanding: I assumed this was a new large matrix multiplication algorithm building on the hype from last week or so where we saw this paper: https://arxiv.org/abs/2210.10173

It is not an algorithm, but a hardware design - a systolic array using roughly half of the silicon area of a baseline design.

2. Assuming that we were talking about an algorithm, I then further assumed that it reduce an algorithm's multiplications by half for some important n. I assumed that it did this by accelerating some critical sub procedure in the baseline algorithm into a more efficient big O class without really changing the multiplicative factor. This is a common way to reduce the number of operations of an algorithm for some fixed n. As a consequence I thought that the author must be being sloppy by not telling us the full big o details of the improvement, and just picking some n where it just so happened that half of the multiplications vanish. That also seemed unlikely to be a consequence of a improvement in the bound of matrix multiplication, given how incredibly slow the progress on matrix multiplication bound has been. So I thought that the author might even be a crank.

But it turned out the sloppy thinking was on me. I was being the crank.

Reading the paper's introduction made it very very clear that we were dealing with a systolic array that reduces silicon area per compute.

Even worse that information is there in the first sentence of the readme as well.

Would a clearer sentence have helped me? Something like:

"We introduce a VHDL hardware design for a systolic array that nearly halves the silicon area of a baseline array, by replacing half of the multiplication-accumulate units (MACs) with simple adder units, exploiting Winograd's 1967 fast inner product formula (FIP)."

I'm honestly not sure, given how bad my mistake was to begin with. Not even the diagrams tipped me off - in hind site they are very obviously hardware block diagrams, but I thought that they were just needlessly complicated algorithmic diagrams! How silly!

I still believe that the readme could be simplified for a general audience of goobers like me. But first and foremost I have to admit that I was being a goober!

Does that help you understand my mistake here? I do understand Big O and why cutting operations by half is typically a constant factor improvement. But apparently I don't understand it well enough to prevent me from retconning a narrative with some very stinky assumptions and then projecting them on to the poor innocent hardware designer. Not very proud about that.


It´s actually fairly clear


Not to everyone. If it's clear to you, you could helpfully explain it.


This is an analogy. a^2 - b^2 = aa - bb. This can be factored to (a+b)(a-b). In the first expression there are two multiplies, in the factored version there is only one.

However, from a numerical analysis / accuracy standpoint, evaluating the factored expression can result in loss of precision in the result when a is close to b. This is especially true if you repeatedly and sequentially do a lot of these operations. Loss of precision can be a problem in numeral modeling (like climate simulation) -- long term predictions diverge.

Given that there is a drive to use greatly reduced precision in ML engines, loss of precision might have an effect on how a model performs. Then again, it might not. I haven't read a lot of papers on ML, but I don't recall seeing ones that try to quantify how sensitive a model is to error propagation. (I am making a distinction between tests where the precision is reduced to see where it breaks down v.s. calculating / understanding what the error level actually is in a model)


With LLMs they start showing signs of brain damage once the errors get too high. In my experience it reduces their ability to reason, they stop counting correctly, and they start homogenizing categories, like calling a lemur a monkey. Compare this with quantizing weights, which instead of brain damage leads to ignorance, forcing them to hallucinate more.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: