> “Exponent two” refers to the ideal speed — in terms of number of steps required — of performing one of the most fundamental operations in math: matrix multiplication. If exponent two is achievable, then it’s possible to carry out matrix multiplication as fast as physically possible.
Can anyone clarify what "as physically possible" means here? Are there physical systems that perform matrix multiplication in O(n^2)? (I feel silly even asking that about a physical, non-computational system.)
I remember there being applications of optical computing, including a startup that wanted to accelerate deep network inference that way: they had a way to do matrix multiplication using a projector (for the data matrix) and a 3D-printed lens (for the network weights).
I don’t think they’re actually intending to refer to physics here—they could have said “as fast as theoretically possible” instead.
A square matrix with n rows and columns has n^2 entries, so if you need to look at all the entries in two matrices to compute their product, then you need to do O(n^2) work just to look at all the relevant entries.
The only way you could get around this is if you know many entries are 0, or identical, or have some other structure such that you don’t need to look at them separately.
I don't think they're talking about total operations here, as pointed out in the article this is just about the number of multiplications.
Every matrix multiplication can be boiled down to multiplying rows by columns. This is definitionally required for matrix multiplication, so that gives a lower bound on how many multiplications are required: one multiplication for every final element in the n x n result, or n^2.
They are talking about total operations, but since big O notation ignores constant factors, and they explicitly ignore the addition steps (because for large n the multiplications dominate), it is, in effect, only counting the multiplications.
To answer the GP, the minimum amount of work necessary is to write a number for each element of the result, which scales as n^2 since there are n^2 elements. It is of course only possible to beat this if you can make assumptions like “the diagonal of my matrix is nonzero and everything else is zero”, but that’s not general matrix multiplication anymore, that’s a specialized sparse matrix routine. For a dense, “normal” matrix, O(n^2) is the absolute best we could hope for.
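To make that kind of structural assumption concrete, here's a toy Python sketch (the function name and storage format are just for illustration) of the diagonal special case, where the product can be computed and written in O(n) precisely because the output is known to be diagonal too:

```python
def diag_matmul(d1, d2):
    """Multiply two diagonal n x n matrices, each stored as the length-n
    vector of its diagonal. The product is also diagonal, so only n
    numbers get computed and written - beating the n^2 bound, but only
    because structure was assumed up front. Not general matrix multiplication."""
    return [a * b for a, b in zip(d1, d2)]

print(diag_matmul([1, 2, 3], [4, 5, 6]))  # [4, 10, 18]
```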
We're saying approximately the same thing, though I'd argue that your phrasing is slightly more confusing to people that aren't super familiar with Big O notation (and therefore don't exactly understand what is meant by "ignores constant factors" and similar).
Reads from the original two matrices and writes to the final matrix aren't at all the crux of the problem, so I don't think it makes sense to focus on either of those things in your explanation. The goal of the problem is to reduce the number of multiplications which are required to find the values that you then write (of which there will always be n^2 writes). There will also be 2n^2 reads, but likewise that isn't relevant because it's not what you're trying to optimize.
The relevant idea here is that the naive method of multiplying two matrices involves n multiplications for every row-column pair, for a total time complexity of n^3 (n rows, n columns, n multiplications each). The goal is to bring that third factor down from n to something >= 1, with the holy grail obviously being 1 itself.
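For concreteness, here's a small Python sketch of that naive method (plain lists, nothing clever), with the three nested loops spelling out the n * n * n structure:

```python
def naive_matmul(A, B):
    """Textbook row-times-column multiplication of two n x n matrices:
    each of the n^2 output entries is a dot product costing n
    multiplications, so n^3 multiplications in total."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):          # n rows
        for j in range(n):      # n columns
            for k in range(n):  # n multiplications per row-column pair
                C[i][j] += A[i][k] * B[k][j]
    return C
```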
It's not at all obvious to me why the number of multiplications must be at least n². Can you prove that? (To be clear, n² multiplications is most certainly the lower bound, but I can't come up with a good argument as to why.)
This is why the trivial "you must read n² entries from each input matrix and write n² entries in the end" is helpful: It's an extremely obvious lower bound on the asymptotic complexity. Sure, it's not what we're trying to reduce (and getting < n² multiplications would be a ridiculous result), but anyone can see that it's true.
If matrix A is nonzero only in its first column (entries a_i) and B is nonzero only in its first row (entries b_j), then A * B contains all n^2 pairwise products a_i * b_j.
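A tiny numpy check of that construction, with made-up 3x3 values:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0])    # entries of A's only nonzero column
b = np.array([7.0, 11.0, 13.0])  # entries of B's only nonzero row

A = np.zeros((3, 3)); A[:, 0] = a   # A is zero outside its first column
B = np.zeros((3, 3)); B[0, :] = b   # B is zero outside its first row

# (A @ B)[i, j] == a[i] * b[j], so the product contains every one of
# the n^2 pairwise products a_i * b_j.
print(np.allclose(A @ B, np.outer(a, b)))  # True
```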
To be honest I’m not sure whether that really proves it. Does computing a single multiplication require a multiplication?
By definition, a matrix multiplication takes rows and multiplies them by columns. For a pair of square matrices, where there are n rows in A and n columns in B, you (by definition) need to multiply them by each other at least once, otherwise it's not matrix multiplication, it's matrix... something else.
> By definition, a matrix multiplication takes rows and multiplies them by columns.
> you (by definition) need to multiply them by each other at least once
Why does multiplying a row with a column require at least one multiplication? "By definition" it takes n multiplications (!), but in the context of matrix multiplication we can do it with less - possibly with O(1) multiplications. And it seems neither you nor me can come up with a good argument why it can't take less, despite discussing it for a while.
Which is why "you need to write n² entries" is used as an argument for the O(n²) bound, even if it has nothing to do with the number of multiplications.
You need n^2 time just to read the two matrices, so I always thought of that as a natural lower bound. (Of course, we're talking about multiplications eventually here, so may not be exactly what they meant...)
An (n x n) square systolic array can perform matrix multiplication in time O(n) (not counting I/O, which takes O(n^2) without more specialty hardware) -- this is how TPUs work. Simulate the same algorithm on a single processor and it takes O(n^3). Use the same hardware to compute larger matrix products and it only gives a constant factor speedup.
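A rough single-threaded simulation of that idea, assuming an output-stationary layout (a toy sketch, not how a real TPU is programmed): processing element (PE) (i, j) holds C[i][j], and the inputs are skewed so that it sees A[i][k] and B[k][j] at step i + j + k. The two inner loops happen simultaneously across the n^2 PEs in hardware, so the wall-clock time is just the number of outer steps, O(n):

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy simulation of an n x n output-stationary systolic array.
    Each of the n^2 PEs does at most one multiply-accumulate per step;
    with skewed inputs the whole product is done after ~3n steps."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for t in range(3 * n - 2):       # wall-clock steps: O(n)
        for i in range(n):           # these two loops run in parallel
            for j in range(n):       # across the n^2 PEs in hardware
                k = t - i - j        # which operand pair arrives now
                if 0 <= k < n:
                    C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(4, 4); B = np.random.rand(4, 4)
print(np.allclose(systolic_matmul(A, B), A @ B))  # True
```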
The other answers are good (it takes O(n^2) time just to read two matrices, so you can't possibly multiply them more quickly than that), but they leave out a critical part of that analysis that I want to highlight just in case it would have otherwise bitten you in the future:
You _do_ actually have to read all the entries in both matrices -- if you know K-1 out of K values then in general you still can't uniquely compute the output of matrix multiplication. Consequently, any algorithm relying on only a strict subset of the entries is at best an approximation.
If you have more information (e.g., that one or both of the matrices is sufficiently sparse) then you can potentially do better.
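A quick sanity check of the "every entry matters" point, with made-up random matrices: perturb one input entry an algorithm supposedly skipped, and the product changes.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 4))
B = rng.random((4, 4))

C_before = A @ B
A[2, 3] += 1.0                 # change a single "unread" entry of A
C_after = A @ B

print(np.allclose(C_before, C_after))  # False: row 2 of the product changed
```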
I think if you could somehow start computing the resultant matrix elements as soon as you read a row/column from the input ones, you could reach their "physically possible" limit.
An armchair expert on computer architecture here, but could a large enough systolic array be used to achieve their "physically possible" limit? [0]
New CUDA GPUs have been coming with these tensor cores, which are just systolic arrays. Google's TPUs are the same.
Could someone with more knowledge on systolic arrays comment on whether a large systolic array can achieve this?
You cannot reduce the number of operations by waving a magic architecture wand. Hence the no. But you can do operations in parallel rather than in series. Same number of operations, but it takes less time.
So if a systolic array is built for the same scale as your problem, your bottleneck does indeed become how quickly you can send data to/from that dedicated system.
It's badly phrased. What they mean is that we know that it will take at least n^2 operations, and that it would be nice to find an algorithm that is actually that fast.