Matrix multiplication inches closer to mythic goal (quantamagazine.org)
310 points by yboris on March 23, 2021 | 217 comments



If you are interested in how those tensor methods work, the authors have another very nice paper that gives a great introduction: https://arxiv.org/abs/1810.08671. It is actually not that different from Strassen after all; it just uses a bit more notation to help construct more clever ways of saving multiplications.


I'm more interested in why saving multiplications at the cost of more adds/subtracts helps overall.

On all modern CPUs, floating-point multiplication and addition run at the same speed. Specifically, both have 4 cycles of latency on Skylake and 3 cycles of latency on Zen 2, and the throughput (for 32-byte vectors) is 0.5 cycles per instruction on both CPUs.


AFAIK the small improvements to asymptotic complexity achieved in the last decade or so don't really mean better practical performance due to higher constant factors.

That probably isn't directly the point either. Perhaps new methods could lead to practical performance improvements in the long run, but my impression is that the recent improvements have been more of theoretical than practical interest.

Even the asymptotic improvements are getting rather minuscule, though.


Right, the recent developments are interesting mainly because they're exploring the limits of the current method. Hopefully this will eventually inspire a sufficiently new approach to get substantial gains.


You actually save additions too.

If you look at Strassen's algorithm [1] you see that it splits each matrix into 4 blocks. Naively you can do 8 multiplications of those blocks, as well as 4 matrix additions, and recover the full matrix multiplication.

If T(n) is the number of operations (additions as well as multiplications) for an nxn matrix multiplication, we can write the equation T(n) = 8T(n/2) + 4(n/2)^2. If we solve this, we end up with T(n) ~ n^3, which is of course the time complexity of naive matrix multiplication.

However, with Strassen you end up with 7 smaller matrix multiplications and 18 matrix additions. The equation we get now is T(n) = 7T(n/2) + 18(n/2)^2. We end up with T(n) ~ n^(2.8).

We see that when we save _matrix_ multiplications we eventually end up saving both elementary multiplications _and_ additions.

[1]: https://en.wikipedia.org/wiki/Strassen_algorithm
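
For anyone who wants to see the 7-multiplication trick concretely, here is a minimal Python/numpy sketch of the textbook Strassen recursion (my own illustration, assuming the matrix size is a power of two; real implementations would fall back to the ordinary algorithm below some cutoff):

    import numpy as np

    def strassen(A, B):
        # 7 block multiplications + 18 block additions/subtractions per level
        n = A.shape[0]
        if n == 1:
            return A * B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        M1 = strassen(A11 + A22, B11 + B22)
        M2 = strassen(A21 + A22, B11)
        M3 = strassen(A11, B12 - B22)
        M4 = strassen(A22, B21 - B11)
        M5 = strassen(A11 + A12, B22)
        M6 = strassen(A21 - A11, B11 + B12)
        M7 = strassen(A12 - A22, B21 + B22)
        return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                         [M2 + M4, M1 - M2 + M3 + M6]])

    A, B = np.random.rand(8, 8), np.random.rand(8, 8)
    assert np.allclose(strassen(A, B), A @ B)

Plugging the 7 recursive calls into the recurrence above gives n^(log2 7) ≈ n^2.81.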


My guess: general software multiplication is usually costlier than addition: it will pay off if you go for divide-and-conquer and your items are themselves matrices, or if they are arbitrary precision integers, or fractions, or elements of any ring really.


On the off chance an expert reads this comment and feels like answering, is there a consensus opinion (or even just your opinion) on what the best result is likely to be? Do people think O(n^2) is actually possible?

Edit: shoulda read to the end, it says that Strassen says no.


The impression I had is that many experts believe that no single algorithm A can achieve O(n^2), but there exists a sequence of algorithms A_i that achieve O(n^{2+d_i}), with d_i>0 getting arbitrarily small, but with the algorithms getting arbitrarily long and the constant overhead factors out front getting arbitrarily large.

Unfortunately this was just word of mouth, and I might be misremembering.


This is what I choose to believe. Even if it is not true I will still believe it because it is so satisfying.


Yea. I've always wondered whether it was motivated purely by intuition about elegance, or whether there is another computational task that has been shown to have this feature. (The finite-length proof that an infinite tower of algorithms exist without them actually being specifiable in finite length would I guess require some Gödelian jiu jitsu?)


Hmm, I guess I don't get it. I feel like if you have such a sequence of algorithms A_i, you can "string them together" to get a single algorithm A which has the property that it has complexity O(n^{2 + epsilon}) for any epsilon > 0. Specifically, suppose we know that A_i requires time at most C_i + D_i n^{2 + 1/i}. Then there exists N_i such that, for any n >= N_i, C_i + D_i n^{2 + 1/i} >= C_{i+1} + D_{i+1} n^{2 + 1/(i+1)}, and so can't we specify A by saying "if the input has size n where N_i <= n < N_{i+1}, perform A_i"?
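
Just to make the construction concrete, here is a toy Python model of the idea (entirely made-up cost functions C_i + D_i n^{2 + 1/i} with constants growing in i; these are not real algorithms). It picks whichever hypothetical A_i is cheapest for a given n, and you can watch the chosen i creep upward as n grows:

    # made-up (i, C_i, D_i) triples with rapidly growing constants
    family = [(i, 10.0 ** (3 * i), 10.0 ** i) for i in range(1, 6)]

    def cost(i, C, D, n):
        return C + D * n ** (2 + 1 / i)

    for n in [10, 10**3, 10**6, 10**9]:
        i, C, D = min(family, key=lambda t: cost(*t, n))
        print(f"n = {n:>10}: cheapest member is A_{i}, cost ~ {cost(i, C, D, n):.2e}")

Of course, as the reply below points out, an actual program can only hard-code finitely many family members, which is exactly where the subtlety lies.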


An algorithm has finite length, but an infinite sequence of algorithms need not. For algorithm A to actually carry out "if the input has size n where N_i <= n < N_{i+1}, perform A_i", it needs to be able to generate the code for each of the infinite number of A_i's.


I see, thanks. Yeah that kind of thing is definitely above my pay grade.


The choice of A_i now depends on n. The complexity of your construction is O(n^{2 + 1/max{ i : N_i <= n }}), not O(n^2).


Oh, I wasn't claiming it's O(n^2), just O(n^{2 + epsilon}) for any epsilon > 0. Indeed, the function inside your O is dominated by n^{2 + epsilon} for any epsilon > 0.


The problem is that each algorithm likely has a constant factor much larger than the previous. This does not matter for the runtime complexity of any individual algorithm but when you string them together the time complexity explodes.


I'm pretty confident the construction above does not depend on any bound on the growth rate of C_i (or D_i).


I guess this would be true even if the correct complexity was say,

O(n^2 * ln(n)).

Waxing theoretically, there is some integer-valued function f such that optimal matrix multiplication is

Ω(n^2 * f(n)).

We can ask whether f is O(1). I think your statement is equivalent to saying that f is o(n^ε) for any ε > 0.


> I guess this would be true even if the correct complexity was say, O(n^2 * ln(n))

Yes you're right it would be true. We can see this by just constructing a sequence of algorithms A_i that are implementations of the putative optimal O(n^2 * ln(n)) algorithm except artificially slowed down by ever smaller polynomial overheads. But the hypothesis I'm describing above holds that no such O(n^2 * ln(n)) algorithm exists.

> Waxing theoretically, there is some integer-valued function f such that optimal matrix multiplication is...

The hypothesis I'm describing is that no optimal algorithm exists. Instead, the claim is that there is an infinite sequence of ever better algorithms (specified by ever longer sets of instructions, and with ever larger constant overheads).


This is roughly my guess about P=NP as well, and I suspect the problem will be "solved" when someone figures out how to formalize that progression satisfactorily. Does this sound right, or does anyone know a solid counter-argument?


Knuth holds such a view:

> Don Knuth: As you say, I've come to believe that P = NP, namely that there does exist an integer M and an algorithm that will solve every n-bit problem belonging to the class NP in n^M elementary steps.

> Some of my reasoning is admittedly naïve: It's hard to believe that P ≠ NP and that so many brilliant people have failed to discover why. On the other hand if you imagine a number M that's finite but incredibly large—like say the number 10↑↑↑↑3 discussed in my paper on "coping with finiteness"—then there's a humongous number of possible algorithms that do n^M bitwise or addition or shift operations on n given bits, and it's really hard to believe that all of those algorithms fail.

> My main point, however, is that I don't believe that the equality P = NP will turn out to be helpful even if it is proved, because such a proof will almost surely be nonconstructive. Although I think M probably exists, I also think human beings will never know such a value. I even suspect that nobody will even know an upper bound on M.

https://www.informit.com/articles/article.aspx?p=2213858


That doesn't feel quite the same. My intuition is based on noticing that many instances of NP-complete problems can be solved quickly, and that adding more sophisticated heuristics can put more instances in that category. I'm just guessing that the set of actually-slow instances can be made arbitrarily small if the program is long enough, i.e. has enough special cases for classes of problem that are amenable to faster solutions. It has the same "long enough program" flavor, but I don't think the equality will be quite exact.


Strassen has shown that the bound is higher than n^2 using a tensor rank bound. [1] The exact exponent bound is usually called omega in the literature, and there are good estimates for the value of omega. It is also related to other quantities like the Grothendieck constant [2].

[1] http://www.thi.informatik.uni-frankfurt.de/~jukna/Strassen-V...

[2] https://arxiv.org/pdf/1711.04427.pdf


I'm no expert, but IMHO it feels like a problem that might have O(n^2 log(n)) time complexity. That would be more satisfying than some arbitrary non-integer exponent.


Get that n out of the exponent -- that is waay too slow. Maybe you meant O(n^3*log(2))? That exponent is definitely a constant.


I don't know if he edited his comment, but as it is currently written the log is not in the exponent under standard order of operations.

Also, putting a constant factor like log(2) inside big-O notation is meaningless.


Constant factors in exponents are _not_ meaningless. One of those cases where the rules-of-thumb bite you.


I didn't claim that. The log(2) term is multiplied, and that's implied by my use of "factor", which in this context is always understood to mean multiplicative factor.


You're correct. I think this comes down to confusion on the operator precedence. I read it as O(n^(3 log 2)) because otherwise it's kind of very obviously not correct (no way it's n^3, that's already known), but should have stated that, because according to usual notation/precedence it's not that.


We have better than that. I think that what was meant is O(n^2*log(n))


That isn't just what's meant, that's also what's written.


The way that it was written didn't disambiguate whether or not the log is part of the exponent. Putting the * in there disambiguated that.


It doesn't. The multiplication is implied and it would have the same precedence as explicitly putting * there. So either both are ambiguous or neither is (and it isn't, the poster was just mistaken).


I'm not sure why you think that it is implied.

I know that I have a lot of math background, and my brain read that log(n) as part of the exponent. I figure that if I read it that way, you have to expect that others will as well.


No matter how you read a^b log(c), the multiplication is implied. The original confusion is whether it's to be read as (a^b)log(c) or a^(b log(c)). Writing a^b*log(c) simply makes the implied multiplication explicit, but it does not change whether that should be read as (a^b)*log(c) or a^(b*log(c)) and thus does not resolve the original confusion.


There are related problems in computer science -- e.g. how quickly can you multiply a pair of 65536x65536 extended double precision matrices together. Note that each matrix consumes ~68GB of memory, so you need (probably) at least 200GB to store the two input matrices and the output matrix. So far, so good.

Now you want to see how fast it can really go if you are prepared to spend (say) $100k on the hardware (e.g. 64 times as many CPUs with some fancy interconnect fabric). How about spending $100M? It turns out that moving all this data around is also quite costly.....


>65536x65536 extended double precision matrices together

For perspective, that would be akin to multiplying every MAC address with every other MAC address upon the entire IPv4 range.


MAC addresses are 48 bits and unrelated to IPv4. Multiplying two 65536×65536 matrices naively in O(n^3) time would be about the same number of operations as processing each possible MAC address.


That's not a problem in C.S. That's a problem in engineering.


Until hardware can do everything free and instantly, making computations cheaper and faster will be a core problem in CS. Everyone enjoys when the hardware gives free gains though :)


C.S. is not concerned with the fastest speed that off-the-shelf hardware can multiply two matrices, measured in seconds. You must be thinking of Linus Tech Tips.


It is concerned with the lowest complexity method of doing so, which is directly related.


Tangentially, not directly. Algorithms become different beasts when implemented on real-world hardware.


I said they are directly related, not that they are equivalent. The entire reason programmers have the notion of time complexity is to understand how an algorithm's execution scales. The reason we drop constants and constant scale factors is because they typically don't make a big difference in real-world execution times. That's a little more than tangentially related.

Saying time complexity has nothing to do with hardware is like saying speed limits have nothing to do with cars. Hardware constraints are the entire reason we developed the notion of time complexity.

I'm honestly very confused as to how someone who is (I assume) engaged in programming could think the relationship is tangential.


You are arguing a different point, C.S. is not concerned with measuring performance in seconds. Do you agree or disagree?

Now that we are on the same page, of course the asymptotic complexity of a method matters, but some of the lowest complexity methods actually have huge constants and cannot be used until you get to scales that dwarf most usages. This means that the connection is tangential. The current method of matrix multiply used in all your software is NOT the lowest asymptotic complexity method. Look it up.


parallelism and data locality are also cs topics


Does anyone know how matrix multiplications are implemented on CPUs, GPUs, or ML ASICs? Are these optimizations used, or do they take too much circuitry, so that the n^3 method is still used?


It's typically still the O(N^3) method that's implemented in OpenBLAS, BLIS, cuBLAS, MKL, etc. There are some high performance implementations of Strassen's Algorithm with a fixed level of recursion that start to perform much better as the matrix size gets large (see the work from UT Austin's BLIS). From my understanding for the more complex algorithms the hidden constant simply grows too large to be worth it on conventional hardware.

As another poster commented, these are O(N^3) in work, not necessarily the wall clock time you see due to parallel speedup and cache effects, but will scale as O(N^3) once you're at a size past all of these. These implementations are highly tuned and optimized for hardware, much much more so than the naive 3 loop implementation. The predictable and 'easy' access patterns make the simple algorithm hard to beat.
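
For reference, here is roughly what the three-loop algorithm those libraries are optimizing looks like (a plain Python sketch of my own; the tuned versions add blocking for cache, vectorization, multithreading, etc., without changing the O(N^3) operation count):

    import numpy as np

    def naive_matmul(A, B):
        n, kdim = A.shape
        m = B.shape[1]
        C = np.zeros((n, m))
        # i-k-j loop order so the inner loop walks B (and C) row-wise,
        # the cache-friendly access pattern for row-major storage
        for i in range(n):
            for k in range(kdim):
                aik = A[i, k]
                for j in range(m):
                    C[i, j] += aik * B[k, j]
        return C

    A, B = np.random.rand(32, 32), np.random.rand(32, 32)
    assert np.allclose(naive_matmul(A, B), A @ B)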


I think its also important to mention that these results are for the multiplication of two general dense matrices and full floating point accuracy. If your matrix is structured (sparse, sparse in the Fourier domain, low rank, H, HSS, etc) you can usually exploit that structure to break up the multiplication and reduce the work complexity dramatically. (These rely on building blocks of the dense general matrix results)


Another issue of both practical and theoretical importance is whether these asymptotically faster algorithms are numerically stable. Skimming the article very quickly (and not having looked at any original sources), this doesn't seem to be addressed (yet).


Strassen is reasonably numerically stable (although not as good as the naive algorithm), and everything beyond Strassen is extremely unstable. This is due to a trick that many of these algorithms employ: you technically only need to do multiplication of 0-1 matrices to fully solve the matrix multiplication problem in theory, and so having an algorithm which computes the entries of the matrix with additive error 0.1 is sufficient to exactly solve the problem (just round entries to the nearest integer). As you can imagine, this means your algorithm can only give O(1) bits of precision unless you employ this reduction to the 0-1 case first.

To understand why this even happens, let's say we want to compute the expression A*B + B*A for matrices A and B. One way we can do this is to compute the products A*B and B*A naively: two multiplications are needed. A trickier way to do this is to introduce a parameter x: we will instead compute A*A and (x*A + B/x)*(x*A + B/x) = x^2 A*A + (A*B + B*A) + x^-2 B*B. Thus (x*A + B/x)*(x*A + B/x) - x^2 A*A = (A*B + B*A) + x^-2 B*B: if x is large enough the second term vanishes and we can employ the rounding trick from before. Now this still needed two multiplications, but here one of our multiplications was A*A. If later we needed to compute A*C + C*A in the algorithm, we could then do that in only 1 additional matrix multiplication by repeating the trick. A more sophisticated version of this algorithm underlies all known approaches that push the matrix multiplication exponent ω below Strassen's roughly 2.8.
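
A quick numpy toy (my own illustration, not code from any of the papers) showing both halves of the point above: the rounding trick recovers the exact 0-1 answer for moderate x, and how the precision collapses once x gets large enough that the cancellation in (x*A + B/x)*(x*A + B/x) - x^2 A*A eats the floating-point digits:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    A = rng.integers(0, 2, (n, n)).astype(float)   # 0-1 matrices, as in the reduction above
    B = rng.integers(0, 2, (n, n)).astype(float)

    exact = A @ B + B @ A      # reference answer, computed with two products
    AA = A @ A                 # the product that the trick gets to reuse

    for x in [1e2, 1e4, 1e8]:
        M = x * A + B / x
        approx = M @ M - x**2 * AA          # = A@B + B@A + (B@B) / x**2
        err = np.max(np.abs(approx - exact))
        ok = np.array_equal(np.rint(approx), exact)
        print(f"x = {x:g}: max error {err:.1e}, rounding recovers exact answer: {ok}")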


Thank you for the very nice explanation!


At least for Strassen, the result is slightly less stable (it increases the condition number by a constant factor). I think Higham shows that any sub-cubic algorithm must have some conditioning issues, but that in practice they're not that bad for most matrices.


Note that GPUs are almost always multiplying 4x4 matrices. Sure, they multiply many matrices together, but each is 4x4. The 4^3 complexity isn't at all an issue (64 steps) and the constant time for some of the n^2.x methods is high.


Presumably this only applies to GPUs being used for graphics, not GPUs being used for ML?


This is not true. What is your source?


I guess OP meant that each tensor core in the latest NVIDIA GPUs fundamentally performs 4x4 matrix multiplication [0].

[0] https://nla-group.org/2020/07/21/numerical-behaviour-of-tens...


That's also not true. They do 4x4, 8x8, and 16x16.


Which is fine for the point being made here. There's no issue with n^3 for small matrices.


For GPUs, it's actually much faster than O(n^3) because computing each entry in the result matrix is independent. Hence, the problem is embarrassingly parallel in a way.

I don't know how to use O() notation for GPUs, but it should be something like O(n^2/k^2) where k is the tile size [0].

Memory bandwidth also becomes a bottleneck here, so there is a lot of optimization done on how to transfer data from CPU to GPU and then within the GPU to efficiently access the matrices.

[0]https://docs.nvidia.com/deeplearning/performance/dl-performa...


> For GPUs, it's actually much faster than O(n^3) because computing each entry in the result matrix is independent. Hence, the problem is embarrassingly parallel in a way.

That's not how you do big-O with parallel systems.

When talking about a parallel algorithm, you often perform two different big-O calculations:

* Breadth: "If a single processor" executed the parallel algorithm, how long would it take? (If the best sequential algorithm and the best parallel algorithm are both within a constant factor of breadth, it's called "work-efficient".) Also called "Total Work".

* Depth: "If an infinite number of processors" executed the parallel algorithm, how long would it take? Also called "Length of the longest dependency chain".

A naive matrix-multiplication would be O(n^3) breadth. I don't know the depth calculation unfortunately...

---------

For example, Bitonic Sort is one of the best GPU-sorting algorithms, but it's O(n * log^2(n)) Breadth... asymptotically slower than sequential's O(n*log(n)) time.

But Bitonic Sort is often used because its Depth is O(log^2(n)). Which means that "with enough processors", you approach O(log^2(n)) sorting time, which is pretty amazing.

Note: "GPU Mergepath" is pretty damn amazing if you've never seen it before. O(n) breadth to perform a merge operation (part of merge-sort), so for large arrays, Mergesort wins as long as you use the GPU Mergepath algorithm to perform the individual merge steps.

But if you have more processors than data (lol, it happens: Vega64 supports 163,840 threads of execution: Occupancy 10 x 4096 physical cores x innate hardware parallelism over 4 clock ticks), Bitonic sort is an obvious choice.
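
In case it helps, here is a small recursive Python sketch of bitonic sort (my own toy version, assuming the input length is a power of two). Every compare/swap inside one pass of bitonic_merge is independent of the others, which is where the O(log^2 n) depth comes from when each pass runs in parallel:

    def bitonic_merge(a, ascending):
        # 'a' is bitonic; after one compare/swap pass each half is bitonic and
        # every element of one half is <= (or >=) every element of the other
        if len(a) <= 1:
            return a
        mid = len(a) // 2
        for i in range(mid):                      # these compare/swaps are all independent
            if (a[i] > a[i + mid]) == ascending:
                a[i], a[i + mid] = a[i + mid], a[i]
        return bitonic_merge(a[:mid], ascending) + bitonic_merge(a[mid:], ascending)

    def bitonic_sort(a, ascending=True):
        if len(a) <= 1:
            return a
        mid = len(a) // 2
        left = bitonic_sort(a[:mid], True)        # ascending half
        right = bitonic_sort(a[mid:], False)      # descending half -> concatenation is bitonic
        return bitonic_merge(left + right, ascending)

    data = [5, 3, 8, 1, 9, 2, 7, 6]
    assert bitonic_sort(data) == sorted(data)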


For depth:

If I have at least n^2 processors, I can send a row and column to each processor, which can compute the inner product in linear time. So O(n^2) time to coordinate the work, and O(n) to actually do it.


Hmmmm. Your O(n^2) step seems unnecessary to me. Therefore: the answer is O(n) depth for the naive case.

-----------

Processors are generally assumed to be self-numbered. Ex: processor #100 knows it's processor #100.

    // each of the n^2 threads handles one output element
    xloc = (threadIdx.x % matrix_width);
    yloc = (threadIdx.x / matrix_width);

    // O(n) serial inner product of row yloc and column xloc
    performMatrixCalc(xloc,yloc);
Therefore, only O(n) time depth apparently. O(log(n)) to broadcast matrix_width to all processors, which seems to be the only communication needed to organize the calculation.


Nice, thanks!


O notation doesn't apply to hardware in that way. The naive algorithm is still O(n^3) on a GPU. Remember that O notation is concerned with behavior at the limits and ignores constant factors. Parallel hardware can only provide a constant speedup for problems of arbitrary size, hence it does not show up in O notation.


I want to clarify that your first sentence likely means something like "O notation is insensitive to hardware in the way you're suggesting." Not "you can't apply O notation to GPUs".


Yes correct. Edited to clarify. Another way of phrasing it -- O notation is insensitive to the specific implementation of a computing model.


On hardware, for fixed-size problems, O notation applies in the form of circuit size, which maps, unsurprisingly, to actual circuit size.


r/confidentlyincorrect

complexity is wrt a particular model of computation - a turing machine. turing machines are sequential (NTMs aren't). run time on a parallel model is not necessarily a constant off from run time on a sequential model.


Snark aside, GP is asking about GPUs. GPUs are not nondeterministic Turing machines.


like the other comment alludes to it's not constant it's a function of block size and the size of the matrices

http://www.ijcee.org/vol9/949-E1621.pdf

you edited your comment. it said what's the speed up on GPUs (which i provided).

>GPUs are not nondeterministic Turing machines

for small problem size that's exactly what they are (or any multicore cpu for that matter)

>The other is to imagine that the machine "branches" into many copies, each of which follows one of the possible transitions. Whereas a DTM has a single "computation path" that it follows, an NTM has a "computation tree".

https://en.wikipedia.org/wiki/Nondeterministic_Turing_machin...


An NTM is one which can run arbitrarily many branches in parallel. So parallel processors are not NTMs, since they can only run a fixed number of branches in parallel.

It's true that for small problems they are indistinguishable, but in the context of discussing big O notation that is irrelevant.

For the purposes of computing asymptotic time complexity, whether the algorithm is run on a 1 core system or an M core system is usually irrelevant.


> for small problem size that's exactly what they are

O notation does not apply to small problems. It is strictly concerned with asymptotic behavior.


you're completely missing the point (again). the point is that complexities (even asymptotic) of sequential algorithms don't apply to parallelizations of those algorithms.


Only for infinite number of parallelized processors?


Which, alas, is how we get kids at Harvard chasing a couple decimal points on a dead-end path instead of doing something useful. I'm actually a bit perplexed as to why anyone thinks extending the laser method to improve the fourth decimal point is worth the effort. No one will ever implement this thing, and no one believes it actually brings us closer to an actual solution to the exponent 2 conjecture. So it seems entirely like a way for PhD students to cut their teeth, perhaps? But ultimately not much more helpful than finding the next digit of pi.


> No one will ever implement this thing, and no one believes it actually brings us closer to an actual solution to the exponent 2 conjecture.

It brings us closer more than anything I've done, that's for sure.

I agree with your sense of taste about which problems are personally interesting, and likely to be significant in my lifetime. But I still think it's cool that there are theoretical developments like this. Refining bounds is a cool way to see theoretical progress quantitatively.

I think also it's more like discovering what pi is than trying to find the next digit. We know a lot more about pi than we do about the lowest achievable exponent.


What's needed are new methods, though. I saw some really interesting work ten years ago using group Fourier transforms to attack the problem. I think it didn't end up working out, but was fundamentally more interesting than another extension of the laser approach.

One of the major failure modes of academia is that students are generally not very good at picking problems, and can end up following their advisors down a blind alley. The underlying question here is absolutely worthwhile, but spending a year extending laser will have no impact. It's like you're trying to dig to the center of the earth and someone hands you a shovel: do you take it and start digging, or walk away and try to find/invent an oil drill?


Faster, larger, more accurate models, off the top of my head.


Laser improvements are galactic algorithms, though, and don't give you that at all.


And even if they were practical to run, the time spent implementing this improvement probably wouldn't ever pay off.


Just because you can do a finite number of operations at the same time, it doesn't mean you've changed O().


> However, the number of additions required to add those sets of matrices is much closer together. Generally, the number of additions is equal to the number of entries in the matrix, so four for the two-by-two matrices and 16 for the four-by-four matrices.

Maybe my linear algebra is REALLY rusty, but that doesn't seem right to me.

For two 4x4 matrices, each element in the output matrix will be the sum of four products. So each of the 16 output elements will require 3 additions, and you'll need a total of 48 additions.

I'm not sure where they're getting "16 additions" from.

To multiply two nxn matrices according to the textbook method, I think you'll need n^3 multiplications and n^3-n^2 additions.

I don't doubt that the multiplication time dominates the algorithm. Maybe this is just a nitpick.
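
A quick sanity check of those counts (my own throwaway snippet), matching the 48 additions worked out above:

    def textbook_matmul_ops(n):
        # each of the n^2 output entries is a dot product of length n:
        # n multiplications and n - 1 additions
        mults = n ** 3
        adds = n ** 2 * (n - 1)
        return mults, adds

    print(textbook_matmul_ops(2))   # (8, 4)
    print(textbook_matmul_ops(4))   # (64, 48)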


They're counting the additions needed to add the matrices, not to multiply them. The point being that matrix addition is order n^2, so even if you need to do a large number of matrix additions in some intermediate calculation, they don't hurt you as n gets large if you're trying to speed up matrix multiplication.


That doesn't seem to be what they are saying. Here's the first sentence of the paragraph that I had originally quoted:

> As matrices grow larger, the number of multiplications needed to find their product increases much faster than the number of additions.

The whole paragraph is talking about the multiplications and additions needed to find the product of two matrices.


It could refer to matrix multiplication and matrix addition of smaller matrices during the algorithm (IIRC most fast multiplication algorithms start by splitting the matrix into pieces).


> “Exponent two” refers to the ideal speed — in terms of number of steps required — of performing one of the most fundamental operations in math: matrix multiplication. If exponent two is achievable, then it’s possible to carry out matrix multiplication as fast as physically possible.

Can anyone clarify what "as physically possible" means here? Are there physical systems that perform matrix multiplication in O(n^2) ? (I feel silly even asking that about a physical, non-computation system)

I remember there were applications of optical computation, including a startup that wanted to accelerate deep network inference through optical computation: they had a way to do matrix multiplication using a projector (for the data matrix) and a 3D printed lens (for the network weights).


I don’t think they’re actually intending to refer to physics here—they could have said “as fast as theoretically possible” instead.

A square matrix with n rows and columns has n^2 entries, so if you need to look at all the entries in two matrices to compute their product, then you need to do O(n^2) work just to look at all the relevant entries.

The only way you could get around this is if you know many entries are 0, or identical, or have some other structure such that you don’t need to look at them separately.


I don't think they're talking about total operations here, as pointed out in the article this is just about the number of multiplications.

Every matrix multiplication can be boiled down to multiplying rows by columns. This is definitionally required for matrix multiplication, so that is the lower bound for how many multiplications are required. One multiplication for every final element in the n x n matrix, or n^2.


They are talking about total operations, but since big O notation ignores constant factors, and they explicitly ignore the addition steps (because for large n multiplication dominates) it is, in effect, only looking at the multiplications.

To answer the GP, the minimum amount of work necessary is to write a number for each element in the array, which scales by n^2 since there are n^2 elements. It is of course only possible to beat this if you can make assumptions like “the diagonal of my matrix is nonzero and everything else is zero”, but that’s not general matrix multiplication anymore, that’s a specialized sparse matrix routine. For a dense, “normal” matrix, O(n^2) is the absolute best we could hope for.


We're saying approximately the same thing, though I'd argue that your phrasing is slightly more confusing to people that aren't super familiar with Big O notation (and therefore don't exactly understand what is meant by "ignores constant factors" and similar).

Reads from the original two matrices and writes to the final matrix aren't at all the crux of the problem, so I don't think it makes sense to focus on either of those things in your explanation. The goal of the problem is to reduce the number of multiplications which are required to find the values that you then write (of which there will always be n^2 writes). There will also be 2n^2 reads, but likewise that isn't relevant because it's not what you're trying to optimize.

The relevant idea here is that the naive method of multiplying a matrix involves n multiplications for every row-column pair, for a total time complexity of n^3 (n rows, n cols, n multiplications). The goal here is to bring down that third number, from n down to something >= 1, with the holy grail obviously being 1 itself.


> with the holy grail obviously being 1 itself.

It's not at all obvious to me why the number of multiplications must be at least n². Can you prove that? (To be clear, n² multiplications is most certainly the lower bound, but I can't come up with a good argument as to why.)

This is why the trivial "you must read n² entries from each input matrix and write n² entries in the end" is helpful: It's an extremely obvious lower bound on the asymptotic complexity. Sure, it's not what we're trying to reduce (and getting < n² multiplications would be a ridiculous result), but anyone can see that it's true.


If matrix A has a single nonzero column (ai) in the first column and B has a single nonzero row (bj) in the first row, then A * B computes all pairwise products ai * bj.

To be honest I’m not sure whether that really proves it. Does computing a single multiplication require a multiplication?


By definition, a matrix multiplication takes rows and multiplies them by columns. For a pair of square matrices, where there are n rows in A and n columns in B, you (by definition) need to multiply them by each other at least once, otherwise it's not matrix multiplication, it's matrix... something else.


> By definition, a matrix multiplication takes rows and multiplies them by columns.

> you (by definition) need to multiply them by each other at least once

Why does multiplying a row with a column require at least one multiplication? "By definition" it takes n multiplications (!), but in the context of matrix multiplication we can do it with less - possibly with O(1) multiplications. And it seems neither you nor me can come up with a good argument why it can't take less, despite discussing it for a while.

Which is why "you need to write n² entries" is used as an argument for the O(n²) bound, even if it has nothing to do with the number of multiplications.


I don't think the sentence is very good.

You need n^2 time just to read the two matrices, so I always thought of that as a natural lower bound. (Of course, we're talking about multiplications eventually here, so may not be exactly what they meant...)


An (n x n) square systolic array can perform matrix multiplication in time O(n) (not counting I/O, which takes O(n^2) without more specialty hardware) -- this is how TPUs work. Simulate the same algorithm on a single processor and it takes O(n^3). Use the same hardware to compute larger matrix products and it only gives a constant factor speedup.
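
A toy cycle-by-cycle model of that (my own sketch, not how any particular TPU is coded): with an n x n grid of multiply-accumulate cells, row i of A entering from the left delayed by i cycles and column j of B entering from the top delayed by j cycles, cell (i, j) sees exactly the operand pair with index k = t - i - j at cycle t, and everything finishes in 3n - 2 cycles, i.e. O(n) time:

    import numpy as np

    def systolic_matmul(A, B):
        n = A.shape[0]
        C = np.zeros((n, n))
        for t in range(3 * n - 2):          # one iteration = one hardware cycle
            for i in range(n):              # in hardware, all n^2 cells work in parallel
                for j in range(n):
                    k = t - i - j           # operand index reaching cell (i, j) this cycle
                    if 0 <= k < n:
                        C[i, j] += A[i, k] * B[k, j]
        return C

    A, B = np.random.rand(8, 8), np.random.rand(8, 8)
    assert np.allclose(systolic_matmul(A, B), A @ B)

Simulated on one processor it is still O(n^3) work; the O(n) wall-clock time relies on the array growing with the problem, which is the point about fixed hardware only giving a constant-factor speedup.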


The other answers are good (it takes O(n^2) time just to read two matrices, so you can't possibly multiply them more quickly than that), but they leave out a critical part of that analysis that I want to highlight just in case it would have otherwise bitten you in the future:

You _do_ actually have to read all the entries in both matrices -- if you know only K-1 out of the K entries then in general you still can't uniquely compute the output of the matrix multiplication. Consequently, any algorithm relying on only some subset of the entries is at best an approximation.

If you have more information (e.g., that one or both of the matrices is sufficiently sparse) then you can potentially do better.


I think if you could somehow start computing the resultant matrix elements as soon as you read a row/column from the input ones, you could reach their "physically possible" limit.

I'm only a couch expert on computer architecture, but could a large enough systolic array be used to achieve their "physically possible" limit? [0]

New CUDA GPUs have been coming with tensor cores that are just systolic arrays. Google's TPUs are the same.

Could someone with more knowledge on systolic arrays comment on whether a large systolic array can achieve this?

[0] https://medium.com/intuitionmachine/googles-ai-processor-is-...


The answer is no but sort of.

You cannot reduce the number of operations by waving a magic architecture wand. Hence the no. But you can do operations in parallel rather than in series. Same number of operations, but it takes less time.

So if a systolic array is built for the same scale as your problem, your bottleneck does indeed become how quickly you can send data to/from that dedicated system.


O(N^2) is the time you need to write down the correct answer, provided that it takes 0 time to calculate each cell in the resulting matrix


Provided it takes O(1) time not 0.


It's badly phrased. What they mean is that we know that it will take at least n^2 operations, and that it would be nice to find an algorithm that is actually that fast.


Related: a separate paper from a few weeks ago on using a different method for solving linear equations: https://www.quantamagazine.org/new-algorithm-breaks-speed-li... It seems likely that if you can use a different method for solving linear equations, some matrix multiplication can also be improved using a different method.


The article doesn't specify, and I don't have the prior knowledge to understand the paper; Is there a coded implementation of this in some language or is it purely a mathematical description of how theoretically multiplication can always be done with slightly fewer multiplications?


Strassen's algorithm is still the fastest to be widely implemented. The Coppersmith-Winograd variants that have achieved better asymptotic complexity are all galactic algorithms: https://en.wikipedia.org/wiki/Galactic_algorithm.


The last time I paid attention to this stuff ten years ago, the constants needed to reach these theoretical bounds were 'galactic': constant terms such that you'd need matrices with n in the many billions to start to be competitive with normal matmul or Strassen.


Strassen's is sometimes competitive in the thousands!

https://www.cs.utexas.edu/users/flame/pubs/FLAWN79.pdf

You're right though afaik all the other algorithms are still 'galactic' and completely impractical


Crap. I should have refreshed the page. Didn't see you already posted what I did.


I think the way it'll go is: it'll be implemented in Fortran, BLAS and so on until it percolates to numpy / CUDA so mortals like us can use it.


Agree. I don't think we'll notice, but the next gen of CUDA will be better because of it.


These particular algorithms tend to have large constant factors, so you need astronomically large matrices before they become faster than algorithms with worse exponents but better constant factors.


Hopefully the matmul implementation can check matrix size and pick the algorithm to use from that. I think some may already do this.


Is anyone looking into the memory accesses for the operation?

Fifty years ago multiplication was vastly slower than RAM access. Nowadays it's practically free compared to even getting data out of cache ...


Nobody is remotely looking at running these algorithms in the real world. These are very much asymptotic complexities, and to draw benefit from even the 1990 version you'd have to literally use galactic input sizes. None of this is about making the matrix multiplications on your PC faster.

https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_a...


Wikipedia says Strassen's algorithm is useful for n > 100 and is implemented in BLAS [0]. AFAIK Coppersmith-Winograd is more of a theoretical advance.

Edit: ah, we're probably talking about the same thing. Strassen was 1970.

[0] https://en.wikipedia.org/wiki/Matrix_multiplication_algorith...


Shouldn't that first graph have the axes the other way round? Time (date) should be on the x-axis, and the measurement on the y-axis? As it is, it looks terribly confusing to me.


If n^2 is possible then it would be possible for a given n, say 4. Can't it be shown by example or exhaustive search that this is or isn't possible for 4?


O(n^2) is a scaling factor: what happens as n->infinity.

An optimal arrangement for n==4 already exists. Both AMD MI100 and NVidia Volta / Ampere perform 4x4 FP16 matrix multiplication in a single assembly language statement.

An 8x8 matrix is just an arrangement of four 4x4 matrices (!!!!). As such, you can define an 8x8 matrix multiplication as a 4x4 matrix multiplication over 4x4 matrices. This recursive relationship is often key to how they manage to get faster and faster, getting to O(n^2.8), O(n^2.6), etc.
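
A tiny Python/numpy illustration of that recursive relationship (my own sketch; matmul_4x4 here just stands in for whatever 4x4 primitive the hardware exposes):

    import numpy as np

    def matmul_4x4(A, B):
        return A @ B            # stand-in for the hardware's 4x4 instruction

    def matmul_8x8(A, B):
        # an 8x8 product is eight 4x4 block products plus block additions
        C = np.zeros((8, 8))
        for i in range(2):
            for j in range(2):
                for k in range(2):
                    C[4*i:4*(i+1), 4*j:4*(j+1)] += matmul_4x4(
                        A[4*i:4*(i+1), 4*k:4*(k+1)],
                        B[4*k:4*(k+1), 4*j:4*(j+1)])
        return C

    A, B = np.random.rand(8, 8), np.random.rand(8, 8)
    assert np.allclose(matmul_8x8(A, B), A @ B)

Strassen-style algorithms do the same kind of block decomposition but with fewer than eight block products.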


Since we're looking for a O(n^2) algorithm, not specifically n^2 operations, this actually doesn't help. You could for example need 1000n^2+1000000 operations, which would look terrible if you only look at n=4, but you're golden if n=10000000 (and even if there's no practical n where you win, it's still theoretically very interesting).


I am doing Andrew Ng’s ML Coursera course with Matlab.

I’ve now got the crazy desire to see matrices and vectors built into every language in a clean and succinct way.


One of the unexpected things is Dartmouth BASIC from 1964 had matrix primitives, despite being an otherwise primitive language. MS never took up the feature, but I know Wang BASIC did on their 2200 family of machines.

For example:

    10 DIM A(3,3),B(3,3),C(3,3)
    20 MAT READ A
    30 DATA 1,-5,7, .45,23,9, -5,-1,0
    40 MAT B=A:REM assignment
    50 MAT B=ZER:REM zero array
    60 MAT B=CON:REM all 1s
    70 MAT B=IDN:REM identity
    80 MAT B=TRN(A):REM transpose
    90 MAT B=(2)*A:REM scale
    100 MAT B=INV(A),D:REM invert array, determinant in scalar D
    110 MAT C=A*B:REM matrix multiplication
    120 MAT PRINT C:REM prints something close to an IDN array


The first version of Dartmouth BASIC ran on a system with a whopping 20kB of memory (8k words at 20 bits each; see https://en.wikipedia.org/wiki/GE-200_series)

The GE-635 (https://en.wikipedia.org/wiki/GE-600_series) that it ran on most of its life could have 4 CPUs, had protected memory and floating point hardware, and supported at least 200 kilowords of memory (http://www.bitsavers.org/pdf/ge/GE-6xx/CPB-371A_GE-635_Syste...)

At 36 bits/word, that’s 900 kilobytes.

The first version of Microsoft BASIC (https://en.wikipedia.org/wiki/Altair_BASIC) ran in 4 kilobytes (leaving about ¾ kilobytes for programs and data), so it's understandable that it lacked some features (it didn't even have else clauses in if statements, had SIN but not COS or TAN, didn't support integer or string variables, etc.)


No question it would not make sense in that original 4K or even 8K BASIC, but MS produced BASICs with much larger memory budget and the matrix ops never appeared (as far as I know).


They added tons of stuff that made business sense to them. Apparently, matrices as built-in types didn’t. Nowadays, the language is powerful enough to implement them as a library (e.g. https://www.freevbcode.com/ShowCode.asp?ID=3682. Operator overloading became possible at some time, too. https://docs.microsoft.com/en-us/dotnet/visual-basic/program...)

It’s not as if they don’t know matrices exist. Excel has matrix operations (https://support.microsoft.com/en-us/office/mmult-function-40...)


Believe it or not, engineers were heavy users of BASIC in the first few years. And FORTRAN was created with the notion of matrix multiplication, so it was just natural that BASIC implementations supported matrix multiplication.


My grandfather was a construction engineer. I remember visiting his office around 1990 just before he retired.

His only computer was a Commodore 64 and he had written all the software himself in BASIC over the years, just translating a lifetime of professional experience into these little interactive programs on floppies.

Maybe the modern equivalent would be to use Excel.


Yeah, this is something pretty useful. I also add to even go beyond and go full relational, that is what I'm trying with https://tablam.org


Convenient matrix algebra is one of the selling points of Julia.


I'm really hopeful that Julia will continue to gain adoption.

That said, I would love to have good data science library support (bindings to stuff like BLAS/LAPACK, Torch, Stan, etc) in Idris.

Imagine vector and matrix dimensions checked at compile time. Probabilities guaranteed to be between 0 and 1, checked at compile time. Compile-time enforcement that you are handling "nil/null" data inputs correctly when reading data from a database or CSV file. Writing a web scraper with linear types to help catch bugs & performance leaks. Maybe even proving theorems about the math code you're writing ("is this a valid way to express X"?).

Maybe not ideal for rapid exploratory model development, but it'd be pretty damn cool for writing data science tooling/libraries and "production" ML code in my opinion.


Not that it has bindings to other tools, but it sounds like Dex[1] would be relevant to your interests!

[1]: https://github.com/google-research/dex-lang


Looking forward to Dependent Haskell for linear algebra, and seeing what existing stuff can be retrofit.


It's a main reason why Fortran still has such a hold on the SciComp/HPC community.


Fortran is also optimized and debugged over decades of testing.


Yeah those algorithms are numerically tested, haven't changed in years, and really fast. You can certainly get to convenient matrix manipulation in C++, but then your compile times will increase significantly.


Can the fortran implementations do auto-vectorization on simd hardware? or map to gpu using cuda-like library?

not rhetorical, genuine question.



Fortran is a fully vectorized and parallelized language at all levels, from instructions to shared and distributed parallelism (corresponding to one-sided MPI communications). This has been the case for at least 3 decades if not more. There is mature GPU support for Fortran. NVIDIA and PGI are currently implementing the standard parallel features of Fortran via GPUs.


Yes, in fact all the original research for those things was done in Fortran compilers.


You can have a fast implementation and fast compilation time with DLang


Performance might be worse in c++


The base type in R is a vector. A number such as 3.14 is simply a vector of length 1. So functions are vectorized by default (as long as you don't do any operations that are explicitly not vectorizable).

Once you get your head around everything being a vector, R can be a lot of fun.

I love Julia but I wish I didn't need to explicitly broadcast functions across a vector.


I was annoyed by that at first, but eventually came to appreciate how Julia's semantics for when broadcasting is or is not required for operations on scalars, vectors, and matrices map to the underlying math. I.e., for an N x N matrix `A`

  5 * A
does not require broadcasting because multiplication of a matrix by a scalar is mathematically well-defined, whereas

  5 + A
(assuming you want element-wise addition) does require broadcasting, i.e.,

  5 .+ A
because addition of a scalar and a matrix is not mathematically well-defined.


I do understand why they do it, and numpy can play a bit fast and loose, which can be confusing, but I miss reading elegant vector-by-default R code.

The Julia approach also has some drawbacks. I was writing some code the other day and forgot to broadcast exp() across a matrix. exp(A) and exp.(A) give very different results, but it's not immediately obvious looking at the code, especially when the code is a mixture of broadcasted and native matrix function calls with dots peppered all over the place.


Well, again I'd say that's just like the above examples. The exponential of a (square) matrix is indeed a very well defined object [1] with semantics perfectly consistent with the exponential of a scalar. On the other hand, broadcasting exp over a matrix is a very different operation, which doesn't really have any meaning in a linear algebra sense.

I understand it's an easy mistake to make when you're used to R or Numpy, but I'd say it's their fault you're making this mistake, not Julia's.

[1] https://en.wikipedia.org/wiki/Matrix_exponential


If you re-read my comment you will see I am not criticizing the language, it makes mathematical sense.

I was pointing out one downside of the broadcast syntax, is that a typo can lead to non-obvious errors in calculations. R avoids this with a expm() function.

Don't get me wrong, I love using Julia, it makes turning maths into code easy, and this is not even close to a deal breaker.


Sure, but my point is that this cuts both ways.

E.g. I could just as easily write

"one downside of autovectorization, is that a typo can lead to non-obvious errors in calculations. When I want the exponential of a matrix, I might accidentally write exp(M) and get unwanted vectorization. This is a typo that can lead to non-obvious errors in calculations. Julia avoids this with a exp.(M) syntax."

I think the potential for typos is symmetric here between Julia and R. I think what's asymmetric though is that Julia's broadcasting is more extensible, less ad hoc and more general than R's auto-vectorization, since it can be done over arbitrary containers and you can control precisely what calls broadcast.

______________________________

Sorry if this comes off as confrontational, that's not my intention. I just think this is a very interesting and fruitful corner of language design.


Yes, have you tried Julia? It is pretty much how you describe. There are other things that I don't like about that language, but the matrix part is really great


Just finished the course and couldn't agree more. Everything seemingly is faster with matrices!


Beyond matrices and vectors, it would be nice to have generalized (typed) tensors. What you can do with matrices alone is really somewhat limited.


From a data structure perspective: You mean like an n-dimensional array? That’s what Matlab does. Matrices (2D arrays) are just a special case of array in Matlab.


I suspect by the usage of the phrase "typed" they mean something like that, but where the axes/dimensions/whatever you call them are typed - ie. you can't naively take a product along the time dimension of A with the batch dimension of B.


xarray seems to fit the bill. As long as you're ok with runtime axis/dimension/type checking.

https://xarray.pydata.org/en/stable/


I've explored this space a bit, and I don't think the perfect solution is here yet. In my view, static checking should be the goal. It's harder to iterate quickly when your network dynamically fails 30 layers later due to some edge case where you forgot to take a transpose. Definitely much better than it silently succeeding if the dimensions happen to be the correct size though!

https://github.com/deepmind/tensor_annotations seems promising. I've developed a mypy plugin for pytorch that does similarly off of the "Named Tensor" dynamic feature (which isn't well supported yet), but haven't released it yet.

I'm also excited by the ways in which including this in some sort of "first-class" way could make tensor semantics simpler, which I'm assuming xarray allows? Many of the operations enabled by http://nlp.seas.harvard.edu/NamedTensor are quite nice, and similar ideas can let you write more generic code (i.e. code that works automatically on tensors with and without a batch dimension).


Have you looked at languages like J or APL?


> no, no, no, no, no

Confirmed Strassen thinks the lower bound is n^2.32192809489.


log2(5)???


How would one calculate the $$$ value of exp2 multiplication? Eg if one finds a way, would it be worth keeping it a secret and creating a business around it?


First of all, you couldn't. The "fastest" multiplication algorithms for Matrix multiplication, which are roughly O(n^2.3), aren't used in practice. They're "galactic algorithms" (a term I just now learned), meaning that the size the problem would need to be for them to overtake the more common O(n^2.8) algorithm is too large for today's computers. O(n^2) might be theoretically valuable but practically useless.

Second of all, algorithms aren't patentable. The only way to keep it exclusive would be to keep it secret. I can't think of any examples of a private company making a major scientific or mathematical breakthrough and then keeping it secret long enough to profit off of it.


> Second of all, algorithms aren't patentable.

Algorithms absolutely are patentable because they are computing "machines." It's mathematical equations that aren't patentable, because they're closer to laws of nature.


> algorithms aren't patentable

I wish. The RSA cryptosystem was patented and under license until 2000 in the US, even though the math wasn't secret.


in the time spent improving matrix multiplication by a tiny incremental amount, many gigawatt hours have been spent making existing CPUs, GPUs, and TPUs do matrix multiplication using far less efficient algorithms. It's unclear this work would really be important in production computing (I understand the theoretical desire) compared to time spent making other things work better.


> It's unclear this work would really be important in production computing

It most certainly won't [0]. But neither will something like 99% of mathematical research right now.

[0]: See first example in https://en.wikipedia.org/wiki/Galactic_algorithm


I was under the impression the Strassen method was rarely used because it has a number of drawbacks: it's harder to parallelize, uses more memory, has bigger constants, and is less stable.

It's not really designed to be optimal for the current hardware, but just to lower the number of steps. It could be more useful in the future though.


Are there any non-galactic (practical) algorithms faster than Strassen (for non-small dense matrices)?


quanta magazine has been the best find for me this year!


Man, Quanta magazine is a gift. I love that there is a publication with a significantly higher reading level assumed than most pop science content, but not written for industry insiders, it’s just great.

I pitched them a paper magazine a few years ago, because I wish I could leave these around for my kids to flip through, but no joy yet..


You now mentioned something important: older generations have left behind journals, magazines, books, all kinds of documents that represent their generation. When our generation is gone, there will be very little physical traces left. Someone will have to search archive.org or similar to find anything.


That said, archive.org is accessible to anyone with an internet connection. This is good for those who don't have access to private libraries or family that have science magazine subscriptions.

Future generations will have a different experience than those in the past. Theirs will be slightly more fair


It's the function of historians to do the archeology and summarise for a broader audience.


Maybe, but one of the things that frustrates me is that I know picked up a lot of things because of the books my dad had on the shelves for example. Of course he showed me many of them, but I discovered many more myself because they were there in front of me.

My Kindle books etc. are not visible to my son in the same way. Spatial, in your face, visibility matters in transferring culture.


But you may be slightly sad that a newer generation, growing with shelves full of books right in front of their eyes, is never seen picking up and leafing through a single one of them. They were into some children's books on their way to being teenagers, but now they read and watch videos on their phones, and only read on paper for school.

Then you get over it. They're whip smart and doing things their way, trying to force them into yours probably would not work anyway.

Mostly over it.


It's not universal. I recently discovered that some weird behaviors in my kid were because he was emulating Richard Feynman, having discovered Surely You're Joking on the basement bookshelf.


The thing is, it's not about forcing it, but about creating the opportunity. Maybe it wouldn't happen, but it certainly won't happen if the books aren't there


> It's the function of historians to do the archeology and summarise for a broader audience

Those summaries will be written in a similarly ephemeral format. A book in a library can sit there and be rediscovered in a hundred years. A scroll or an engraving in a thousand. But we create things that have evaporated in a mere decade.


I’ve been thinking of backing up a few dozen megabytes of my most important files in like QR codes or maybe something simpler like dual-parity data matrix with instructions in plain text and printed out FORTRAN code. Printed on long lived (waterproof?) paper with a laserjet. Along with plain text printouts of text and pictures.

Even thought of etching that on stainless plates or something.

Could be scanned with a smartphone or even decoded by hand (include an ASCII lookup table).


Maybe, but most people won’t have the equivalent of boxes in the attic containing their or their family memorabilia


We may be a lot like the much older generations that left behind only a handful of prized possessions.

I have a wedding photograph printed on non-biodegradable material (not intentionally for preservation, it's decorative), and perhaps that will eventually be the only tangible evidence of me in the same way that wedding photographs are all I've seen of some relatives born before 1900.


How many generations are even left before we don't have anymore humans?

I'd wager we're 20 - 30 years out from regenerative cloning, and 50 - 75 out from reproductive cloning. We could start that process today if we weren't so squeamish.

ML is coming to a head, and we'll potentially have machines that outclass us in basic tasks in 25 - 50 years. Within 200, I'd say the human experiment is at an end.

What then? Either we evolve along with our machines, we become one with the machines, or the machines take over.

We're living in the last stages of the human species right now, and not many people think to contemplate that and integrate it into their worldview.

Climate, space exploration, etc. is all incredibly human-centric. Fighting for resources, defining legacies, ... I mean, it's what we as humans do. One day it'll just suddenly end, and it's probably only a handful of generations until the pieces fall into place.


> ML is coming to a head, and we'll potentially have machines that outclass us in basic tasks in 25 - 50 years.

Just this line makes me skeptical of the rest of your thinking as well... what I've seen of ML applications leaves me strongly convinced that every use of ML will need human supervision and override control. Every case of "can't reach a human to get this corrected" is a call for boycotting / suing to force a human override channel for any automated system.

The "human experiment" is not an experiment, it is the fundamental reality, and no automation will ever truly replace human thoughts plus emotions plus inherent irrational priorities / preferences / biases.


I'm trying to get rid of my news addiction (so far failing spectacularly...), and one of my plans is to have a couple of sources that do not push a constant stream of crappy news, but occasional, selected, high-quality articles. At the moment I have Quanta Magazine and The Economist Leaders in the news section of my RSS reader. Does anyone have improvements or additions to this list? The niche can be almost anything, as long as it is only a few articles per week and the quality is top-notch.


Some news sources on my Atom reader:

Aeon https://aeon.co

Undark Magazine https://undark.org

The Conversation https://theconversation.com/global

Curie https://www.tidningencurie.se/en/home/ (based in Sweden)


http://nautil.us (don't let their broken https fool you into thinking they're not excellent!)


Firefox's `dom.security.https_only_mode` disagrees with you. As far as it's concerned Nautilus doesn't exist. ;-)

Edit: actually it upgrades to HTTPS without complaint! Something might be wrong with your software/internet.


Ahh, it seems something was going on on my side causing Firefox to complain about insecure elements.


As of now, Nautilus redirects to https on Chrome.


> and The Economist Leaders.

That's fine, but it will give you an extremely limited vision of the world: the technocratic, slightly anarcho-capitalistic, America-knows-best, social-liberal world view. If you go that way, at least add https://www.project-syndicate.org/, which has a somewhat wider spectrum (with actual good economists and not the stenographer of the day). Too bad the redesign is totally crap.


I think Quanta Magazine is directly funded by Jim Simons, the billionaire mathematician who founded Renaissance Technologies.


Note that's not the Breitbart-funding Renaissance founder. That's the other one.


Yup. It's affiliated with the Simons Foundation too, I think?


I agree wholeheartedly! If you don't know of it already, I also highly recommend http://nautil.us


While you can't get hard copy magazine issues, they do have anthologies of their works up to 2018 in book form [0][1].

[0] https://www.amazon.com/dp/0262536358/

[1] https://www.amazon.com/dp/026253634X/


Did you consider print on demand? Some services seem to have specialized in magazines.


> love that there is a publication with a significantly higher reading level assumed than most pop science content

Well, I am not sure about that. As happens with most pop-sci publications, the quality has taken a nosedive.

For example:

> A new thought experiment indicates that quantum mechanics doesn’t work without strange numbers that turn negative when squared.

I am not saying it is a "bad" magazine per se; all I am saying is that it is virtually indistinguishable from your SciAm, New Scientist, and Discover. Some good articles and plenty of fluff.

The level you want is in other places like:

- https://www.ams.org/notices

- https://physicstoday.scitation.org/journal/pto (free access to the 73 year archive)

- https://inference-review.com/issues (Ignore the 2014 crackpot articles)


But he says that he enjoys this publication?


He can decorate his walls with printouts of the articles for all I care. The point I made is that the magazine is no longer a high-level general scientific publication (which is what OP thinks it is); it has become a new Scientific American.


What is with that banner image? Time on the Y axis? Factor on the X? Some mythical line-fit of factor over time that doesn't match the data?

Great article though.

Yuck.


I felt the same about that picture, the original source is less pretty but much easier to read: https://commons.wikimedia.org/wiki/File:MatrixMultComplexity...


It made me think of the progression of men's 100m dash times. I was really hoping it would line up better. I think long jump might have been a better choice for a spurious correlation graph.

https://imgur.com/a/5quyLWC


I think I cracked it: it lags about 30 years behind the long jump progression

https://imgur.com/a/YHvXq7G


I definitely agree, but thinking about it I wonder if it was simply to get a "gut reaction" to the graphic - that things are getting better and we are reaching the limit of what is possible (in the context of matrix multiplication).

It's easier to get that feeling at a glance when the graphic is increasing left-to-right. With normal conventions the graphic would be decreasing. Which, of course, is what it is doing (and the goal). But the "gut reaction" to a decreasing graph is "things are doing worse"; the gut reaction to an increasing graph is "things are doing better".

I could be completely off-base, but that's the only reason I can think of to eschew normal axis conventions.


So put time on x axis and exponent on y, but make the exponent decrease from bottom to top. They've already reversed the x axis here, so they could've just done that for y instead.
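
For what it's worth, here is a quick matplotlib sketch of that layout; the milestone years and exponents are approximate, well-known values I filled in myself rather than numbers taken from the article's figure.

    # Sketch of the suggested layout: time on the x-axis, exponent on the y-axis,
    # flipped so that "up" means progress toward exponent 2.
    import matplotlib.pyplot as plt

    # A few well-known milestones (year, best known exponent), approximate values.
    milestones = [
        (1969, 2.807),      # Strassen
        (1990, 2.376),      # Coppersmith-Winograd
        (2014, 2.3728639),  # Le Gall
        (2020, 2.3728596),  # Alman-Williams
    ]
    years, exponents = zip(*milestones)

    fig, ax = plt.subplots()
    ax.step(years, exponents, where="post", marker="o")
    ax.axhline(2.0, linestyle="--", label="conjectured limit (exponent 2)")
    ax.invert_yaxis()  # smaller exponent (i.e. progress) now points upward
    ax.set_xlabel("year")
    ax.set_ylabel("matrix multiplication exponent")
    ax.legend()
    plt.show()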


Yep, that would be another option and perhaps a more suitable one to achieve the same effect -- assuming my guess as to "why" is even correct in the first place.


> What is with that banner image?

You beat me to it.


Came here for exactly that. WTF?


A graph with time on the Y axis ... a.k.a. why make things simple when you can instead confuse 80% of your readership with a simple mirror trick.


Yea, this surprised me at first - especially with the slope toward n^2 seeming to indicate that the remaining Y-axis quantity was known and measurable.

Additionally the quote below it:

> Mathematicians have been getting closer to their goal of reaching exponent two — meaning multiplying a pair of n-by-n matrices in only n^2 steps — but will they ever reach it?

seems particularly odd when the height of those steps is irrelevant in the graphic. If the entire graph were just a smooth straight line going from n^3 to n^2, it would signify the "best", or at least fastest, innovation; the "steppedness" of it, and especially the steep steps, indicates lags in progress rather than great leaps.


This is a common convention for annotating historical timelines. Even completely unrelated fields like relativity routinely place time on the y-axis.

Please take your snark elsewhere.


It is kind of weird and unintuitive. This one here, posted by another commenter, is much much more understandable: https://commons.wikimedia.org/wiki/File:MatrixMultComplexity...


It is only unintuitive if your mental representation of time flows to the right.

We all create a mental model of time at some point in our life, and then stick to it - often without ever talking about it to anyone.

But the representations vary substantially between people. Many people might have a timeline where the past is on the left and the future is on the right - maybe along the reading direction. But for some, future expands to the left. For others, future expands in front of them while the past is behind them (quite inconvenient).

It can make for a great party conversation to explore each other's timelines.

If there are parties, that is.


Are there areas of the world where that isn't the taught approach to the visualization of time? I've lived in only a small variety of countries but that's been pretty consistently taught from what I've seen. Some people may mentally model it in a different way but society seems to have settled on that specific approach and so most of the graphs we'll see and publish will follow it.


I have not seen this being taught. At least where I'm coming from, you had to figure this out on your own. And I guess most people settle on drawing the arrow of time in the reading direction. I wouldn't be surprised if the future is to the left in Arabic countries.


The WP image has an unfortunate offset on the Y axis, giving the impression that we're much closer to optimal than we actually are.


We don’t know this. I think a lot of people expect that there’s an O(n^(2+eps)) algorithm for any epsilon > 0, but that’s conjecture; we might already be at the optimum.


Well, I happen to agree with OP, and I found your "big dick, thug of this street" energy off-putting. So there is that.



Huh, it's been a long time since I heard the name Josh Alman. The last time I heard about him, he was cheating in academic trivia. Looks like he's managed to contribute more positively to society since then.



