Integer multiplication in time O(n log n) [pdf] (archives-ouvertes.fr)
225 points by throwawaymath on March 25, 2019 | 67 comments



For anyone wondering about the practicality of this algorithm: no, at least not for now. Constants aside, the algorithm currently requires an enormous cutoff for the recursion (anything below it is handled by traditional algorithms):

> Let `n0 := 2^(d^12) >= 2^4096`, and suppose that we wish to multiply integers with n bits. For `n >= n0` we will describe a recursive algorithm that reduces the problem to a collection of multiplication problems of size roughly n^(1/d). We will show that this algorithm achieves `M(n) = O(n log n)`, provided that `d >= 1729`. (p. 33)

2^(1729^12) ~= 10^(2.15 * 10^38). The authors do note the possibility of a much lower cutoff:

> In this section we outline a number of such modifications that together reduce the constant to `K = 8 + epsilon`, so that the modified algorithm achieves `M(n) = O(n log n)` for any `d >= 9` (rather than `d >= 1729`). (p. 39)

Here 2^(9^12) has about 85 billion decimal digits. Much smaller, but still too big to be practical.
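
As a back-of-the-envelope check on those figures (not from the paper, just arithmetic), here is a tiny Python sketch that computes the decimal length of the two cutoffs:

    import math

    def cutoff_digits(d):
        # n0 = 2^(d^12) bits; its decimal length is about d^12 * log10(2)
        return d**12 * math.log10(2)

    print(cutoff_digits(1729))  # ~2.15e38, i.e. n0 ~= 10^(2.15 * 10^38)
    print(cutoff_digits(9))     # ~8.5e10, i.e. about 85 billion digits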


There are practical problems in pure mathematics that require multiplying numbers with 85 billion digits. For example, polynomial multiplication can be reduced to integer multiplication, and certain problems, such as computing congruent numbers or class numbers of quadratic number fields, can be reduced to that. Naturally, the authors of the paper do not make any claims about the algorithm being practical, though.


Note that the number of digits is actually about 10^(85 billion), not 85 billion.


Yeah, the recent pi calculation amassed 31.4 trillion decimal digits (about 10^14 binary digits), which would certainly have involved multiplying integers of comparable size, among other operations. I'm not sure the new algorithm would help, though, because the currently widespread algorithm (Schönhage-Strassen) is in practice within two orders of magnitude of O(n log n) [1], and the previous asymptotic record holder (Fürer's algorithm) has proved extremely infeasible. Also note that the 85 billion digits do not relate to the constant itself: that figure is the cutoff above which the new algorithm can make a difference; below it, the algorithm reduces to the base case. I'm not really qualified to estimate the expected constant of this algorithm, though, so it might actually prove feasible.

[1] Its time complexity is O(n log n * log log n). The double logarithm grows very slowly; log log 2^(10^15) ~= 34, for reference. At this scale the constant is much more important than the asymptotic complexity itself. y-cruncher, a record-setting pi computation software, actually has a set of proprietary algorithms [2] optimized for modern hardware.

[2] http://www.numberworld.org/y-cruncher/internals/multiplicati...
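
(The log log figure in [1] is easy to reproduce with a couple of lines of Python, using natural logarithms, which is presumably what gives ~34:)

    import math

    # log log of a number with 10^15 bits: ln(ln(2^(10^15))) = ln(10^15 * ln 2)
    print(math.log(1e15 * math.log(2)))  # ~34.2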


> Yeah, the recent pi calculation amassed 31.4 trillion decimal digits (about 10^14 binary digits)

But that was done to be cool, not to solve any practical problem.

They needed 156 TiB of storage to hold that single number, and it happened to be all solid-state storage (because the majority of the time was spent on I/O, believe it or not!).

That is to say: a faster CPU wouldn't really make the computation much faster. You need faster storage when you're dealing with numbers that big.

Double-precision floats are what people need most of the time. Crypto guys need 2048- or 4096-bit numbers, or something around those sizes, for cryptography purposes. I'm not sure if large numbers are really used anywhere else.


If you agree that mathematical problems can be considered practical, many prime-hunting computations aim to resolve conjectures (Seventeen or Bust [1] was a good example), and are thus solving practical problems. I do think that particular pi calculation was nothing more than GCP advertisement ;-)

[1] https://en.wikipedia.org/wiki/Seventeen_or_Bust


Or for the audience here, 2^(9^12) is a number that takes roughly 100 gigabytes to store.

Edit: oh wait, 2^(9^12) is the number of bits. So we are actually talking about numbers of the size 2^(2^(9^12)).

“Too big to be practical” is a serious understatement here.


On the other hand, in practice log log n is basically constant, so we have had a practical n log n algorithm for a while :)


This very recent paper shows a matching (conditional) lower bound of Ω(n log n): https://arxiv.org/abs/1902.10935 (the hardness is from a conjecture in network coding).


Very interesting. I was just wondering what the lower bound could be (beyond Ω(n), which seems fairly obvious, even though as a hobbyist I'd be hard pressed to even prove that).


I'm making my way through uni currently. I've taken a few CS classes that have touched briefly on big-O notation, and I am currently in Calc 2. I understand bits and pieces of this, but I am lost in most of the paper. What courses should I take to better my understanding of papers of this sort? Or alternatively, are there any online resources that could help me work my way through research papers like these?


You can learn all of these ideas on your own (by e.g. going through textbooks and doing a significant proportion of the exercises), but the guidance of a course / expert is pretty helpful.

To understand computational complexity, take a course with a title like “theory of computation” or similar. https://en.wikipedia.org/wiki/Computational_complexity_theor...

To understand linear maps, tensor products, etc., take a course (or 2–3 courses) in linear algebra. To understand various matrix decompositions, take a course in numerical linear algebra. https://en.wikipedia.org/wiki/Tensor_product https://en.wikipedia.org/wiki/Cholesky_decomposition

To understand the FFT and convolutions, take a course in signal processing, maybe after a course in ordinary differential equations. https://en.wikipedia.org/wiki/Fast_Fourier_transform https://en.wikipedia.org/wiki/Convolution

To understand the theory of polynomial rings, take a course in abstract algebra. https://en.wikipedia.org/wiki/Polynomial_ring

To understand numerical approximations and error propagation, take a course in numerical analysis. https://en.wikipedia.org/wiki/Numerical_analysis

Courses in discrete math, algorithms, and complex analysis would also be helpful.


Thank you so much! I'll definitely be looking into courses in these subjects. I can learn a language on my own, but theory is something that I personally need someone to help walk me through.


Note that really understanding and grasping all of that can easily take a few years as a full-time student taking all of these courses.


>but the guidance of a course / expert is pretty helpful.

From my peers I get the feeling that this is underappreciated in the tech world where many successful developers skipped college.


> From my peers I get the feeling that this is underappreciated in the tech world where many successful developers skipped college.

I agree that most developers probably underestimate the value of guidance from a course/expert, and I feel that is largely due to the sheer amount of resources available for learning programming.

I have found that as I get into more and more advanced topics, the amount of resources available decreases, which has made me want guidance a lot more. Published papers really aren't a great way to learn advanced topics unless you are already an expert in that field.

Mathematics gets hit harder by this than computer science does. I think this is because it is incredibly interconnected and each topic is very deep; there has been lots of time for the cutting edge to grow.


This is something I really appreciated about some of my well-taught graduate classes (of course, "well-taught" is a big caveat). In undergrad there's often a sense that the professor is supposed to teach you the material. In advanced topics, though, I often found it most useful if the professor actually skipped most of the details, and served as more of a tour guide of the landscape of material for me. Once I know that some specific topic can be learned from a specific tutorial, book chapter, etc., then I can just go learn it on my own. But what's hard is figuring out how to make your way through this sequence entirely DIY, i.e. knowing what to read in what order, whether you're on the right track, where to look if you get stuck, and having someone to answer big-picture questions about how things fit together, why something that seems obvious to me isn't true, etc., etc. I've found some of the best courses helped with that.

A good textbook can also serve this role, but I've found good textbooks rarer than well-taught graduate classes. A lot of textbooks don't really seem to be designed for self-teaching the topic by reading it sequentially cover-to-cover. Maybe because they're often intended to be used in courses, they often have too much material, presented in a somewhat haphazard way, with an expectation that a course instructor will pick and choose parts and supplement it with lectures.


Well, they're two different things. This paper is, in fact, useless in the tech world, since the algorithm is not practical. The paper does, however, have value in advancing academic CS as an intellectual pursuit.


For the very basics, Chapter 30 of Introduction to Algorithms has a very good stand-alone introduction to the topic of large integer multiplication via polynomials and the FFT.

https://en.wikipedia.org/wiki/Introduction_to_Algorithms
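
If you want to play with the idea before getting the book, here is a minimal Python sketch in the spirit of that chapter (my own toy code, not from CLRS): treat the decimal digits as polynomial coefficients, convolve them with a floating-point FFT, then propagate carries. Real libraries use exact number-theoretic or higher-precision transforms instead, since float rounding limits how large the inputs can safely get.

    import cmath

    def fft(a, invert=False):
        # Recursive radix-2 Cooley-Tukey; len(a) must be a power of two.
        n = len(a)
        if n == 1:
            return a[:]
        even = fft(a[0::2], invert)
        odd = fft(a[1::2], invert)
        sign = 1 if invert else -1
        out = [0] * n
        for k in range(n // 2):
            w = cmath.exp(sign * 2j * cmath.pi * k / n)
            out[k] = even[k] + w * odd[k]
            out[k + n // 2] = even[k] - w * odd[k]
        return out

    def multiply(x, y):
        # Base-10 digit vectors, least significant digit first.
        a = [int(c) for c in reversed(str(x))]
        b = [int(c) for c in reversed(str(y))]
        size = 1
        while size < len(a) + len(b):
            size *= 2
        fa = fft([complex(d) for d in a + [0] * (size - len(a))])
        fb = fft([complex(d) for d in b + [0] * (size - len(b))])
        prod = fft([u * v for u, v in zip(fa, fb)], invert=True)
        digits = [round(c.real / size) for c in prod]   # inverse FFT needs the 1/size factor
        carry, out = 0, []
        for d in digits:                                # propagate carries back into base 10
            carry, digit = divmod(d + carry, 10)
            out.append(digit)
        while carry:
            carry, digit = divmod(carry, 10)
            out.append(digit)
        return int(''.join(str(d) for d in reversed(out)).lstrip('0') or '0')

    print(multiply(123456789, 987654321) == 123456789 * 987654321)  # True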


Thank you! I'll look into finding a copy soon


Try google :)


Try Library Genesis. :-)


thanks - new to me


I don't know how far you have gotten, but I think it is a good idea to start with FFT-based fast multiplication and read up on whatever unknowns you encounter. In particular I think some abstract algebra (e.g. Dummit-Foote) would be helpful.

There are tons of good expositions of the FFT approach; e.g. check out

"Prime Numbers: A Computational Perspective" by Crandall and Pomerance, ch. 9, "Fast algorithms for large-integer arithmetic".


If you'd like to see FFT multiplication in real-life, practical code, implemented and optimized to the hilt, look no further than the GMP library (both source and pre-built binary libs are available).

About the arbitrary precision multiply algorithm:

https://gmplib.org/manual/FFT-Multiplication.html

At large to very large sizes a Fermat style FFT multiplication is used, following Schönhage and Strassen (see References).

Quoted from the GMP web site: https://gmplib.org/

What is GMP?

GMP is a free library for arbitrary precision arithmetic, operating on signed integers, rational numbers, and floating-point numbers. There is no practical limit to the precision except the ones implied by the available memory in the machine GMP runs on. GMP has a rich set of functions, and the functions have a regular interface.

The main target applications for GMP are cryptography applications and research, Internet security applications, algebra systems, computational algebra research, etc.

GMP is carefully designed to be as fast as possible, both for small operands and for huge operands. The speed is achieved by using fullwords as the basic arithmetic type, by using fast algorithms, with highly optimised assembly code for the most common inner loops for a lot of CPUs, and by a general emphasis on speed.

The first GMP release was made in 1991. It is continually developed and maintained, with a new release about once a year.

Since version 6, GMP is distributed under the dual licenses, GNU LGPL v3 and GNU GPL v2. These licenses make the library free to use, share, and improve, and allow you to pass on the result. The GNU licenses give freedoms, but also set firm restrictions on the use with non-free programs.

GMP is part of the GNU project. For more information about the GNU project, please see the official GNU web site.

GMP's main target platforms are Unix-type systems, such as GNU/Linux, Solaris, HP-UX, Mac OS X/Darwin, BSD, AIX, etc. It also is known to work on Windows in both 32-bit and 64-bit mode.

GMP is brought to you by a team listed in the manual.

GMP is carefully developed and maintained, both technically and legally. We of course inspect and test contributed code carefully, but equally importantly we make sure we have the legal right to distribute the contributions, meaning users can safely use GMP. To achieve this, we will ask contributors to sign paperwork where they allow us to distribute their work.
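
If you just want to poke at GMP from a scripting language rather than C, here's a rough timing sketch using the gmpy2 Python bindings (assuming gmpy2 is installed; the numbers are of course machine-dependent):

    import random
    import time
    import gmpy2

    random.seed(0)
    bits = 3_000_000                                   # ~900k decimal digits each
    a = random.getrandbits(bits) | (1 << (bits - 1))   # force full bit length
    b = random.getrandbits(bits) | (1 << (bits - 1))
    ga, gb = gmpy2.mpz(a), gmpy2.mpz(b)                # convert outside the timed region

    t0 = time.perf_counter()
    c1 = a * b                       # CPython bigints: Karatsuba above a small threshold
    t1 = time.perf_counter()
    c2 = ga * gb                     # GMP: picks Toom-Cook or Schönhage-Strassen FFT by size
    t2 = time.perf_counter()

    print(f"python: {t1 - t0:.3f}s  gmp: {t2 - t1:.3f}s  match: {int(c2) == c1}")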


The abstract of this paper is refreshingly succinct:

We present an algorithm that computes the product of two n-bit integers in O(n log n) bit operations.

The result is excellent, and it closes (in the affirmative) the Schönhage-Strassen conjecture first postulated in 1971.


It establishes the upper bound, but I don't believe this paper establishes a lower bound, so the conjecture is still open (an algorithm faster than n log n would violate the conjecture).


You make a good point, nice catch. Technically we don't know what the fastest possible algorithm is yet.


Just to be sure I understand: doesn't the conjecture postulate that n log n is the lower bound, so _anything_ beneath n log n would violate it?


Yes (technically, "beneath n log n" would mean o(n log n), i.e. little-o). The conjecture is that it's possible to do in O(n log n) time, which this paper proves, and that it's not possible to do it any faster (which is still an open problem).


The second author of this paper created TeXmacs btw.


And the first contributed much to the large integer and polynomial multiplication code in SageMath...


It's coincidentally pleasing that the cutoff `d >= 1729` is the Hardy-Ramanujan number.


Wow, I learned something today; I had always thought of integer multiplication as a constant-time operation.

This is why I love computer science: no matter how much I think I know, I can feel like an idiot a day later.


If by integer you mean the four or so bytes that many languages use to represent a (bounded) int, then it is indeed constant. This is often the context in software engineering.


I'm somewhat interested in compilers and computation theory, which is why this was a surprise to me.


For some reason, a lot of people seem to think that operations that have a low level implementation (e.g. basic instruction in some ISA) are O(1).


Tangential question: how does single-cycle multiplication work in ARM Cortex processors? Do they just run the ALU clock fast enough that the multiplication finishes in one visible core cycle, or is there a binary trick?


Any binary operation (subject to the usual "well-behaved" caveats) can be split up into any number of steps; this is called pipelining. In fact, there's no fundamental reason why multiplication can't be a single-cycle step, other than speed and energy efficiency.

The maximum clock speed of an operation is limited by how fast a change in the input can be propagated into the correct change in the output. You can get around this by splitting the operation into multiple mini-steps (pipelining) and clocking it as fast as the slowest stage will go.


Multiplication is done in hardware with a large parallel array of product and sum components:

https://en.wikipedia.org/wiki/Dadda_multiplier

Pipelining, as mentioned in a sibling comment, is used to increase throughput.


The paper applies to integer multiplication on a Turing machine. Actual digital circuits are not bound to that constraint.


>Actual digital circuits are not bound to that constraint.

Yes they are. Hardware can parallelize, but it does not remove asymptotic complexity for arbitrarily sized problems; it only changes the hidden constant cost.

The paper's result is for unbounded n. Hardware has a fixed size n, and for a fixed n, any problem is constant time, O(1).


Hardware might have fixed size n, but a general approach certainly doesn't!

I must have failed badly at making my point clear, because I've gotten the same criticism more than once.

Take as an example this paper [1], which explicitly states the delay of an n-bit Baugh-Wooley multiplier as O(log n).

I appreciate that this might not be what people are discussing here - if so, excuse me for mixing things up. I answered a question that asked about a connection between the Turing machine algorithm and practical circuit design, and I tried as best I could to give a practical circuit-design perspective.

[1] https://www.irjet.net/archives/V3/i2/IRJET-V3I259.pdf


I'm familiar with such results. TAOCP has quite a few I've fiddled with over the years. It has an O(n) multiply, for example.

Most (all?) algorithms can be made O(log n) by throwing hardware at them. The hardware is basically a lookup table: you can inspect each bit of a length-n problem in O(log n) time and output the answer.

Papers like this use problem structure to reduce that generally exponential circuit growth to (usually) ~linear growth. But all the ones I've seen require unbounded circuit growth, as does this one, which has O(n log n) circuit size.

By allowing hardware to grow at an unbounded rate, it's not really solving the same problem.


Yep, I appreciate that there is a time/space tradeoff, see this [1] from ~25 minutes ago. I agree it's not exactly the same thing. I just wanted my original comment to not be misrepresented.

[1] https://news.ycombinator.com/item?id=19495509


Sorry if I misrepresented you. Allowing unbounded resources makes all problems trivial, whether hardware or software, using the same lookup trick.

Thus there's nothing gained from hardware, so claiming there is a constraint on software that is not in hardware isn't really correct. Both have the same constraints for fixed-size machines, and both allow fast solutions for unbounded-size machines.

There are plenty of such "other machines" that solve NP-hard problems in P time, such as closed timelike curves, chaos machines, infinitely precise engineering, etc., but such machines are generally not physically realizable, and most are not even cost-effective for small n, so people use complexity to mean how the problem scales on physically realizable, cost-effective devices.

If every conversation about complexity had to go through this same digression to clarify that there are indeed other areas of thought about computation that have different conclusions, then no one could cover the original topic.

Currently the Church–Turing–Deutsch principle seems to be the best formulation of such issues, and Deutsch added the physically realizable part to remove from consideration things that are math but not physics, since such machine concepts are not useful to do actual computation.

All good stuff :)


Andy Yao proved a lower bound on circuit complexity for multiplication: circuit width × depth = Ω(n^2). Trivial carry circuits for adders are O(n).


Is there a known example of something a digital computer can do that a Turing machine cannot?


Turing machines can simulate any other machine, but they can't do what any machine can do. You can simulate a toaster, but only a real machine can actually make toast.


> Is there a known example of something a digital computer can do that a Turing machine cannot?

- Physically exist, since a Turing machine is a purely abstract mathematical model that, because of the infinite length of its tape, can never be realized physically.

- Many computation models that are based on a Turing machine follow this idea:

a) Write the input of the algorithm on the tape

b) Let the Turing machine do the computation

c) When the Turing machine terminates, read the output from the tape.

The problem is that many things one wants to do on computers do not fit this model:

* interact with the outside world

* programs that do not terminate: for example, an OS kernel is not supposed to stop when some computation stops, but should continue scheduling processes.

Both are simply not part of a computation based on the above idea. I am aware that extensions exist in which such things are part of the model, of course. But there the definition of the Turing machine is IMHO much more "artificial" than its original intent of defining "computability".


In terms of computational power, real computers (ignoring physics) are linear bounded automata, which is a class of automata strictly weaker than Turing machines. Throwing physics into the mix means they’re unreliable computing devices.


Real computers generally have external storage and/or network interfaces, which makes them unbounded.


No, you're still limited by the computational power of the universe. Since there is a finite amount of energy and a finite number of particles, you're limited to a finite number of states. The entire universe is no more powerful than a very large, but finite, linear bounded automaton.


overheat.


Are there any non-FFT multiplication algorithms that are faster than O(n^2)?

I wonder if you could try to find a systematic way of 'efficiently' representing integers that is amenable to multiplication. Consider squaring 9999, for example. The standard algorithm decomposes 9999 as (9000 + 900 + 90 + 9), then uses distributivity of multiplication, which is clearly n^2 in the number of digits. However, you can achieve the same result more efficiently via 9999 = (10000 - 1), which requires fewer multiplications to square. Can efficient representations of integers be found systematically?


> Are there any non-FFT multiplication algorithms that are faster than O(n^2)?

https://en.wikipedia.org/wiki/Karatsuba_algorithm


Yes, the first one found was Karatsuba multiplication. It's based on splitting the numbers roughly in half and doing fewer smaller multiplications.

It is quite interesting to learn; the algorithm is both easy to understand and surprising in its result (though less surprising if you already know the FFT or this result).
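
For concreteness, here is a compact Python sketch of Karatsuba (an educational toy; Python's built-in ints already multiply fast): split each number at 2^m and reuse one extra product so only three half-size multiplications are needed:

    def karatsuba(x, y):
        # Non-negative integers only; tiny operands fall back to the native multiply.
        if x < 10 or y < 10:
            return x * y
        m = max(x.bit_length(), y.bit_length()) // 2
        x_hi, x_lo = x >> m, x & ((1 << m) - 1)   # x = x_hi * 2^m + x_lo
        y_hi, y_lo = y >> m, y & ((1 << m) - 1)
        lo = karatsuba(x_lo, y_lo)
        hi = karatsuba(x_hi, y_hi)
        mid = karatsuba(x_lo + x_hi, y_lo + y_hi) - lo - hi  # only 3 recursive multiplies
        return (hi << (2 * m)) + (mid << m) + lo

    print(karatsuba(123456789, 987654321) == 123456789 * 987654321)  # True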


Nice, thanks (to you and the sibling comment, who got there at the same time)


A little frustrated here: I was just trying to get background on how fast integer multiplication is and what the fastest algorithms are (and whether this is an advancement), but the Wikipedia article on the topic barely mentions big-O.

https://en.wikipedia.org/wiki/Multiplication_algorithm


Interesting... what are the chances something like this gets implemented in silicon and actually speeds up computation, or is this purely of theoretical interest?


If the constants aren't too big, this might find use in signal processing.


The cutoff is massive. Like billions of digits.


For actual silicon, this does not seem like a relevant result. I think there are already O(log n)-time multiplier circuits out there.

Edit: A typical imul will probably be no more than O(n), just to add something less speculative.


Big-O notation is asymptotic notation, so it is meaningless to describe an imul as being O(n). Given that an imul does a 64x64-bit multiplication, it is almost a tautology to say it can be done in a constant times 64 ops/cycles.


That was not what I was trying to say, wrong as I may still be.

I meant to talk about nxn-bit multiplication. If you scale n then, given the same basic architecture, you will also scale the circuit delay. When the delay scales linearly with the number of bits, I'd call that architecture O(n) in time. To me that seems to make intuitive sense, even though I might have it wrong. I used the term imul merely as shorthand for integer multiplication; I was not alluding to any specific architecture or width, and there are plenty of CPU architectures out there using that mnemonic.


> I think there are already time O(log n) multiplier circuits out there.

This can only be magic! It's not possible!


There's always a time/space tradeoff involved, see for example the Wallace tree multiplier [1], which is indeed O(log n) in time.

[1] https://en.wikipedia.org/wiki/Wallace_tree
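
To see where the O(log n) comes from: each Wallace/Dadda stage uses 3:2 compressors (full adders) to turn groups of three partial-product rows into two, so the row count shrinks geometrically. A rough counting sketch (a simplified model that ignores the final carry-propagate adder):

    def wallace_stages(n):
        # n partial-product rows; each stage maps groups of 3 rows to 2.
        rows, stages = n, 0
        while rows > 2:
            rows = 2 * (rows // 3) + rows % 3
            stages += 1
        return stages

    for n in (8, 16, 32, 64, 128):
        print(n, wallace_stages(n))   # grows like log_{3/2}(n): 4, 6, 8, 10, 11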



