Integer multiplication in time O(n log n) [pdf] (archives-ouvertes.fr)
225 points by throwawaymath on March 25, 2019 | 67 comments



For anyone wondering about the practicality of this algorithm: no, at least not for now. Constants aside, the algorithm currently requires an enormous cutoff for the recursion (anything below it is handled by traditional algorithms):

> Let `n0 := 2^(d^12) >= 2^4096`, and suppose that we wish to multiply integers with n bits. For `n >= n0` we will describe a recursive algorithm that reduces the problem to a collection of multiplication problems of size roughly n^(1/d). We will show that this algorithm achieves `M(n) = O(n log n)`, provided that `d >= 1729`. (p. 33)

2^(1729^12) ~= 10^(2.15 * 10^38). The authors do note the possibility of a much lower cutoff:

> In this section we outline a number of such modifications that together reduce the constant to `K = 8 + epsilon`, so that the modified algorithm achieves `M(n) = O(n log n)` for any `d >= 9` (rather than `d >= 1729`). (p. 39)

Here 2^(9^12) has about 85 billion decimal digits. Much smaller, but still too big to be practical.
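
As a back-of-the-envelope check on those figures (not from the paper, just arithmetic), here is a tiny Python sketch that computes the decimal length of the two cutoffs:

    import math

    def cutoff_digits(d):
        # n0 = 2^(d^12) bits; its decimal length is about d^12 * log10(2)
        return d**12 * math.log10(2)

    print(cutoff_digits(1729))  # ~2.15e38, i.e. n0 ~= 10^(2.15 * 10^38)
    print(cutoff_digits(9))     # ~8.5e10, i.e. about 85 billion digits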


There are practical problems in pure mathematics that require multiplying numbers with 85 billion digits. For example, polynomial multiplication can be reduced to integer multiplication, and certain problems, such as computing congruent numbers or class numbers of quadratic number fields, can be reduced to that. Naturally, the authors of the paper do not make any claims about the algorithm being practical, though.


Note that the number of digits is actually about 10^(85 billion), not 85 billion.


Yeah, the recent pi calculation amassed 31.4 trillion decimal digits (about 10^14 binary digits), which would certainly have involved multiplying integers of comparable size, among other operations. I'm not sure the new algorithm would help, though, because the currently widespread algorithm (Schönhage-Strassen) is in practice within two orders of magnitude of O(n log n) [1], and the previous asymptotic record holder (Fürer's algorithm) has proved extremely infeasible. Also note that the 85 billion digits do not relate to the constant itself: that figure is the cutoff above which the new algorithm can make a difference; below it, the algorithm reduces to the base case. I'm not really qualified to estimate the expected constant of this algorithm, though, so it might actually prove feasible.

[1] Its time complexity is O(n log n * log log n). The double logarithm grows very slowly; log log 2^(10^15) ~= 34, for reference. At this scale the constant is much more important than the asymptotic complexity itself. y-cruncher, a record-setting pi computation software, actually has a set of proprietary algorithms [2] optimized for modern hardware.

[2] http://www.numberworld.org/y-cruncher/internals/multiplicati...
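
(The log log figure in [1] is easy to reproduce with a couple of lines of Python, using natural logarithms, which is presumably what gives ~34:)

    import math

    # log log of a number with 10^15 bits: ln(ln(2^(10^15))) = ln(10^15 * ln 2)
    print(math.log(1e15 * math.log(2)))  # ~34.2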


> Yeah, the recent pi calculation amassed 31.4 trillion decimal digits (about 10^14 binary digits)

But that was done to be cool, not to solve any practical problem.

They needed 156 TiB of storage to hold that single number, and it happened to be all solid-state storage (because the majority of the time was spent on I/O, believe it or not!).

That is to say: a faster CPU wouldn't really make the computation much faster. You need faster storage when you're dealing with numbers that big.

Double-precision floats are what people need most of the time. Crypto guys need 2048- or 4096-bit numbers, or something around those sizes, for cryptography purposes. I'm not sure if large numbers are really used anywhere else.


If you agree that mathematical problems can be considered practical, many prime-hunting computations aim to resolve conjectures (Seventeen or Bust [1] was a good example), and are thus solving practical problems. I do think that particular pi calculation was nothing more than GCP advertisement ;-)

[1] https://en.wikipedia.org/wiki/Seventeen_or_Bust


Or for the audience here, 2^(9^12) is a number that takes roughly 100 gigabytes to store.

Edit: oh wait, 2^(9^12) is the number of bits. So we are actually talking about numbers of the size 2^(2^(9^12)).

“Too big to be practical” is a serious understatement here.


On the other hand, in practice log log n is basically constant, so we have had a practical n log n algorithm for a while :)


This very recent paper shows a matching (conditional) lower bound of Ω(n log n): https://arxiv.org/abs/1902.10935 (the hardness is from a conjecture in network coding).


Very interesting. I was just wondering what the lower bound could be (beyond Ω(n), which seems fairly obvious, even though as a hobbyist I'd be hard pressed to even prove that).


I'm making my way through uni currently. I've taken a few CS classes that have touched briefly on big-O notation, and I am currently in Calc 2. I understand bits and pieces of this, but I am lost in most of the paper. What courses should I take to better my understanding of papers of this sort? Or alternatively, are there any online resources that could help me work my way through research papers like these?


You can learn all of these ideas on your own (by e.g. going through textbooks and doing a significant proportion of the exercises), but the guidance of a course / expert is pretty helpful.

To understand computational complexity, take a course with a title like “theory of computation” or similar. https://en.wikipedia.org/wiki/Computational_complexity_theor...

To understand linear maps, tensor products, etc., take a course (or 2–3 courses) in linear algebra. To understand various matrix decompositions, take a course in numerical linear algebra. https://en.wikipedia.org/wiki/Tensor_product https://en.wikipedia.org/wiki/Cholesky_decomposition

To understand the FFT and convolutions, take a course in signal processing, maybe after a course in ordinary differential equations. https://en.wikipedia.org/wiki/Fast_Fourier_transform https://en.wikipedia.org/wiki/Convolution

To understand the theory of polynomial rings, take a course in abstract algebra. https://en.wikipedia.org/wiki/Polynomial_ring

To understand numerical approximations and error propagation, take a course in numerical analysis. https://en.wikipedia.org/wiki/Numerical_analysis

Courses in discrete math, algorithms, and complex analysis would also be helpful.


Thank you so much! I'll definitely be looking into courses in these subjects. I can learn a language on my own, but theory is something that I personally need someone to help walk me through.


Note that really understanding and grasping all of that can easily take a few years as a full-time student taking all of these courses.


>but the guidance of a course / expert is pretty helpful.

From my peers I get the feeling that this is underappreciated in the tech world where many successful developers skipped college.


> From my peers I get the feeling that this is underappreciated in the tech world where many successful developers skipped college.

I agree that most developers probably underestimate the value of guidance from a course/expert, and I feel that is largely due to the sheer amount of resources available for learning programming.

I have found that as I get into more and more advanced topics, the amount of resources available decreases, which has made me want guidance a lot more. Published papers really aren't a great way to learn advanced topics unless you are already an expert in that field.

Mathematics gets hit harder by this than computer science does. I think this is because it is incredibly interconnected and each topic is very deep; there has been lots of time for the cutting edge to grow.


This is something I really appreciated about some of my well-taught graduate classes (of course, "well-taught" is a big caveat). In undergrad there's often a sense that the professor is supposed to teach you the material. In advanced topics, though, I often found it most useful if the professor actually skipped most of the details, and served as more of a tour guide of the landscape of material for me. Once I know that some specific topic can be learned from a specific tutorial, book chapter, etc., then I can just go learn it on my own. But what's hard is figuring out how to make your way through this sequence entirely DIY, i.e. knowing what to read in what order, whether you're on the right track, where to look if you get stuck, and having someone to answer big-picture questions about how things fit together, why something that seems obvious to me isn't true, etc., etc. I've found some of the best courses helped with that.

A good textbook can also serve this role, but I've found good textbooks rarer than well-taught graduate classes. A lot of textbooks don't really seem to be designed for self-teaching the topic by reading it sequentially cover-to-cover. Maybe because they're often intended to be used in courses, they often have too much material, presented in a somewhat haphazard way, with an expectation that a course instructor will pick and choose parts and supplement it with lectures.


Well, they're two different things. This paper is, in fact, useless in the tech world, since the algorithm is not practical. The paper does, however, have value in advancing academic CS as an intellectual pursuit.


For the very basics, Chapter 30 of Introduction to Algorithms has a very good stand-alone introduction to the topic of large integer multiplication via polynomials and the FFT.

https://en.wikipedia.org/wiki/Introduction_to_Algorithms
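
If you want to play with the idea before getting the book, here is a minimal Python sketch in the spirit of that chapter (my own toy code, not from CLRS): treat the decimal digits as polynomial coefficients, convolve them with a floating-point FFT, then propagate carries. Real libraries use exact number-theoretic or higher-precision transforms instead, since float rounding limits how large the inputs can safely get.

    import cmath

    def fft(a, invert=False):
        # Recursive radix-2 Cooley-Tukey; len(a) must be a power of two.
        n = len(a)
        if n == 1:
            return a[:]
        even = fft(a[0::2], invert)
        odd = fft(a[1::2], invert)
        sign = 1 if invert else -1
        out = [0] * n
        for k in range(n // 2):
            w = cmath.exp(sign * 2j * cmath.pi * k / n)
            out[k] = even[k] + w * odd[k]
            out[k + n // 2] = even[k] - w * odd[k]
        return out

    def multiply(x, y):
        # Base-10 digit vectors, least significant digit first.
        a = [int(c) for c in reversed(str(x))]
        b = [int(c) for c in reversed(str(y))]
        size = 1
        while size < len(a) + len(b):
            size *= 2
        fa = fft([complex(d) for d in a + [0] * (size - len(a))])
        fb = fft([complex(d) for d in b + [0] * (size - len(b))])
        prod = fft([u * v for u, v in zip(fa, fb)], invert=True)
        digits = [round(c.real / size) for c in prod]   # inverse FFT needs the 1/size factor
        carry, out = 0, []
        for d in digits:                                # propagate carries back into base 10
            carry, digit = divmod(d + carry, 10)
            out.append(digit)
        while carry:
            carry, digit = divmod(carry, 10)
            out.append(digit)
        return int(''.join(str(d) for d in reversed(out)).lstrip('0') or '0')

    print(multiply(123456789, 987654321) == 123456789 * 987654321)  # True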


Thank you! I'll look into finding a copy soon


Try google :)


Try Library Genesis. :-)


thanks - new to me


I don't know how far you have gotten, but I think it is a good idea to start with FFT-based fast multiplication and read up on whatever unknowns you encounter. In particular I think some abstract algebra (e.g. Dummit-Foote) would be helpful.

There are tons of good expositions of the FFT approach; e.g. check out

"Prime Numbers: A Computational Perspective" by Crandall and Pomerance, ch. 9, "Fast algorithms for large-integer arithmetic".


If you'd like to see FFT multiplication in real-life, practical code, implemented and optimized to the hilt, look no further than the GMP library (both source and pre-built binary libs are available).

About the arbitrary precision multiply algorithm:

https://gmplib.org/manual/FFT-Multiplication.html

At large to very large sizes a Fermat style FFT multiplication is used, following Schönhage and Strassen (see References).

Quoted from the GMP web site: https://gmplib.org/

What is GMP?

GMP is a free library for arbitrary precision arithmetic, operating on signed integers, rational numbers, and floating-point numbers. There is no practical limit to the precision except the ones implied by the available memory in the machine GMP runs on. GMP has a rich set of functions, and the functions have a regular interface.

The main target applications for GMP are cryptography applications and research, Internet security applications, algebra systems, computational algebra research, etc.

GMP is carefully designed to be as fast as possible, both for small operands and for huge operands. The speed is achieved by using fullwords as the basic arithmetic type, by using fast algorithms, with highly optimised assembly code for the most common inner loops for a lot of CPUs, and by a general emphasis on speed.

The first GMP release was made in 1991. It is continually developed and maintained, with a new release about once a year.

Since version 6, GMP is distributed under the dual licenses, GNU LGPL v3 and GNU GPL v2. These licenses make the library free to use, share, and improve, and allow you to pass on the result. The GNU licenses give freedoms, but also set firm restrictions on the use with non-free programs.

GMP is part of the GNU project. For more information about the GNU project, please see the official GNU web site.

GMP's main target platforms are Unix-type systems, such as GNU/Linux, Solaris, HP-UX, Mac OS X/Darwin, BSD, AIX, etc. It also is known to work on Windows in both 32-bit and 64-bit mode.

GMP is brought to you by a team listed in the manual.

GMP is carefully developed and maintained, both technically and legally. We of course inspect and test contributed code carefully, but equally importantly we make sure we have the legal right to distribute the contributions, meaning users can safely use GMP. To achieve this, we will ask contributors to sign paperwork where they allow us to distribute their work.
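
If you just want to poke at GMP from a scripting language rather than C, here's a rough timing sketch using the gmpy2 Python bindings (assuming gmpy2 is installed; the numbers are of course machine-dependent):

    import random
    import time
    import gmpy2

    random.seed(0)
    bits = 3_000_000                                   # ~900k decimal digits each
    a = random.getrandbits(bits) | (1 << (bits - 1))   # force full bit length
    b = random.getrandbits(bits) | (1 << (bits - 1))
    ga, gb = gmpy2.mpz(a), gmpy2.mpz(b)                # convert outside the timed region

    t0 = time.perf_counter()
    c1 = a * b                       # CPython bigints: Karatsuba above a small threshold
    t1 = time.perf_counter()
    c2 = ga * gb                     # GMP: picks Toom-Cook or Schönhage-Strassen FFT by size
    t2 = time.perf_counter()

    print(f"python: {t1 - t0:.3f}s  gmp: {t2 - t1:.3f}s  match: {int(c2) == c1}")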


The abstract of this paper is refreshingly succinct:

We present an algorithm that computes the product of two n-bit integers in O(n log n) bit operations.

The result is excellent, and it closes (in the affirmative) the Schönhage-Strassen conjecture first postulated in 1971.


It establishes the upper bound, but I don't believe this paper establishes a lower bound, so the conjecture is still open (an algorithm faster than n log n would violate the conjecture).


You make a good point, nice catch. Technically we don't know what the fastest possible algorithm is yet.


Just to be sure I understand: doesn't the conjecture postulate that n log n is the lower bound, so _anything_ beneath n log n would violate it?


Yes (technically, "beneath n log n" would mean o(n log n), i.e. little-o). The conjecture is that it's possible to do in O(n log n) time, which this paper proves, and that it's not possible to do it any faster (which is still an open problem).


The second author of this paper created TeXmacs btw.


And the first contributed much to the large integer and polynomial multiplication code in SageMath...


It's coincidentally pleasing that the cutoff `d >= 1729` is the Hardy-Ramanujan number.


Wow, I learned something today; I had always thought of integer multiplication as a constant-time operation.

This is why I love computer science: no matter how much I think I know, I can feel like an idiot a day later.


If by integer you mean the four or so bytes that many languages use to represent a (bounded) int, then it is indeed constant. This is often the context in software engineering.


I'm somewhat interested in compilers and computation theory, which is why this was a surprise to me.


For some reason, a lot of people seem to think that operations that have a low level implementation (e.g. basic instruction in some ISA) are O(1).


Tangential question: how does single-cycle multiplication work in ARM Cortex processors? Do they just run the ALU clock fast enough that the multiplication finishes in one visible core cycle, or is there a binary trick?


Any binary operation (subject to the usual "well-behaved" caveats) can be split up into any number of steps; this is called pipelining. In fact, there's no fundamental reason why multiplication can't be a single-cycle step, other than speed and energy efficiency.

The maximum clock speed of an operation is limited by how fast a change in the input can be propagated into the correct change in the output. You can get around this by splitting the operation into multiple mini-steps (pipelining) and clocking it as fast as the slowest stage will go.


Multiplication is done in hardware with a large parallel array of product and sum components:

https://en.wikipedia.org/wiki/Dadda_multiplier

Pipelining, as mentioned in a sibling comment, is used to increase throughput.


The paper applies to integer multiplication on a Turing machine. Actual digital circuits are not bound to that constraint.


>Actual digital circuits are not bound to that constraint.

Yes they are. Hardware can parallelize, but it does not remove asymptotic complexity for arbitrarily sized problems; it only changes the hidden constant cost.

The paper's result is for unbounded n. Hardware has a fixed size n, and for a fixed n, any problem is constant time, O(1).


Hardware might have fixed size n, but a general approach certainly doesn't!

I must have failed badly at making my point clear, because I've gotten the same criticism more than once.

Take as an example this paper [1], which explicitly states the delay of an n-bit Baugh-Wooley multiplier as O(log n).

I appreciate that this might not be what people are discussing here - if so, excuse me for mixing things up. I answered a question that asked about a connection between the Turing machine algorithm and practical circuit design, and I tried as best I could to give a practical circuit-design perspective.

[1] https://www.irjet.net/archives/V3/i2/IRJET-V3I259.pdf


I'm familiar with such results. TAOCP has quite a few I've fiddled with over the years. It has an O(n) multiply, for example.

Most (all?) algorithms can be made O(log n) by throwing hardware at them. The hardware is basically a lookup table: you can inspect each bit of a length-n problem in O(log n) time and output the answer.

Papers like this use problem structure to reduce that generally exponential circuit growth to (usually) ~linear growth. But all the ones I've seen require unbounded circuit growth, as does this one, which has O(n log n) circuit size.

By allowing hardware to grow at an unbounded rate, it's not really solving the same problem.


Yep, I appreciate that there is a time/space tradeoff, see this [1] from ~25 minutes ago. I agree it's not exactly the same thing. I just wanted my original comment to not be misrepresented.

[1] https://news.ycombinator.com/item?id=19495509


Sorry if I misrepresented you. Allowing unbounded resources makes all problems trivial, whether hardware or software, using the same lookup trick.

Thus there's nothing gained from hardware, so claiming there is a constraint on software that is not in hardware isn't really correct. Both have the same constraints for fixed-size machines, and both allow fast solutions for unbounded-size machines.

There are plenty of such "other machines" that solve NP-hard problems in P time, such as closed timelike curves, chaos machines, infinitely precise engineering, etc., but such machines are generally not physically realizable, and most are not even cost-effective for small n, so people use complexity to mean how the problem scales on physically realizable, cost-effective devices.

If every conversation about complexity had to go through this same digression to clarify that there are indeed other areas of thought about computation that have different conclusions, then no one could cover the original topic.

Currently the Church–Turing–Deutsch principle seems to be the best formulation of such issues, and Deutsch added the physically realizable part to remove from consideration things that are math but not physics, since such machine concepts are not useful to do actual computation.

All good stuff :)


Andy Yao proved a lower bound on circuit complexity for multiplication: circuit width × depth = Ω(n^2). Trivial carry circuits for adders are O(n).


Is there a known example of something a digital computer can do that a Turing machine cannot?


Turing machines can simulate any other machine, but they can't do what any machine can do. You can simulate a toaster, but only a real machine can actually make toast.


> Is there a known example of something a digital computer can do that a Turing machine cannot?

- Physically exist, since a Turing machine is a purely abstract mathematical model that, because of the infinite length of its tape, can never be realized physically.

- Many computation models that are based on a Turing machine follow this idea:

a) Write the input of the algorithm on the tape

b) Let the Turing machine do the computation

c) When the Turing machine terminates, read the output from the tape.

The problem is that many things one wants to do on computers do not fit this model:

* interact with the outside world

* programs that do not terminate: for example, an OS kernel is not supposed to stop when some computation stops, but should continue scheduling processes.

Both are simply not part of a computation based on the above idea. I am aware that extensions exist in which such things are part of the model, of course. But there the definition of the Turing machine is IMHO much more "artificial" than its original intent of defining "computability".


In terms of computational power, real computers (ignoring physics) are linear bounded automata, which is a class of automata strictly weaker than Turing machines. Throwing physics into the mix means they’re unreliable computing devices.


Real computers generally have external storage and/or network interfaces, which makes them unbounded.


No, you're still limited by the computational power of the universe. Since there is a finite amount of energy and a finite number of particles, you're limited to a finite number of states. The entire universe is no more powerful than a very large, but finite, linear bounded automaton.


overheat.


Are there any non-FFT multiplication algorithms that are faster than O(n^2)?

I wonder if you could try to find a systematic way of 'efficiently' representing integers that is amenable to multiplication. Consider squaring 9999, for example. The standard algorithm decomposes 9999 as (9000 + 900 + 90 + 9), then uses distributivity of multiplication, which is clearly n^2 in the number of digits. However, you can achieve the same result more efficiently via 9999 = (10000 - 1), which requires fewer multiplications to square. Can efficient representations of integers be found systematically?


> Are there any non-FFT multiplication algorithms that are faster than O(n^2)?

https://en.wikipedia.org/wiki/Karatsuba_algorithm


Yes, the first one found was Karatsuba multiplication. It's based on splitting the numbers roughly in half and doing fewer smaller multiplications.

It is quite interesting to learn; the algorithm is both easy to understand and surprising in its result (though less surprising if you already know the FFT or this result).
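
For concreteness, here is a compact Python sketch of Karatsuba (an educational toy; Python's built-in ints already multiply fast): split each number at 2^m and reuse one extra product so only three half-size multiplications are needed:

    def karatsuba(x, y):
        # Non-negative integers only; tiny operands fall back to the native multiply.
        if x < 10 or y < 10:
            return x * y
        m = max(x.bit_length(), y.bit_length()) // 2
        x_hi, x_lo = x >> m, x & ((1 << m) - 1)   # x = x_hi * 2^m + x_lo
        y_hi, y_lo = y >> m, y & ((1 << m) - 1)
        lo = karatsuba(x_lo, y_lo)
        hi = karatsuba(x_hi, y_hi)
        mid = karatsuba(x_lo + x_hi, y_lo + y_hi) - lo - hi  # only 3 recursive multiplies
        return (hi << (2 * m)) + (mid << m) + lo

    print(karatsuba(123456789, 987654321) == 123456789 * 987654321)  # True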


Nice, thanks (to you and the sibling comment, who got there at the same time)


A little frustrated here: I was just trying to get background on how fast integer multiplication is and what the fastest algorithms are (and whether this is an advancement), but the Wikipedia article on the topic barely mentions big-O.

https://en.wikipedia.org/wiki/Multiplication_algorithm


Interesting... what are the chances something like this gets implemented in silicon and actually speeds up computation, or is this purely of theoretical interest?


If the constants aren't too big, this might find use in signal processing.


The cutoff is massive. Like billions of digits.


For actual silicon, this does not seem like a relevant result. I think there are already O(log n)-time multiplier circuits out there.

Edit: A typical imul will probably be no more than O(n), just to add something less speculative.


Big-O notation is asymptotic notation, so it is meaningless to describe an imul as being O(n). Given that an imul does a 64x64-bit multiplication, it is almost a tautology to say it can be done in a constant times 64 ops/cycles.


That was not what I was trying to say, wrong as I may still be.

I meant to talk about nxn-bit multiplication. If you scale n then, given the same basic architecture, you will also scale the circuit delay. When the delay scales linearly with the number of bits, I'd call that architecture O(n) in time. To me that seems to make intuitive sense, even though I might have it wrong. I used the term imul merely as shorthand for integer multiplication; I was not alluding to any specific architecture or width, and there are plenty of CPU architectures out there using that mnemonic.


> I think there are already time O(log n) multiplier circuits out there.

This can only be magic! It's not possible!


There's always a time/space tradeoff involved, see for example the Wallace tree multiplier [1], which is indeed O(log n) in time.

[1] https://en.wikipedia.org/wiki/Wallace_tree
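
To see where the O(log n) comes from: each Wallace/Dadda stage uses 3:2 compressors (full adders) to turn groups of three partial-product rows into two, so the row count shrinks geometrically. A rough counting sketch (a simplified model that ignores the final carry-propagate adder):

    def wallace_stages(n):
        # n partial-product rows; each stage maps groups of 3 rows to 2.
        rows, stages = n, 0
        while rows > 2:
            rows = 2 * (rows // 3) + rows % 3
            stages += 1
        return stages

    for n in (8, 16, 32, 64, 128):
        print(n, wallace_stages(n))   # grows like log_{3/2}(n): 4, 6, 8, 10, 11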



