Hacker News

What for? AVX-512 instructions can work on, well, 512 bits of data at a time, while 64 bits' worth of address lines offer way more RAM than is available today. What's your use case?



AVX-512 is not comparable to native 128-bit. AVX/SSE vectors are split into lanes of at most 64 bits. You cannot compute a single 128-bit result, only multiple 64-bit ones in parallel.


That's a microarchitectural detail; the point is you can view modern CPUs as more than 64 bits wide, at least data-bus-wise.


By that logic the Pentium 3 was a 128-bit CPU. Vector width isn't how these things are measured, because it's much harder to make a 128-bit ALU than to add two 64-bit ones.


It's not just a detail if you are looking for hardware acceleration of floating-point operations on quads (the float128 type). AFAIK, nobody has a hardware quad FPU, but there are certainly applications for one. I know that things like big-integer arithmetic could be greatly accelerated by a 128-bit computer.


I don't think big-integer arithmetic could be greatly accelerated by a 128-bit computer. If a 128-bit add takes two cycles of latency, or it causes the frequency to drop (since you need to drive a longer critical path in the ALU in terms of gate delay, which means you need longer cycle times), then you're going to lose a lot in any code that isn't directly involved in the computation, such as loading memory values.

Furthermore, the upside is at best 2x. It's likely to be worse, because you're still going to be throttled by waiting for memory loads and stores. Knowing that we have two 64-bit adders available to us each cycle, we can still do 128-bit additions at full throughput, although it requires slightly more latency for the carry propagation.
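To make the two-adder point concrete, here's a sketch in Python that simulates the carry propagation, with 64-bit limbs emulated by masking (the helper name `add128` is made up for illustration):

```python
# Sketch: a 128-bit add built from two 64-bit adds plus carry propagation.
# Python ints are arbitrary precision, so we mask to 64 bits to mimic hardware limbs.
MASK64 = (1 << 64) - 1

def add128(a_lo, a_hi, b_lo, b_hi):
    """Add two 128-bit values, each given as a (lo, hi) pair of 64-bit limbs."""
    lo = a_lo + b_lo
    carry = lo >> 64              # carry out of the low limb
    lo &= MASK64
    hi = (a_hi + b_hi + carry) & MASK64
    return lo, hi

# usage: (2^64 - 1) + 1 carries into the high limb
lo, hi = add128(MASK64, 0, 1, 0)
print(lo, hi)  # → 0 1
```

In hardware the two limb additions can issue on separate 64-bit adders; only the carry between them is a serial dependency, which is where the extra latency comes from.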

Hardware quad-precision floating point is a more useful scalar 128-bit value to support.


I agree hardware quad is more useful. The big-integer problem I was talking about would benefit from hardware quad but not from a hardware 128-bit ALU. The one I have a bit of knowledge of is squaring for the Lucas-Lehmer test used to find huge primes (Mersenne primes). The best algorithm in this space is the IBDWT (irrational-base discrete weighted transform). You perform an autoconvolution: compute the FFT, square each term in the frequency domain, and then take the IFFT. You want the FFT lengths to be as short as possible, since the FFT is an O(N log N) algorithm. Quads would let you use shorter FFTs, since you would have more bits available for each element.

Even though it is a big-integer problem, floating point is used. There are integer FFT algorithms that are usable (NTTs), but they are much slower than floating-point FFTs on modern CPUs.
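For anyone unfamiliar with the autoconvolution trick, here's a toy sketch in Python using numpy's FFT. It squares a number held as plain base-10 digits; the real IBDWT uses an irrational-base weighting and much larger limbs, and the function name is invented:

```python
# Toy sketch: squaring a big integer via FFT autoconvolution.
# FFT the digits, square pointwise in the frequency domain, inverse FFT,
# then round back to integers and propagate carries.
import numpy as np

def square_via_fft(digits, base=10):
    """Square a number given as a little-endian list of digits."""
    n = 2 * len(digits)                  # enough room for the full product
    f = np.fft.rfft(digits, n)           # forward FFT (zero-padded)
    coeffs = np.fft.irfft(f * f, n)      # pointwise square, then inverse FFT
    result, carry = [], 0
    for c in coeffs:                     # round and propagate carries
        v = int(round(c)) + carry
        result.append(v % base)
        carry = v // base
    while carry:
        result.append(carry % base)
        carry //= base
    while len(result) > 1 and result[-1] == 0:
        result.pop()                     # strip leading zeros
    return result

# usage: 12^2 = 144, little-endian digits
print(square_via_fft([2, 1]))  # → [4, 4, 1]
```

The floating-point rounding step is exactly where extra precision per element would help: with quads, each FFT element could safely hold more bits of the number, so the transform length N shrinks.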


> AFAIK, nobody has a hardware quad FPU

Power9 has had one for a few years.


Today I learned... That's awesome, I'd love to try one out in the cloud sometime and get benchmarks.


No particular use case. I think it's one of those things where, once the machine is first built, all of a sudden a bunch of different use cases are found.


Why do I need to build a 128-bit machine to imagine a use case for it? What you refer to applies to business cases, not use cases. We first had to build broadband internet before video-streaming websites became a viable business model, but surely someone thought of video streaming during the dialup era?



