It's not just a detail if you are looking for hardware acceleration on floating point operations on quads (float128 type). AFAIK, nobody has a hardware quad FPU, but there are certainly applications for one. I know that things like big integer arithmetic could greatly be accelerated by a 128-bit computer.
I don't think big integer could be greatly accelerated by 128-bit computer. If a 128-bit add takes two cycles of latency, or it causes the frequency to drop (since you need to drive a longer critical path in the ALU in terms of gate delay, which means you need longer cycle times), then you're going to lose a lot in any code that isn't directly involved in the computation, such as loading memory values.
Furthermore, the upside is only at best 2x. It's likely to be worse, because you're still going to be throttled by waiting for memory loads and stores. Knowing that we have 2 64-bit adders available to us to use each cycle, we can still do 128-bit additions at full throughput, although it requires slightly more latency for the carry propagation.
Hardware quad-precision floating point is a more useful scalar 128-bit value to support.
I agree hardware quad is more useful. The big int problem I was talking about would benefit from hardware quad but not hardware 128-bit ALU. The big int problem I have a bit of knowledge of is squaring for the Lucas-Lehmer algorithm to find huge primes (Mersenne primes). The best algorithm in this space is the IBDWT (irrational base discrete weighted transform). You perform an autoconvolution (compute the FFT, square each term in the frequency domain, and then take the IFFT). You want the FFT lengths to be as short as possible, since FFT is an O(N log N) algorithm. Quads would let you use shorter FFTs since you have more bits available for each element.
Even though it is a big int problem, floating point is used. Their are integer FFT algoritms that are usable (NTTs), but they are much slower than floating point FFTs on modern CPUs.