1. I guess the most time-consuming part is multiplication, right? What kind of FFT do you use? Schönhage-Strassen, multi-prime NTT, ...? Is it implemented via floating-point numbers or integers?
2. Not sure if you've encountered this, but do you have any advice for a small mulmod (multiplication reduced by a prime modulus)? By small I mean machine-word size (i.e. preferably 64 bits).
3. For a larger modulus, what do you use? Is it worth precomputing the inverse by, say, Newton iteration, or is it faster to use asymptotically slower algorithms? Do you use Montgomery representation?
4. Does the code use any kind of GCD? What algorithm did you choose?
5. Now this is a bit of a broad question, but could you perhaps compare the traditional algorithms implemented sequentially (e.g. GMP) with algorithms suited to GPUs? I mean, does it make sense to use, say, a quadratic algorithm amenable to parallel execution rather than an asymptotically faster (but sequential) algorithm?
3. Answered in point 1 below: we use the IBDWT and get the modular reduction for free through the circular convolution. This works nicely for a Mersenne modulus.
4. GCD is not used in PRP, but it is used in P-1 (Pollard's P-1 algorithm). We use GMP's GCD on the CPU (it's a very rare operation, and GMP on the CPU is fast enough). I understand the GCD as implemented in GMP is subquadratic (a half-GCD, divide-and-conquer scheme), i.e. quasi-linear, which is good.
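If it helps, the CPU-side call is just GMP's mpz_gcd. A toy sketch of the P-1 final step using GMP's C API (not the actual GpuOwl code; the real x is a multi-million-bit residue copied back from the GPU, and the placeholder value here is mine):

    #include <gmp.h>
    #include <stdio.h>

    int main(void) {
        mpz_t m, x, g;
        mpz_inits(m, x, g, NULL);
        unsigned long p = 31;              // toy exponent; the real p is ~1.4e8
        mpz_ui_pow_ui(m, 2, p);
        mpz_sub_ui(m, m, 1);               // M = 2^p - 1
        mpz_set_str(x, "1234567890", 10);  // placeholder for the stage-1 residue
        mpz_sub_ui(x, x, 1);
        mpz_gcd(g, x, m);                  // any factor found shows up here
        gmp_printf("gcd = %Zd\n", g);
        mpz_clears(m, x, g, NULL);
        return 0;
    }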
5. At our sizes it does not make sense to use a quadratic algorithm instead of an N log N one; we absolutely need the N log N provided by convolution/FFT.
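To put rough numbers on that: at N ~= 7.5M words, N^2 is about 6*10^13 word-products per squaring, while N*log2(N) is about 1.8*10^8; that's a factor of roughly 300,000, far more than any realistic parallelism advantage a simpler quadratic method could claw back.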
2. This was discussed at some length over the years on mersenneforum.org [1]. There is a lot of wisdom stored there (though it can be hard to find) and many smart & helpful folks, so feel free to ask there; the usual CPU-side recipes are sketched right after the list below. This is an operation of interest because it's:
- involved in TF (Trial Factoring), including on GPUs
- involved in NTT transforms ("integer FFTs")
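The recipes themselves are standard stuff, not GpuOwl code. This sketch assumes a GCC/Clang-style compiler with the unsigned __int128 extension, and an odd modulus below 2^63 for the Montgomery variant; the helper names are mine:

    #include <cstdint>

    // The easy 64-bit mulmod on CPUs: one 64x64->128 multiply plus one
    // 128/64 division. GPUs have no 128-bit divide, which is exactly why
    // Montgomery/Barrett reductions matter there.
    static inline uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
        return (uint64_t)((unsigned __int128)a * b % m);
    }

    // Newton iteration for -m^-1 mod 2^64 (m odd): each step doubles the
    // number of correct low bits, and x = m is already correct to 3 bits.
    static inline uint64_t neg_inv64(uint64_t m) {
        uint64_t x = m;
        for (int i = 0; i < 5; ++i) x *= 2 - m * x;
        return ~x + 1;                                 // -x mod 2^64
    }

    // Montgomery multiplication, R = 2^64: inputs and output are in
    // Montgomery form (a*R mod m); avoids the division entirely.
    static inline uint64_t mont_mul(uint64_t a, uint64_t b,
                                    uint64_t m, uint64_t minv) {
        unsigned __int128 t = (unsigned __int128)a * b;
        uint64_t u = (uint64_t)t * minv;               // zeroes the low 64 bits
        uint64_t r = (uint64_t)((t + (unsigned __int128)u * m) >> 64);
        return r >= m ? r - m : r;
    }

The Newton iteration above is the same precompute-the-inverse trick your question 3 mentions, just at word size.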
1. Yes, the core of the algorithm is the modular squaring, and a squaring is essentially a multiplication. In general, fast multiplication is implemented via convolution, via FFTs, which gives the multiplication an N log(N) time complexity.
What we do is modular squaring iterations:
x := x^2 mod M,
where M == 2^p - 1, i.e. M is the Mersenne number that we test.
Realize that working modulo 2^p - 1 means that 2^p == 1, which corresponds to a circular convolution of size p bits. We use the "Irrational Base Discrete Weighted Transform" (IBDWT) [1], introduced by Crandall/Fagin, to turn this into a convolution over a convenient number N of "words", where each word carries about 18 bits, so N ~= p/18. For example, our prime of interest M52 was tested with an FFT of size 7.5M == 1024 * 15 * 512.
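To make the "2^p == 1" point concrete, here is the identity in scalar form for the toy Mersenne modulus 2^61 - 1 (my illustration, not GpuOwl code; in GpuOwl the wraparound never appears explicitly, it is absorbed into the circular convolution):

    #include <cstdint>

    // Mod M = 2^61 - 1: since 2^61 == 1, the bits above position 61
    // simply wrap around and get added back at the bottom.
    static const uint64_t M61 = ((uint64_t)1 << 61) - 1;

    static inline uint64_t mod_m61(unsigned __int128 x) {  // any x < 2^122
        uint64_t r = (uint64_t)(x & M61) + (uint64_t)(x >> 61);  // first fold
        r = (r & M61) + (r >> 61);                               // second fold
        return r >= M61 ? r - M61 : r;
    }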
The FFT is a double-precision (FP64) floating-point FFT. Depending on the FFT size, we get about 18 bits per FFT "word", where one "word" corresponds to one FP64 value.
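As a sanity check on those numbers: 7.5M is 7,864,320 FP64 words, and M52's exponent is p = 136,279,841, which works out to about 17.3 bits per word; the ceiling near 18 bits comes from FP64 roundoff accumulating across the length-N convolution sums.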
Some tricks involved up to this point: one FFT-size halving, plus the modular reduction for free, come from the IBDWT; another FFT-size halving comes from packing the real input/output values into complex numbers for the FFT.
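The second halving is the classic real-to-complex packing. A sketch (my code, with fftN standing in for any length-N complex FFT):

    #include <cmath>
    #include <complex>
    #include <utility>
    #include <vector>

    using cd = std::complex<double>;

    // Length-2N real FFT via one length-N complex FFT: pack adjacent
    // reals into complex values, transform, then untangle the spectrum.
    template <class FFT>   // FFT: callable, vector<cd> -> vector<cd>
    std::vector<cd> real_fft(const std::vector<double>& x, FFT&& fftN) {
        const int N = (int)x.size() / 2;
        std::vector<cd> z(N);
        for (int n = 0; n < N; ++n) z[n] = cd(x[2*n], x[2*n+1]);
        std::vector<cd> Z = fftN(std::move(z));
        std::vector<cd> X(N + 1);   // bins 0..N; the rest is the mirror image
        for (int k = 0; k <= N; ++k) {
            cd Zk  = Z[k % N];
            cd Znk = std::conj(Z[(N - k) % N]);
            cd even = 0.5 * (Zk + Znk);          // spectrum of x[0],x[2],...
            cd odd  = cd(0, -0.5) * (Zk - Znk);  // spectrum of x[1],x[3],...
            X[k] = even + std::polar(1.0, -M_PI * k / N) * odd;
        }
        return X;
    }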
The FFT implementation that we found appropriate for GPUs is the "matrix FFT", which splits an FFT of size N = A*B into sub-FFTs of size A, one pointwise multiplication by about A*B twiddle factors, and sub-FFTs of size B. In practice we split the FFT into three dimensions, e.g. for M52 we used:
7.5M == 1024 * 15 * 512.
We implement one FFT of size 1024 or 512 per workgroup. These are usually radix-4 FFTs, with transpositions done through LDS (Local Data Share, the per-workgroup local memory in OpenCL).
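Here is the decomposition in toy scalar form (my sketch: a naive O(n^2) DFT stands in for the radix-4 workgroup sub-FFTs, and everything lives in one address space instead of LDS/VRAM):

    #include <cmath>
    #include <complex>
    #include <vector>

    using cd = std::complex<double>;

    // Naive DFT of n elements taken at the given stride.
    static std::vector<cd> dft(const cd* in, int n, int stride) {
        std::vector<cd> out(n);
        for (int k = 0; k < n; ++k) {
            cd s = 0;
            for (int j = 0; j < n; ++j)
                s += in[j * stride] * std::polar(1.0, -2 * M_PI * j * k / n);
            out[k] = s;
        }
        return out;
    }

    // Matrix FFT of size N = A*B over input x[a*B + b]:
    //   1) length-A FFT down each of the B columns
    //   2) twiddle: element (c, b) times w_N^(c*b)
    //   3) length-B FFT along each of the A rows
    //   4) transposed output: row c, element d lands at X[c + A*d]
    static std::vector<cd> matrix_fft(std::vector<cd> x, int A, int B) {
        const int N = A * B;
        for (int b = 0; b < B; ++b) {
            std::vector<cd> col = dft(&x[b], A, B);                    // step 1
            for (int c = 0; c < A; ++c)
                x[c * B + b] = col[c]
                             * std::polar(1.0, -2 * M_PI * c * b / N); // step 2
        }
        std::vector<cd> X(N);
        for (int c = 0; c < A; ++c) {
            std::vector<cd> row = dft(&x[c * B], B, 1);                // step 3
            for (int d = 0; d < B; ++d) X[c + A * d] = row[d];         // step 4
        }
        return X;
    }

On the GPU, steps 1 and 3 are the per-workgroup FFTs, and the transposes between them are what the LDS is for.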
After the inverse FFT we also need to do carry propagation, which is what properly turns the convolution into a multi-word multiplication.
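A toy fixed-base version of that step (the helper name is mine; the real code uses the IBDWT's variable word sizes and balanced digits, and does the carries inside the fused kernels):

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // After the inverse FFT each position holds a real value that is an
    // integer up to roundoff, usually too big for its ~18-bit digit.
    // Rounding plus a carry sweep turns the raw convolution output into
    // proper multi-word digits, i.e. into the actual product.
    static void carry_propagate(const std::vector<double>& conv,
                                std::vector<int64_t>& digits,
                                int bitsPerWord) {
        const int64_t base = (int64_t)1 << bitsPerWord;
        digits.resize(conv.size());
        int64_t carry = 0;
        for (size_t i = 0; i < conv.size(); ++i) {
            int64_t v = std::llround(conv[i]) + carry;  // strip FFT noise
            digits[i] = v & (base - 1);                 // low bits stay here
            carry = v >> bitsPerWord;                   // floor-div by base
        }
        digits[0] += carry;  // mod 2^p - 1 the top carry wraps to word 0
                             // (a second sweep handles any new overflow)
    }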
For performance we merge a few logical kernels that are invoked in succession into a single big kernel, where possible. The main advantage of doing so is that the data no longer needs a round-trip through "global memory" (VRAM) but stays local to the workgroup, which is a large gain.
So, to recap:
- multiplication via convolution
- convolution via FP64 FFT, achieving about 18 bits per FP64 word
- modular reduction for free through IBDWT