Yes, the paper is general and covers the case for arbitrary bit size. 64-bit would require a 128-bit approximate inverse in general and the multiplication during computing the mod would extend that out another 64-bits. So would require three registers and two more multiplications on x64 I think. Socidont think it gains you anything in the general case. Bit for some divisors it might beat the traditional Granlund-Montgomery-Warren method when you could use a smaller approximate inverse.