This is very similar to Montgomery reduction [1]. Both do a multiply taking low ...

This is very similar to Montgomery reduction [1]. Both do a multiply taking low bits followed by a multiply taking the high bits to reduce a number.

Montgomery reduction requires an augmented number system with conversions to and from. The article's approach works on numbers directly, but if I understand it correctly the multiplications are twice the width.

Are these two methods algebraically related? Would it be possible to get fastmod working without the double width?

[1]: https://en.wikipedia.org/wiki/Montgomery_modular_multiplicat...