> One unanswered question remains – why didn’t this code break right away, even when compiled with VS2003?
I suspect VS2003 didn't enable SSE by default and was compiling the code using the x87 FPU, in which case all intermediate results were kept in 80-bit extended precision, with only a final round to float at the end.
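To make that concrete, here is a small standalone snippet (not taken from the original code; the value 16777216.0f is just 2^24, chosen for illustration) whose result can differ depending on whether each addition is rounded back to float or the intermediate is kept in a wider x87 register:

```cpp
#include <cstdio>

int main()
{
    // volatile keeps the compiler from constant-folding the whole expression,
    // which would hide any codegen difference.
    volatile float a = 16777216.0f;   // 2^24: adding 1.0f no longer changes a float

    // Per-operation float rounding (typical SSE codegen):
    //   (2^24 + 1) rounds back to 2^24, and so does the second addition -> 16777216
    // 80-bit x87 intermediate kept between the additions:
    //   2^24 + 1 + 1 = 16777218, which *is* representable as a float -> 16777218
    float b = (a + 1.0f) + 1.0f;

    std::printf("%.1f\n", b);
    return 0;
}
```

Whether it prints 16777216.0 or 16777218.0 depends on the compiler, the /fp mode and the optimization settings (try e.g. `cl /O2 /fp:precise` versus `cl /O2 /fp:fast`), so treat it as a way to probe the codegen rather than a guaranteed outcome.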
While it is not mentioned in the article, I am aware of /arch:IA32 (don't use SSE instructions) and I have tried it - no luck.
On the other hand, using /fp:fast with VS2017 did give me results similar to what VS2003/VS2010 produce - so if I were to guess, VS2003 generates x87 math in a manner similar to fast math, and it is precise math that breaks it.
Then again, it may not be strictly due to /fp:precise - I tried compiling with VS2010 too, and even though it also used /fp:precise, the bug did not manifest itself.
Perhaps comparing the generated assembly would have given a definitive answer, but since the cause is already obvious from the source, I don't think it is worth the effort.
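As an aside - this is a general technique, not something taken from the original code - the dependence on intermediate precision can be removed by forcing a round to float at each step. Under /fp:precise the compiler is supposed to round to source precision at assignments and casts (under /fp:fast this is not guaranteed), so a sketch like this should behave the same under x87 and SSE codegen:

```cpp
#include <cstdio>

int main()
{
    volatile float a = 16777216.0f;   // 2^24, same value as in the earlier snippet

    // Assigning each intermediate to a named float (or casting to float) forces a
    // round to 32 bits at that point, so the final value no longer depends on
    // whether the intermediate lived in an 80-bit x87 register or an SSE register.
    float t = a + 1.0f;   // rounds to 16777216.0f
    float b = t + 1.0f;   // still 16777216.0f

    std::printf("%.1f\n", b);
    return 0;
}
```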