I suspect that Windows still has a subtle FP state restoration bug. We do large-scale validation of floating-point data and occasionally get ever-so-subtly different results.
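If anyone wants to chase something like this themselves, here's a minimal sketch of the kind of check that helps, assuming the Windows CRT (`_controlfp_s` with a zero mask is a read-only query):

```c
#include <float.h>
#include <stdio.h>

/* Sketch: record the FP control word once at startup, then re-check it
   at points where results start disagreeing. If the word has drifted,
   something (kernel restore path, driver, injected DLL) changed it. */
static unsigned int g_fpcw_baseline;

void fpcw_record_baseline(void)
{
    _controlfp_s(&g_fpcw_baseline, 0, 0);  /* mask 0 = just read it */
}

void fpcw_check(const char *where)
{
    unsigned int now;
    _controlfp_s(&now, 0, 0);
    if (now != g_fpcw_baseline)
        fprintf(stderr, "FP control word drifted at %s: %#x -> %#x\n",
                where, g_fpcw_baseline, now);
}
```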
I would be interested to hear the use cases for large-scale validation of floating-point data. I used to work with processors that occasionally corrupted operations due to hardware manufacturing defects, and these kinds of problems are exceptionally hard to debug, so I'm curious what techniques are used.
In our case, we built harnesses that ran enormous numbers of semi-random programs on the accelerator and compared the results to reliable results computed offline. About 1 in 1000 chips would reproducibly fail certain operations. Identifying this helped solve problems many of our researchers reported on specific accelerator clusters: they would get a NaN in their gradients, which would kill training, and it was almost always explainable by a single processor (out of ~thousands) occasionally corrupting a float.
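For flavour, the core of such a harness is tiny. This is a sketch rather than our actual code; `device_fma` is a hypothetical stand-in for however you dispatch a single op to the chip, and the host's `fma` serves as the trusted reference:

```c
#include <inttypes.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical entry point: run one fused multiply-add on the device. */
extern double device_fma(double a, double b, double c);

/* Reinterpret a double's bit pattern for exact comparison. */
static uint64_t bits(double x) { uint64_t u; memcpy(&u, &x, sizeof u); return u; }

int main(void)
{
    srand(12345);  /* fixed seed: failures reproduce run-to-run */
    for (long i = 0; i < 10000000; i++) {
        double a = (double)rand() / RAND_MAX;
        double b = (double)rand() / RAND_MAX;
        double c = (double)rand() / RAND_MAX;
        double ref = fma(a, b, c);         /* trusted host result */
        double got = device_fma(a, b, c);  /* same op on the accelerator */
        if (bits(ref) != bits(got))        /* bit-exact or it's flagged */
            printf("iter %ld: host %016" PRIx64 " device %016" PRIx64 "\n",
                   i, bits(ref), bits(got));
    }
    return 0;
}
```

A fixed seed matters more than it looks: it's what lets you take a failure from a fleet-wide sweep and replay it on one suspect chip.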
This is a known issue with bad drivers: they can mess with user-mode FP flags. For a while (IIRC) Cisco's VPN software was corrupting FP state and causing Firefox to hang/crash, for example.
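If you suspect a driver or library is doing this, one way to narrow it down is to bracket the suspect call and see whether the FP environment changed underneath you. A rough sketch, where `suspect` is whatever entry point you're testing:

```c
#include <fenv.h>
#include <stdio.h>

/* Sketch: snapshot the rounding mode before and after a suspect call
   and flag any change. The same idea extends to exception masks via
   fegetenv()/fenv_t if you want a fuller comparison. */
static void check_fp_clobber(void (*suspect)(void), const char *name)
{
    int round_before = fegetround();
    suspect();
    if (fegetround() != round_before)
        fprintf(stderr, "%s changed the FP rounding mode\n", name);
}
```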
At a previous company we had to run a DLL in our web servers (provided by the payment processing company) for PCI-compliance reasons, and we later discovered it was messing with our FP flags, with the result that our serialization code was producing invalid floats. That was a fun one.
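A toy illustration of how a clobbered flag surfaces downstream (not what that particular DLL did, just the general mechanism): flip the rounding mode, and identical expressions start producing different bits:

```c
#include <fenv.h>
#include <stdio.h>

/* May need #pragma STDC FENV_ACCESS ON or -frounding-math depending on
   the compiler; the volatiles force the divisions to happen at runtime. */
int main(void)
{
    volatile double x = 1.0, y = 3.0;

    double before = x / y;   /* default round-to-nearest */
    fesetround(FE_UPWARD);   /* what a misbehaving library might leave set */
    double after = x / y;    /* same division, last bit now differs */
    fesetround(FE_TONEAREST);

    printf("%.17g\n%.17g\n", before, after);
    return 0;
}
```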
Given that you say "subtly", have you ruled out rounding/precision errors? I wouldn't be surprised if some processors played fast and loose with the number of significant bits they really honour.
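One way to make that call in practice is to compare in ULPs rather than bit-for-bit, so a 1-2 ulp rounding discrepancy stands out from genuine corruption. A sketch, assuming finite doubles (NaN and overflow handling omitted):

```c
#include <stdint.h>
#include <string.h>

/* Measure the gap between two doubles in units in the last place, by
   folding the sign-magnitude bit patterns onto one monotonic integer
   line and taking the difference. */
static int64_t ulp_distance(double a, double b)
{
    int64_t ia, ib;
    memcpy(&ia, &a, sizeof ia);
    memcpy(&ib, &b, sizeof ib);
    if (ia < 0) ia = INT64_MIN - ia;  /* negative range: reverse order */
    if (ib < 0) ib = INT64_MIN - ib;
    return ia > ib ? ia - ib : ib - ia;
}
```

A threshold of a few ULPs then separates "the hardware rounded differently" from "the hardware returned garbage".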