Any chance you can share more details on your measurement setup and eval protocols? You're likely seeing some config snafus, which we're trying to track down.
I can't share the eval itself, but it's pretty simple: it asks a question about some data, and the answer is restricted to yes/no (the prompt asks for that, and the answer is read off the output logits). It's called with temperature 0 and a single output token, so sampling shouldn't be an issue.
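For reference, here is a minimal sketch of what this kind of logit-restricted yes/no scoring could look like. It is not the actual eval code: the function names, the token ids, and the tie-breaking rule are all assumptions made up for illustration, and the logits are just a plain list of floats standing in for the model's scores over the vocabulary at the first output position.

```python
import math

def pick_yes_no(logits, yes_id, no_id):
    """Compare only the two candidate logits and return 'yes' or 'no'.

    Every other token is ignored, so the result is deterministic and
    independent of sampling. Ties go to 'yes' (an arbitrary assumption).
    """
    return "yes" if logits[yes_id] >= logits[no_id] else "no"

def yes_probability(logits, yes_id, no_id):
    """Softmax restricted to the two candidate tokens."""
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits[yes_id], logits[no_id])
    e_yes = math.exp(logits[yes_id] - m)
    e_no = math.exp(logits[no_id] - m)
    return e_yes / (e_yes + e_no)

# Hypothetical 4-token vocabulary: id 1 = "yes", id 2 = "no".
example_logits = [0.1, 2.0, 1.0, 5.0]
answer = pick_yes_no(example_logits, yes_id=1, no_id=2)
p_yes = yes_probability(example_logits, yes_id=1, no_id=2)
```

In the example, token 3 has the highest raw logit, but the answer is still decided between "yes" and "no" only. That is the point of restricting to the two logits: a plain greedy decode could emit some other token, while this comparison cannot.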