You can't be bug-compatible with things that violate the processor's protection protocol. Access to certain bits of the EFLAGS register is unavailable to unprivileged code. In fact, just because Microsoft allowed you to raid and pillage for a few years doesn't mean it's the norm.
You cannot meaningfully virtualize access to EFLAGS:IF. You can either emulate (or JIT) almost the whole CPU or ignore the issue. And anyway, turning off interrupts is something that essentially makes no sense for a user process, so it is better to just disallow it (which is what almost everything except non-NT Windows does).
It was usually used as an ultra-critical section. It would have been fine to simulate it as a scheduler freeze for that process, preserving the meaning without getting hung up on the hardware implementation.
But instead, Intel decided to try to support actually messing with the interrupt enable state, resulting in years of highly inefficient "solutions", e.g. trapping and simulating. Sigh.
A big-hammer approach is to set thread affinity for your process to one hyperthread/processor. But that loses the opportunity for lovely parallelism.
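For what it's worth, the big hammer looks roughly like this on Linux with glibc. This is only a sketch under those assumptions: `pin_to_first_allowed_cpu` is a made-up helper name, and `sched_setaffinity()` is Linux-specific (other systems have their own calls). Note that it only restricts where the thread runs; it does not stop other processes or interrupt handlers from preempting it, so it is nowhere near CLI.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc needs this for the CPU_* macros and sched_setaffinity() */
#endif
#include <sched.h>

/* Pin the calling thread to the first CPU it is currently allowed to run on.
 * Returns the chosen CPU number, or -1 on failure.
 * (Hypothetical helper name, not a system API.) */
int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            /* pid 0 means "the calling thread"; threads created
             * afterwards inherit the mask, existing ones do not. */
            if (sched_setaffinity(0, sizeof one, &one) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;
}
```

Call it early, before spawning any threads, so they all inherit the single-CPU mask.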
A finer-grained approach is to have a flag bit that prevents preemption, perhaps even just preemption by threads of the same process. This is weaker than CLI because it doesn't prevent I/O callbacks etc from preempting; ideally those would be suspended as well for the process.
This assumes a non-privileged flag word, i.e. user-mode code owns the "process flags", not the kernel.
My favorite solution is a "process signal register" in hardware. It's a wide register full of test-and-set bits, shared by the threads of a process. The bits can be used to implement a critical section, semaphore, event, even waiting on a timer. All without a trip through the kernel: essentially zero-latency kernel primitives.
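To make the idea concrete, here is a software stand-in using C11 atomics. It is not the hardware being proposed, just a sketch of the semantics: one 64-bit word of test-and-set bits shared by the threads of a process. The `psr_*` names and the 64-bit width are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the proposed register: 64 test-and-set bits. */
typedef _Atomic uint64_t psr_t;

/* Atomically set `bit`; return true if it was already set. */
bool psr_test_and_set(psr_t *r, unsigned bit)
{
    assert(bit < 64);
    uint64_t mask = UINT64_C(1) << bit;
    return (atomic_fetch_or_explicit(r, mask, memory_order_acquire) & mask) != 0;
}

/* Atomically clear `bit`. */
void psr_clear(psr_t *r, unsigned bit)
{
    assert(bit < 64);
    atomic_fetch_and_explicit(r, ~(UINT64_C(1) << bit), memory_order_release);
}

/* Critical section on one bit: spin until we are the thread that set it. */
void psr_lock(psr_t *r, unsigned bit)
{
    while (psr_test_and_set(r, bit))
        ;  /* busy-wait; sleeping instead of spinning would still need the kernel */
}

void psr_unlock(psr_t *r, unsigned bit)
{
    psr_clear(r, bit);
}
```

The spin loop is where the software version falls short of the hardware idea: a waiter burns CPU here, whereas the proposed register would presumably let a thread wait on a bit without either spinning or entering the kernel.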
Wouldn't an unprivileged EFLAGS-lookalike cause problems of the CLI-HLT persuasion?
And "process signal registers", while sounding attractive, aren't really a feasible alternative, given that the number of processes running even on a uniprocessor is overwhelming, to say the least. Plus, if they're beyond CPU control, privilege issues arise again.
In short, yes, there are many alternatives, but the current model works, and not just for x86. And you know what engineers say..
I would say the answer is not writing code that requires this. The only valid reason for wrapping something in CLI/STI is when you want to directly control some timing-critical hardware, which is something that simply does not belong in userspace. I would say that in most cases such code does not even belong in the kernel, but in some interface controller of said hardware. Other cases are handled pretty well by the normal APIs the kernel presents (mutexes, signal flags...).
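As an illustration of the "normal APIs" route, here is a minimal sketch with a POSIX mutex guarding a shared counter, which is the kind of thing old code would have wrapped in CLI/STI. It assumes a pthreads platform; `run_counter_demo` and the 16-thread cap are invented for the example.

```c
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

static void *worker(void *arg)
{
    long iters = *(long *)arg;
    for (long i = 0; i < iters; i++) {
        pthread_mutex_lock(&counter_lock);    /* where old code did CLI */
        counter++;
        pthread_mutex_unlock(&counter_lock);  /* ... and STI */
    }
    return NULL;
}

/* Hypothetical demo helper: run `nthreads` workers doing `iters` increments
 * each and return the final counter, or -1 if the threads could not start. */
long run_counter_demo(int nthreads, long iters)
{
    pthread_t tid[16];
    if (nthreads < 1 || nthreads > 16)
        return -1;

    counter = 0;
    int started = 0;
    for (; started < nthreads; started++)
        if (pthread_create(&tid[started], NULL, worker, &iters) != 0)
            break;

    /* Join whatever did start, so reading `counter` below is race-free. */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    return started == nthreads ? counter : -1;
}
```

Unlike CLI, the mutex only excludes threads that agree to take the same lock, which is all a user process should need.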
Wait, they topped Windows 95 and Windows ME?
Is that even possible?