The piece I've been missing in this whole debate: why isn't the existing RNG simply frozen in its current behavior, bugs and all, with a new /dev/sane_random created alongside it?
Stuff that depends on the existing bugs in order to function can keep functioning. Everything else can move to something sane.
Because /dev/sane_random or sane_random(2) would have better security properties than what we have now, and you want the whole gamut of Linux software to benefit from that. Just as importantly, you don't want /dev/urandom and getrandom(2) to fall into disrepair as attention shifts to the new interface, for the same reason that you care very much about use-after-free vulnerabilities in crappy old kernel facilities most people don't build new stuff on anymore.
Also, just, it seems unlikely that the kernel project is going to agree to run two entire unrelated CSPRNG subsystems at the same time! The current LRNG is kind of an incoherent mess root and branch; it's not just a matter of slapping a better character device and system call on top of it.
Since you answered their question, I'm hoping you can answer mine.
How is there any overlap between the devices that can't have something clever figured out for them and the devices that could possibly see an update to their kernel code?
On the kernel side, something clever almost certainly will be figured out eventually, just not in time for the 5.18 release (or, realistically, the following release either). On the user-space side, it doesn't matter whether an absolutely trivial clever fix is available; you can't just break userspace without an extremely good reason.
Note: an extremely good reason for breaking userspace is something along the lines of "/dev/random has been found to be insecure, causing mass security mayhem", not "man, I'd really like to ignore the 0.01% of users this would cause an issue for so I can get my patch in faster".
From what I hear, Windows APIs share a similar issue with /dev/random (apps relying on bugs in the APIs). Maybe the problem is a lack of forward thinking about fixing issues.
For a start, there's a long tail of migrating all useful software to /dev/sane_random. Moreover, there's a risk new software accidentally uses the old broken /dev/random.
Besides, /dev/sane_random essentially already exists; it's just a system call named getrandom().
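For anyone who hasn't used it, here's a minimal sketch of what that looks like in practice; the buffer size and error handling are just illustrative, and it assumes a glibc new enough (2.25+) to provide the getrandom() wrapper:

    /*
     * Minimal sketch (my own, not from the thread): filling a buffer via
     * getrandom(2), which already gives you the "/dev/sane_random"
     * semantics without a device node.
     */
    #include <sys/random.h>
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    int main(void)
    {
        unsigned char key[32];

        /* flags == 0: blocks only until the pool is initialized at boot,
         * then never again; requests of <= 256 bytes are returned in full. */
        ssize_t n = getrandom(key, sizeof(key), 0);
        if (n < 0) {
            fprintf(stderr, "getrandom: %s\n", strerror(errno));
            return 1;
        }
        printf("got %zd random bytes\n", n);
        return 0;
    }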
It's not that simple; Donenfeld wants to replace the whole LRNG with a new engine that uses simpler, more modern, and more secure/easier-to-analyze cryptography, and one of the roadblocks there is that swapping out the engine risks breaking bugs that userland relies on.
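To make that concrete with a sketch of my own (illustrative, not something from the patch discussion): the dependencies aren't on the byte values, which are indeed random, but on observable behavior such as whether a read blocks, returns short, or fails with EAGAIN. A program written against the old /dev/random semantics might do something like:

    /*
     * Illustrative only: the kind of behavior userland can observe and
     * come to depend on. Historically a non-blocking read of /dev/random
     * could fail with EAGAIN when the kernel thought it was "out of
     * entropy"; changing when that happens is a userland-visible change
     * even though the bytes themselves are random.
     */
    #include <fcntl.h>
    #include <unistd.h>
    #include <errno.h>
    #include <stdio.h>

    int main(void)
    {
        int fd = open("/dev/random", O_RDONLY | O_NONBLOCK);
        if (fd < 0)
            return 1;

        unsigned char buf[64];
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n < 0 && errno == EAGAIN) {
            /* A program built around the old semantics might sleep and
             * retry, or fall back to /dev/urandom, right here. */
            puts("would block; waiting for 'entropy'");
        } else {
            printf("read %zd bytes\n", n);
        }

        close(fd);
        return 0;
    }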
What kind of bugs are visible to userland? I would have thought a random number device would be the least likely thing to have upgrade problems like that: applications should not be able to assume anything at all since the output is literally random...
Shouldn't something like a kernel parameter make it easy to preserve the current behavior for the specific applications that relied upon it, at least for a major rev or two?
Stuff that depends on the existing bugs in order to function can keep functioning. Everything else can move to something sane.
Obviously I'm missing something here.