It's an interesting read, but the synopsis is a glibc bug [1] in 2.27 that causes pthread_cond_signal to sometimes fail to wake waiting threads. This manifests as hangs and deadlocks that seem to happen at random in different application runtimes, with reports from users of .NET Core, Python, and OCaml.
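For anyone who hasn't stared at condvars lately, here is roughly the shape of code that hangs when a signal goes missing, plus one defensive variant. This is a minimal sketch under my own assumptions, not anything taken from the bug report: `struct flag`, `flag_set`, `flag_wait` and `run_flag_demo` are hypothetical names. The waiter uses pthread_cond_timedwait and re-checks its predicate on every timeout, so a lost wakeup costs at most one poll interval instead of blocking forever. It does not fix the underlying bug.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical one-shot flag guarded by a mutex and a condition variable. */
struct flag {
    pthread_mutex_t mu;
    pthread_cond_t cv;
    bool ready;
};

static void flag_init(struct flag *f) {
    pthread_mutex_init(&f->mu, NULL);
    pthread_cond_init(&f->cv, NULL);   /* default attrs: CLOCK_REALTIME */
    f->ready = false;
}

static void flag_set(struct flag *f) {
    pthread_mutex_lock(&f->mu);
    f->ready = true;                   /* update the predicate under the lock */
    pthread_cond_signal(&f->cv);       /* the call whose wakeup can go missing */
    pthread_mutex_unlock(&f->mu);
}

/* Block until f->ready is true, re-checking at least every poll_ms
 * milliseconds even if no wakeup arrives. Returns the number of timeouts. */
static int flag_wait(struct flag *f, long poll_ms) {
    assert(poll_ms > 0);
    int timeouts = 0;
    pthread_mutex_lock(&f->mu);
    while (!f->ready) {                /* predicate loop: also covers spurious wakeups */
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_sec += poll_ms / 1000;
        ts.tv_nsec += (poll_ms % 1000) * 1000000L;
        if (ts.tv_nsec >= 1000000000L) {
            ts.tv_sec += 1;
            ts.tv_nsec -= 1000000000L;
        }
        if (pthread_cond_timedwait(&f->cv, &f->mu, &ts) == ETIMEDOUT)
            timeouts++;
    }
    pthread_mutex_unlock(&f->mu);
    return timeouts;
}

static void *setter(void *p) {
    flag_set(p);
    return NULL;
}

/* Returns 1 if the waiter observed the flag, -1 on setup failure. */
int run_flag_demo(void) {
    struct flag f;
    pthread_t t;
    flag_init(&f);
    if (pthread_create(&t, NULL, setter, &f) != 0)
        return -1;
    flag_wait(&f, 100);
    pthread_join(t, NULL);
    int ok = f.ready ? 1 : 0;
    pthread_cond_destroy(&f.cv);
    pthread_mutex_destroy(&f.mu);
    return ok;
}
```

With a plain pthread_cond_wait in that loop, a missed signal means the waiter never re-checks `ready`, which is the "random hang" people see. The timed version trades some idle wakeups for bounded latency.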
The comment there makes it sound like the patch might be giving up on the stealing optimization entirely? If that's the case, then this might not be the fix to roll out everywhere.
Nail on head. Pthreads is an extremely difficult API to use, owing first to a poor choice of primitives. As a consequence, none of the implementations are particularly good, and their interactions with the rest of the machine/model are ill-defined.
For example: You know how hard it is to close() in a multithreaded program? [1] And that's just one system call, one almost everyone thinks they understand. What hope is there for all the other pthreads code?
For cond/etc: One trick I use is handle passing via epoll [2]. This works much more reliably than the pthreads API, and whilst it's slower in a microbenchmark that spins on the pthread_cond/mutex/etc APIs, real programs don't do that (or don't need to), so it is usually a clear net win.
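To make the handle-passing idea concrete, here is a minimal Linux-only sketch of one way it could look; it is my own guess at the shape, not necessarily what [2] describes. A pipe carries pointer-sized handles and the consumer sleeps in epoll_wait instead of on a condvar. `struct job`, `struct channel`, `channel_send`, `channel_recv` and `run_demo` are hypothetical names, and the sketch assumes a single consumer per channel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <pthread.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical work item; the "handle" is just a pointer to it. */
struct job { int input; int result; };

/* A one-way channel: a pipe plus an epoll instance watching its read end. */
struct channel { int rd, wr, ep; };

static int channel_open(struct channel *c) {
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    c->rd = fds[0];
    c->wr = fds[1];
    c->ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = c->rd };
    if (c->ep < 0 || epoll_ctl(c->ep, EPOLL_CTL_ADD, c->rd, &ev) != 0) {
        if (c->ep >= 0)
            close(c->ep);
        close(c->rd);
        close(c->wr);
        return -1;
    }
    return 0;
}

static void channel_close(struct channel *c) {
    close(c->ep);
    close(c->rd);
    close(c->wr);
}

/* Pass a handle: pipe writes of at most PIPE_BUF bytes are atomic. */
static int channel_send(struct channel *c, struct job *j) {
    assert(sizeof j <= PIPE_BUF);
    return write(c->wr, &j, sizeof j) == (ssize_t)sizeof j ? 0 : -1;
}

/* Sleep in epoll_wait until a handle arrives, then read it. */
static struct job *channel_recv(struct channel *c) {
    struct epoll_event ev;
    struct job *j = NULL;
    int n;
    do {
        n = epoll_wait(c->ep, &ev, 1, -1);
    } while (n < 0 && errno == EINTR);
    if (n != 1 || read(c->rd, &j, sizeof j) != (ssize_t)sizeof j)
        return NULL;
    return j;
}

struct worker_args { struct channel *in, *out; };

/* Receives one job, fills in the result, and hands the same pointer back. */
static void *worker(void *p) {
    struct worker_args *a = p;
    struct job *j = channel_recv(a->in);
    if (j != NULL) {
        j->result = j->input * 2;
        channel_send(a->out, j);
    }
    return NULL;
}

/* Round-trips one job through a worker thread; returns its result or -1. */
int run_demo(int input) {
    struct channel to_worker, from_worker;
    if (channel_open(&to_worker) != 0)
        return -1;
    if (channel_open(&from_worker) != 0) {
        channel_close(&to_worker);
        return -1;
    }
    struct worker_args args = { &to_worker, &from_worker };
    struct job j = { .input = input, .result = 0 };
    pthread_t t;
    int result = -1;
    if (pthread_create(&t, NULL, worker, &args) == 0) {
        if (channel_send(&to_worker, &j) == 0) {
            struct job *done = channel_recv(&from_worker);
            if (done == &j)
                result = done->result;
        } else {
            pthread_cancel(t);   /* worker is parked in epoll_wait */
        }
        pthread_join(t, NULL);
    }
    channel_close(&to_worker);
    channel_close(&from_worker);
    return result;
}
```

The wakeup here is the kernel marking the pipe readable, so there is no userspace signal to lose. The cost is a couple of syscalls per handoff, which is the microbenchmark slowdown mentioned above. With several consumers on one channel you would want a non-blocking read end, since more than one thread can be woken for the same handle.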
I'm kinda surprised none of the model-checking people have tried verifying glibc's NPTL implementation; I guess it might be hard to formalize futex's behavior?
This bug is really, really subtle. The posted repro, which does nothing but try to stress this bug, can take hours to deadlock on some machines. Confirming that you've fixed it properly is a bit harrowing.
Even as an OCaml snob I haven't a clue how you would verify the behavior of this C library formally.
https://sourceware.org/bugzilla/show_bug.cgi?id=25847