Is there a known recent Linux locking bug that affects the OCaml runtime? (ocaml.org)
108 points by lelf on Oct 12, 2020 | 17 comments



It's an interesting read, but the synopsis is a glibc bug[1] in 2.27 that causes pthread_cond_signal to sometimes fail to wake waiting threads. This manifests as hangs/lockups that may seem to happen randomly in different application runtimes, including reports from users of .NET Core, Python, and OCaml.

[1]: https://sourceware.org/bugzilla/show_bug.cgi?id=25847
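
For readers who don't use condition variables every day, here is a minimal sketch of the wait/signal handoff that this bug undermines. It is not a reproducer for the glibc issue; the names `ready`, `waiter`, and `run_handoff` are made up for illustration. The waiter re-checks a predicate in a loop, which protects against spurious wakeups, but no loop can help if a signal is never delivered to a thread that is already asleep.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready_cond = PTHREAD_COND_INITIALIZER;
static bool ready = false;   /* the predicate, always accessed under `lock` */

/* Waiter: sleeps until `ready` becomes true. */
static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!ready)                         /* loop guards against spurious wakeups */
        pthread_cond_wait(&ready_cond, &lock);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Hypothetical driver: returns 0 once the waiter has been woken and joined. */
int run_handoff(void)
{
    pthread_t t;
    if (pthread_create(&t, NULL, waiter, NULL) != 0)
        return -1;

    pthread_mutex_lock(&lock);
    ready = true;                          /* change the predicate first... */
    pthread_cond_signal(&ready_cond);      /* ...then wake one waiter */
    pthread_mutex_unlock(&lock);

    return pthread_join(t, NULL) == 0 ? 0 : -1;
}
```

If the wakeup were lost while the waiter was blocked in `pthread_cond_wait`, `pthread_join` would never return, which matches the kind of hang described above.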


Looks like this fix takes care of the problem. Now the challenge is to get it rolled out everywhere: https://sourceware.org/bugzilla/attachment.cgi?id=12484


The comment there makes it sound like it might be giving up on the stealing optimization entirely? If that's the case then this might not be the fix to roll out everywhere.


Do you want to wake up at 3am to determine that yes, the broken pthreads implementation really is broken?


Do you think a desire to put more than five minutes into designing the patch means I don't want it fixed, or don't want it fixed quickly?


No, this is only in the undo path when stealing already failed. But waking all waiters is kind of crazy.

Microsoft’s equivalent API (WFMO) is free of spurious wakes; I can’t imagine how much harder that makes everything.


If pthread_cond_signal() is broken, pthreads is useless for all but the most trivial threaded programs.


Nail on head. Pthreads is an extremely difficult API to use, owing first to a poor choice of primitives, the consequences of which are that none of the implementations are particularly good and their interactions with the rest of the machine/model are ill-defined.

For example: You know how hard it is to close() in a multithreaded program? [1] And that’s just one system call, one almost everyone thinks they understand. What hope is there for all the other pthreads code?

For cond/etc: One trick I use is handle passing via an epoll[2]. This works much more reliably than the pthreads API, and whilst it’s slower in a microbenchmark that spins the pthread_cond/mutex/etc APIs, real programs don’t do that (or don’t need to), and so it is usually a clear net win.

[1]: http://geocar.sdf1.org/close.html

[2]: http://geocar.sdf1.org/fast-servers.html
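
A rough sketch of what handing a file descriptor to another thread through epoll can look like on Linux. This is only an illustration under assumptions, not the code from the linked article: `worker_epfd`, `worker`, and `run_epoll_handoff` are hypothetical names, and a pipe stands in for whatever handle a real server would pass. The worker blocks in `epoll_wait` on its own epoll set, and another thread wakes it by registering a readable descriptor with `epoll_ctl`.

```c
#include <errno.h>
#include <pthread.h>
#include <sys/epoll.h>
#include <unistd.h>

static int worker_epfd = -1;   /* epoll set owned by the worker thread */
static char received = 0;      /* written by the worker, read after join */

/* Worker: sleeps in epoll_wait until some thread hands it a readable fd. */
static void *worker(void *arg)
{
    (void)arg;
    struct epoll_event ev;
    int n;
    do {
        n = epoll_wait(worker_epfd, &ev, 1, -1);
    } while (n < 0 && errno == EINTR);     /* retry if interrupted by a signal */

    if (n == 1) {
        char c;
        if (read(ev.data.fd, &c, 1) == 1)
            received = c;
    }
    return NULL;
}

/* Hypothetical driver: returns 0 if the worker received the handed-off fd. */
int run_epoll_handoff(void)
{
    int pipefd[2];
    pthread_t t;

    worker_epfd = epoll_create1(0);
    if (worker_epfd < 0 || pipe(pipefd) != 0)
        return -1;
    if (pthread_create(&t, NULL, worker, NULL) != 0)
        return -1;

    /* Make the read end readable before handing it over. */
    if (write(pipefd[1], "x", 1) != 1)
        return -1;

    /* The handoff: adding the fd to the worker's epoll set wakes the worker. */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    if (epoll_ctl(worker_epfd, EPOLL_CTL_ADD, pipefd[0], &ev) != 0)
        return -1;

    pthread_join(t, NULL);
    close(pipefd[0]);
    close(pipefd[1]);
    close(worker_epfd);
    return received == 'x' ? 0 : -1;
}
```

The wakeup here goes through the kernel's epoll readiness tracking rather than through a userspace condition variable, so no `pthread_cond_*` calls are involved in the handoff.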


I believe the latest POSIX spec has fixed the close issue. Long story is https://www.austingroupbugs.net/view.php?id=529


> One thread, however, is blocked in libc read on stdin. When I hit Enter, the read finishes and the whole program unhangs.

Seems like a problem easily solved with a Cron job!

/me exits building


I'm kinda surprised none of the model-checking people have tried verifying glibc's NPTL implementation; I guess it might be hard to formalize futex's behavior?


This sounds to me like a good PhD project and/or series of research papers.

Disclaimer: I'm not in the model-checking community.

If it's hard to formalize futex's behavior, all the better for academics, though it's unfortunate for the rest of us.


This bug is really, really subtle. The repro that was posted that just tries to stress this bug can take hours to deadlock on some machines. Confirming that you've fixed it properly is a bit harrowing.

Even as an OCaml snob I haven't a clue how you would verify the behavior of this C library formally.


I'm going to go out on a limb and guess mbacarella works at Jane Street?



Used to, I left in 2015. Still using OCaml!


Still experiencing KDE Plasma lockups in Debian Bullseye (11).

It’s time for me to hook up a dumb terminal to its serial port and leave it there.



