Make your program slower with threads (2014) (brooker.co.za)
152 points by unchar1 23 days ago | 104 comments



As soon as I started reading this, my mind said, "random() has global shared state, so of course more threads will make it slower". I say this here not to attempt to demonstrate my oh-so-special advanced knowledge, but just to marvel at how bad our abstractions are sometimes. Maybe "bad" is the wrong word. Leaky, inadequate, inscrutable.

"Give me a random number" feels like it shouldn't require all this overhead and hidden complexity under the hood. But unless you've read a lot about implementation details, or have been bitten by that exact problem before, how would you know? You wouldn't, of course.


I love the fancy modern type systems and all of the superpowers that they give you. But this is an area that I feel isn't adequately explored.

If you call a function that calls a function that has side effects, or that closes the file descriptor you're using, or could panic, or that needs global shared state, or spins up a thread, or allocates, we don't have a good way to deal with that. There are some partial and underutilised solutions like monadic IO and linear/affine types and, uh, not having panics. I have some (bad) ideas but I think this is a space that's worth playing in.


Isn't this the whole point of what Rust is trying to accomplish? I haven't yet worked with Rust, but from what I hear they make it very hard (impossible?) to have memory issues and things like deadlocks. Sounds great in theory, but no idea if this is the reality of using it.


No. Rust has no controls for side effects or allocations.


Rust prevents data races, but not deadlocks.


It is, yeah. Which is why, when your code compiles, 99 times out of 100 it just works. I don’t think people understand that this is one of Rust’s best features.


I had the same thought years ago, I think some people have looked into it a little, but nothing popular yet.

Imagine if you could look at a function, and know, undeniably - this, and anything it can call, absolutely cannot alter the file system directly or make a network call or spawn a process. Maybe instead of just types, you need to supply capabilities (much like interfaces on classes). Sounds like it could be a real pain. Would make auditing easier though.
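A minimal sketch of what that could look like in C, with a made-up FsCap capability passed as a struct of function pointers (in plain C this is only a convention, since nothing stops code from calling open() directly, but a language could enforce that the capability is the only way in):

    #include <stdio.h>

    /* Hypothetical capability: the handle through which a callee is
     * allowed to touch the file system. */
    typedef struct {
        FILE *(*open)(const char *path, const char *mode);
        int   (*close)(FILE *f);
    } FsCap;

    /* Can read a file only because it was handed an FsCap; a function
     * whose signature takes no FsCap could be audited as "no file
     * system access" at a glance. */
    static long file_size(FsCap fs, const char *path) {
        FILE *f = fs.open(path, "rb");
        if (!f) return -1;
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        fs.close(f);
        return size;
    }

    int main(void) {
        FsCap real_fs = { fopen, fclose };  /* caller grants full access */
        printf("%ld\n", file_size(real_fs, "/etc/hostname"));
        return 0;
    }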


> alter the file system directly or make a network call or spawn a process.

Those three cases are excluded in pure functions.


I've been waiting for functional programming to have its day for more than 20 years. It just seems fundamentally at odds with how most people want to work.

You can make individual functions pure in any language, but do many languages enforce this at a core level? Or do we have to behave ourselves and hope nothing we call is changed by another programmer later?


> I've been waiting for functional programming to have its day for more than 20 years.

The tech is there. It sounds more like functional programming has been waiting for you for 20 years.

> It just seems fundamentally at odds with how most people want to work.

"Most people", or "you"? If you believe in something, stand up for it.


I've used it. I've never had a job using it. Most people haven't.


Capabilities help a lot if you just want to permit/deny the action. But maybe I want to know more than "can it allocate"; I also want to be able to pass in an arena to make it allocate in a particular place, and I might want that to be different in different parts of the code. Generics sort of help here if you want to explicitly thread your allocator all over the place like Zig, but allocation is so common in normal code that, ergonomically, I want it to be invisible in the common case and controllable in the uncommon case.
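A rough sketch of that explicit-allocator style in C (names made up for illustration); the ergonomic complaint is that every call site has to carry the extra parameter:

    #include <stdlib.h>
    #include <string.h>

    /* A pluggable allocator: a function pointer plus a context pointer,
     * so callers can hand in the default heap, an arena, etc. */
    typedef struct {
        void *(*alloc)(void *ctx, size_t size);
        void  *ctx;
    } Allocator;

    /* Default allocator backed by malloc. */
    static void *heap_alloc(void *ctx, size_t size) {
        (void)ctx;
        return malloc(size);
    }

    /* Library code allocates through the parameter instead of calling
     * malloc behind the caller's back. */
    static char *dup_string(Allocator a, const char *s) {
        size_t n = strlen(s) + 1;
        char *copy = a.alloc(a.ctx, n);
        if (copy) memcpy(copy, s, n);
        return copy;
    }

    int main(void) {
        Allocator heap = { heap_alloc, NULL };
        char *s = dup_string(heap, "hello");
        free(s);
        return 0;
    }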


How about spawning a separate process and using seccomp, or having the child process run as a different user, perhaps?

There are a couple of advantages to doing it through the OS. For one thing, the implementation is shared across programming languages. For another thing, it's potentially more secure against an attacker trying to inject code.

I guess the disadvantage of doing it at the OS level is that you might have to write OS-specific code.


I think that mostly works (as an aside, the last time I looked at this on Windows, we couldn't do something like sudo reliably because there was no secure way to provide the password without user interaction - it seems they've just released sudo for Windows in Feb this year). The OS can support this the way a mobile OS limits the capabilities of apps, which is more secure overall.

But I was also thinking of the development and refactoring side of things. Guarantees and assurances. Sometimes I know the generic code I wrote is 'OK' but the C# compiler says no - you're not being specific enough, I can't cast that - so I have to do better. A while ago I was trying to track down programs that loaded data from certain network shares, and it was a bit rough because these calls were all very deep, and very occasional. Traces (or blocks) of system calls only work when the code fires, but my task was to find and replace them all, so we could delete the share (for various reasons string searches only got part of the way). If 'network access' was tagged at a high level, I could follow the trail to see what we actually access and report on that. We had a similar issue identifying jobs that called out to external APIs. We left a log running for a while to catch some, but some only run a few times a year and we didn't have that long. Adding a firewall exception later could take days and these jobs have turnaround times.

I don't know if this is at all feasible. It's just some thoughts I had back then.


There have been papers that limit the capabilities of a program within a certain context (i.e., within a function), some of which were implemented at the OS level and enforced with hardware protection (e.g., address space isolation).

The difficulty is that doing this sensibly requires new OS abstractions. It's one thing to put one in a research paper, but it's really tough to get this kind of thing into an OS nowadays. Another difficulty is that OS-level abstractions, with a few exceptions, cannot be used without a system call, which is cheaper than it used to be but still much more expensive than a plain function call.

A third problem is just that the programming language has a lot more semantic information, or at least it _can_ have a lot more semantic information, than can be fluently transmitted to the OS. There are approaches to deal with this (like having a richer OS/userland interface, which is almost impossible to get into an OS). In general, plugging into and extending the type system of some userland is probably going to be a much easier route to take.

If the research world worked differently than it does, I'd have loved to continue previous explorations on OS / userland interfaces.


This happens at runtime. I want something in the type system.


Algebraic effects?


As others have already mentioned, the main culprit here is libc and its assumption that it is running in a single-threaded environment.

The libc was designed with heavy use of global variables in mind. Modules in C are designed as compilation units with static global variables and exposed functions that operate on them.

There's a reason many big projects (and programmers) use their own version of the standard library instead of libc.


Are there any good, instructive open source examples of custom libc replacements you could link? Especially ones that circumvent the single-threaded assumption of libc; I'd love to use the code to educate myself.


musl is one of the better ones [0].

None other than John Regehr recommends studying it [1].

[0] https://en.wikipedia.org/wiki/Musl

[1] https://blog.regehr.org/archives/1393


How would anyone come to the conclusion that an API that explicitly allows setting a global seed value without mentioning anything about thread-local state, wouldn't contend for some kind of shared resource?


> "Give me a random number" feels like it shouldn't require all this overhead and hidden complexity under the hood.

It doesn't. The libc is the real problem. The libc and its global state. Even something as simple as errno creates global state problems just like this one. There are buffers all over the place.

Freestanding C is a better language just by virtue of not linking against libc.


At least in glibc, errno is thread local.


Probably the lesson I'd take away from this example is that libc specifically often has surprising shared state, and you have to be really careful about using it from multiple contexts. And of course, you should avoid hammering a mutex from multiple threads -- but you already know that.


All the different flavors of libc, apart from their outdated and cryptic naming style, have many hidden flaws like this one. Heavy multithreading was not a central design concern when libc was designed, so there are many instances of global state, with workarounds to avoid bugs at the cost of potentially serious performance issues.

My opinion is that the standard libc should be used much more as a fallback than a default.

And this is especially true of random number generation: there are much better generators out there, and some of them are only a few lines of code.

https://www.romu-random.org/code.c


That random number generator depends on global mutable variables without any protection against concurrent access. It isn’t suitable for production use.


The SSL cert for that domain expired over 2 years ago: Mar 4 23:59:59 2022 GMT.


http://web.archive.org/web/20210118071040/http://www.romu-ra...

What this will do is download the .c file rather than just show it to you. Not sure I understand that entirely, but just so you know.


I assume the response of the original site depends/depended on the user agent and only showed a HTML page for user agents it recognized as a browser, which doesn't include the IA bot.


Not supporting multi-threading is not a design flaw; it's a conscious design choice. Multi-threading is hugely overrated, as the UNIX kernel already distributes workload across processes and processors. It's a performance hack.


Threading is indeed a performance hack - the whole idea of multiple processes sharing an address space is pretty gross and it's very unsurprising that it leads to many bugs - but there's also a reason why it was universally adopted despite the problems. I do think there's a fair argument that libc's design was not a mistake; the thread-unsafe functions worked very well in the environment they were designed for and it wasn't until decades later in a very different environment that they became a problem. Nonetheless, they are a problem now.



Shared memory address space is a couple (IIRC) orders of magnitude faster to access than any IPC mechanism. Even on modern very fast machines IPC will add noticeable overhead if you tried to replace threads with processes.


The question would be how much data you are sending and how frequently. You're still advertising performance. Even so, go ahead and set up shared memory between your processes. Better to explicitly share a part than to implicitly share the whole.


Using processes to manage threading is more of a hack than having threaded processes.


"manage threading"

What does that mean? Do you mean optimizing a piece of code to take advantage of multiple cores? Or do you mean describing multiple independent threads of execution? The latter works best when it has its own memory address space and you can assume that "if then else" works as you would expect.


This reminds me of the fact that Go 1.20 added a clever optimization: If you call `rand.Seed` to seed the global RNG, it's assumed that you want it to produce a single deterministic sequence across all threads/goroutines, which requires locking. But if you never explicitly seed it, you get a different implementation that uses thread-local RNG state without locking.

(In practice, there is no way for a program to distinguish this from a single RNG that was initialized with a truly random seed that you can't recover, so it doesn't break backwards compatibility.)

Unfortunately, the C standard prescribes that failing to call `srand` must be equivalent to calling `srand(1)`, so I don't think glibc can get away with this trick.

Mentioned here: https://go.dev/blog/randv2


> Unfortunately, the C standard prescribes that failing to call `srand` must be equivalent to calling `srand(1)`, so I don't think glibc can get away with this trick.

I think it could. Just supply two implementations of rand(): a strictly conforming one without this optimisation, and a less than strictly conforming one with it. The second is implemented in a function with a name like `__gnu_rand_optimized`, and then if you `#define __GNU_RAND_OPTIMIZED` before `#include <stdlib.h>`, then that header does `#define rand __gnu_rand_optimized`.

All the C standard requires is that you provide a mode of operation which strictly conforms to it. A platform is allowed to provide configuration options which cause it to be violated in various ways. Strictly conforming code won't turn on those options and so will work as the standard prescribes. If you choose to turn on such an option, that's a non-portable extension, and how it works is between you and the platform, the C standard isn't involved.
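A sketch of how the header side of that idea might look, using the commenter's hypothetical names (nothing like this exists in glibc today):

    /* Hypothetical excerpt from <stdlib.h> */

    int rand(void);                  /* strictly conforming: global state,
                                        behaves as if srand(1) was called */
    int __gnu_rand_optimized(void);  /* per-thread state, no lock, no
                                        implicit srand(1) guarantee */

    #ifdef __GNU_RAND_OPTIMIZED
    #define rand __gnu_rand_optimized
    #endif

User code that defines __GNU_RAND_OPTIMIZED before including the header gets the non-conforming variant; everything else keeps the standard behaviour.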


Add a new function that modifies behavior of an existing one.

Or add a new function that provides behavior different from the existing one.

The latter is simpler. And safer. And more flexible. And faster, because you don't have a condition to check.


Exactly my issue with Go.

For a language that talks so much about simplicity, it puts a lot of effort into inventing clever wheels, with mixed results. I recently realized that SQL connections are thread-safe and automagically pooled; I would have preferred more manual control. Don't get me started on date/time formats.


I don't see why it requires locking. You could easily have a TLS seed that gets initialized from the main thread's RNG, which is seeded with Rand.seed. No locking required. It's not like reading from a single RNG state from multiple threads with locks is any more deterministic than that approach, since you fundamentally would have to synchronize at a higher level to guarantee that the RNG is read deterministically across all thread orderings.


In Go or any other runtime that schedules code implicitly on multiple OS threads, this wouldn't work: the application writer can (and should, if they need deterministic results) ensure that reads from the RNG are done in some specific order, but which of these reads will run on which thread is still non-deterministic, and so will be the results if the library uses thread-local RNG state.


Sure, a work-stealing runtime presents a problem where you wouldn't use TLS but instead whatever TLS equivalent is appropriate, as TLS plus work stealing is a bad idea generally anyway. You want to use fiber-local storage (e.g. dispatch_queue_set_specific for libdispatch, or whatever mechanism is available for goroutine-local storage, like Context or sync.Map according to ChatGPT). Btw, if you don't use goroutines then this problem doesn't come up, I believe, as that's the only situation it comes up in. And since Rand is part of the stdlib, I'm sure they could leverage lower-level hooks to accomplish fiber-local storage in an efficient manner.


I think the point is that the library probably should assume that if you called `Rand.seed` and/or `srand` it's because you intended to get a specific sequence of random numbers out of it.


What did I write that suggests you wouldn’t get a deterministic specific sequence equivalent to a lock within the RNG? What I said is if you fail to synchronize the surrounding code’s order of calls into the RNG, then you’re going to have non-determinism anyway even with locks within the RNG.

For example:

    T1:
    if rng() < 0.5 {
       rng()
    }

    T2:
    rng()
The specific numbers returned by calls to rng() depend on the thread ordering UNLESS the user carefully has higher level synchronization themselves to guarantee the ordering. This shows that the locking within the RNG is insufficient and the better performing option that maintains equivalent invariants should be selected (i.e. initialize the TLS RNG state for the new thread from a call to read entropy from the current thread’s RNG state).


I'm trying to figure out how you switch "into" thread-local prng mode after a call to srand.

Case 1: T1: srand(); rng(); rng() T2: no calls to rng

Case 2: T1: srand(); T2: (spawns after the srand call) rng(); rng()

How would T2 know that it should produce the seeded sequence as opposed to one based on its own prng?


srand sets a TLS flag, and the thread spawning code checks that flag on-thread to know how to initialize the off-thread RNG.
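A rough pthreads sketch of that scheme, under my own assumptions about the details (a thread-local "was seeded" flag plus per-thread state, with the parent deriving the child's seed at spawn time):

    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>

    static _Thread_local int      rng_seeded;  /* set by my_srand() on this thread */
    static _Thread_local uint32_t rng_state;   /* per-thread PRNG state */

    static void my_srand(uint32_t seed) { rng_state = seed; rng_seeded = 1; }

    static uint32_t my_rand(void) {            /* any per-thread generator works */
        rng_state = rng_state * 1664525u + 1013904223u;
        return rng_state;
    }

    struct spawn_args { void *(*fn)(void *); void *arg; uint32_t seed; int seeded; };

    static void *trampoline(void *p) {
        struct spawn_args a = *(struct spawn_args *)p;
        free(p);
        if (a.seeded)
            my_srand(a.seed);                  /* parent was seeded: derive child state */
        else
            my_srand((uint32_t)(uintptr_t)&a); /* placeholder "entropy" for the sketch */
        return a.fn(a.arg);
    }

    /* Spawn wrapper: checks the parent's TLS flag and, if set, hands the
     * child a seed drawn from the parent's stream. */
    static int my_spawn(pthread_t *t, void *(*fn)(void *), void *arg) {
        struct spawn_args *a = malloc(sizeof *a);
        if (!a) return -1;
        a->fn = fn; a->arg = arg;
        a->seeded = rng_seeded;
        a->seed   = rng_seeded ? my_rand() : 0;
        return pthread_create(t, NULL, trampoline, a);
    }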


Personally I think it's kind of silly that we've normalized the idea of PRNGs-as-global-state. I'm not asking everyone to go Full Monad, but please just consider giving each Thing that needs access to randomness its own independent PRNG.


The world seems full of APIs that make it easy to avoid global state. Most of my usage of randomness has been through things like rust's rand::thread_rng or Java's ThreadLocalRandom. (In fact I think even java.util.Random uses its own state: the docs call out perf issues but only if you share the same instance across threads.)

Honorable (?) mention goes to (client-side) JavaScript - it's harder to have threading issues if you only have a single thread to work with!


I think Linux has recently seen quite a lot of work on speeding up its PRNGs, but something similar to what you proposed didn't end up making it into the kernel due to a few issues, and I think at the time it ended with this:

"Linus Torvalds, though, argued that none of this effort was worth it. Speeding up random-number generation, beyond a point, is not the kernel's job, he said; instead, that should be left to the C library."

https://lwn.net/Articles/919008/

I've not been following the mailing lists so I don't know if that ever changed since then. The dev (Jason Donenfeld) seemed to think there was a way to get it working without the issues that blocked the improvement from getting merged last time though.


it's not just random() - it's stdio, and malloc/free/new/delete (though some implementations use per-thread pools with added complexity)


Sounds like it’s time to go Full Monad then.


Blaming PRNGs sounds like bikeshedding when it's the hot paths in user-land that often do too much, or do it badly without respecting the caches, the branch predictor, and/or the compiler, or don't diagonalize the problem sufficiently to break the hot section into independent steps that can run concurrently and in parallel as much as possible. Some classes of problems have no known parallel solution, and that's okay and even beneficial in some cases.

Also, if you feel that strongly, please don't roll your own crypto and use a CSPRNG.


If you have multiple threads working a queue of like-kinded jobs, you've given up the determinism of using any sort of global seed.

You could pull different-length runs of pseudorandom numbers from the different seeds.


That's what counter-based RNGs are for.
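A minimal sketch of the idea in C, using a SplitMix64-style mixer over a (stream, counter) pair; real counter-based generators such as Philox or Threefry are stronger, this just shows the shape:

    #include <stdint.h>

    /* Counter-based RNG: the "state" is just which index of which stream
     * you are asking for, so threads can draw values independently and
     * reproducibly without sharing any mutable state. */
    static uint64_t cbrng(uint64_t stream, uint64_t counter) {
        uint64_t z = stream * 0x9E3779B97F4A7C15ULL + counter;
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
        return z ^ (z >> 31);
    }

Thread i draws its k-th value as cbrng(i, k): no locks, and rerunning with the same (i, k) reproduces the same number.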


How does being a monad intersect with scope? I'm genuinely curious.


The Random Monad in Haskell allows you to write pseudo-random code without caring where the prng comes from. It's only when the caller runs evalRand/runRand that it must provide the prng. So you can't possibly screw up the random code by sourcing a global prng even if you wanted to; rather, it's the caller that has full control over the prng.

https://hackage.haskell.org/package/MonadRandom-0.1.3/docs/C...


Monads allow representing, in the type of a function, which global values it reads from or writes to.


Monads do not imply global values or variables.


But the stated paradigm is one thing they allow. See the Reader, Writer, and State monads.


PRNGs are one of those things that need to have global state if you want to get decent statistics out of them. You cannot have sets of threads fetching the same value from the PRNG - it will totally destroy how random the PRNG is. Thread-local PRNGs with separate seeds is the way to go if you need parallelism, and it was the author's solution, but I can easily see not knowing that and running into this.
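For the libc flavour of this, a minimal sketch using rand_r() (POSIX, not ISO C), which keeps its state in a caller-owned seed so each thread can hold its own:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Each thread owns its seed; rand_r() never touches shared state,
     * so there is no lock to contend on. Real code should seed from a
     * better entropy source than the thread index used here. */
    static void *worker(void *arg) {
        unsigned int seed = (unsigned int)(uintptr_t)arg ^ 0x9E3779B9u;
        unsigned long sum = 0;
        for (int i = 0; i < 1000000; i++)
            sum += (unsigned long)rand_r(&seed);
        printf("thread %lu sum %lu\n", (unsigned long)(uintptr_t)arg, sum);
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, (void *)(uintptr_t)(i + 1));
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }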


You can't seed each thread with the same value -- that's all you're saying, right?


Result distribution (like averaging to zero) only happens at scale. Even if you use a different seed every time you might not see the proper distribution when looking across different generator instances.


If you're running into contention with multiple threads on a shared PRNG, you're probably at sufficient scale.


Sure. Just saying that, for example, if you spawn threads and use the thread ID as the seed but pull a single value per thread, that's not sufficient.


Correct, you must use different, uncorrelated seeds for each thread if you want this to work. You can seed them all from one CSPRNG, for example, or from the Linux entropy service.


The generator state should either be part of the API or be thread local, but there are probably complex implications regarding the long legacy of libc.


> You cannot have sets of threads fetching the same value from the PRNG - it will totally destroy how random the PRNG is.

This is the rule you need to follow, but I don't understand what you're saying about global state.

The broken version has global state.

Non-broken versions can have global state or not have global state, depending on implementation choice. And that choice is separate from whether they have good performance.


How do you guarantee that you won't get the same value from a PRNG without global state? A PRNG is a state-updating machine that spits out a value based on that state.


Thread local state

You don't need it global

Seed it once per thread from /dev/random and you are done
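A small sketch of exactly that in C: a thread-local xorshift state seeded once per thread via getrandom() (glibc's wrapper around the kernel entropy source; reading /dev/urandom directly works the same way):

    #include <stdint.h>
    #include <sys/random.h>   /* getrandom(), glibc >= 2.25 */

    static _Thread_local uint64_t prng_state;  /* 0 means "not seeded yet" */

    static uint64_t thread_rand(void) {
        while (prng_state == 0)                /* seed lazily, once per thread */
            getrandom(&prng_state, sizeof prng_state, 0);
        /* xorshift64: tiny and fast, fine for non-cryptographic use */
        uint64_t x = prng_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        return prng_state = x;
    }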


As I suggested in the first comment here, yes. I believe the GP believed that you could have one PRNG that did not have global state for its consumers.


No, I'm not saying that.

I'm saying that if you have one PRNG, then you have global state no matter how it's designed. This is true whether you write it so that you get decent statistics, or you write it so you get tons of duplicate values.

And many of the fixes remove global state. Per-thread PRNGs are one option, but so are PRNGs that are used by specific batches of objects.

So, the straightforward broken option has global state, and the non-broken options might or might not have global state.

Which means I have no idea what you're talking about when you use the phrase "need to have global state". What is the PRNG API where that need exists, and what does the version without global state look like?


Every PRNG algorithm has state that is shared among all of its consumers. That is basically true by definition. Put another way, a PRNG is an iterative algorithm that compresses a very long random-looking (the mathematical term is "equidistributed") sequence. That iterative algorithm needs to maintain its position in the sequence, and there's no alternative. There is no algorithm for a PRNG that does not have state that is shared among all of its consumers.

The only way to make a PRNG that does not share state between its consumers is to shard per-consumer. The PRNG in libc doesn't do that, it just uses a lock internal to the PRNG to keep things safe.

You could attempt to make a concurrent PRNG that has atomic-sized state and use CAS loops when generating numbers to keep the consumers in sync, but that's probably worse than using a lock.
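For reference, that CAS version would look something like this sketch with C11 atomics over a 64-bit xorshift state; every call is a compare-and-swap loop, which under contention tends to be no better than the lock it replaces:

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint64_t shared_state = 0x123456789ABCDEF0ULL;

    /* Lock-free shared PRNG: load the state, compute the next value, try
     * to publish it. If another thread won the race, retry from its
     * result, so no value in the sequence is handed out twice. */
    static uint64_t shared_rand(void) {
        uint64_t old = atomic_load_explicit(&shared_state, memory_order_relaxed);
        uint64_t next;
        do {
            next = old;
            next ^= next << 13;
            next ^= next >> 7;
            next ^= next << 17;
        } while (!atomic_compare_exchange_weak_explicit(
                     &shared_state, &old, next,
                     memory_order_relaxed, memory_order_relaxed));
        return next;
    }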


> Every PRNG algorithm has state that is shared among all of its consumers. That is basically true by definition.

Right, but again why did you say it needs to have global state "if you want to get decent statistics"?

What is the alternative to having global state that doesn't have decent statistics?

That "if" is the part I'm confused about.

The part of your comment directly after that sounds like you're describing the situation where clients are data-racing the PRNG and generating repeat numbers and other problems. But that's a problem specifically because of global state being used incorrectly, not something that happens as an alternative to global state.


The alternative is not being random, and "decent statistics" means "being pseudorandom" - the statistics are inseparable from the randomness. You can make a PRNG that has data races, but from a randomness perspective, that is equivalent to just using /dev/null as your random number generator.


Okay, I think I understand. You're being extremely harsh in how you consider brokenness, so if it's broken it might as well not be generating anything at all, no state needed.

I was considering "fairly broken PRNG" as its own significant group, and those need state too if you treat them as separate from "this literally gives the same number every time". But if those all go in the same bucket, then the comparison you made makes sense.


Algorithms for producing the same stream of random numbers in serial or parallel have been well known for many years. Perhaps the easiest in concept is "encrypt the sequence 1, 2, 3, 4, 5, ...".


The author of the blog wanted each thread to have different random numbers, not the same stream of numbers.


Is there a reason why the RNG function would not be able to simply take the random seed as one of its arguments?


Thanks, you answered the question I came here to ask, i.e. why did it need a lock for a random number, what are the tradeoffs of using the non locking version. Makes sense.


I feel like it's a rite of passage for every engineer to write a program with the assumption of "more threads = faster", only to find their program going considerably slower.


Similarly, vectorizing some code doesn't always speed things up. I'm dealing with this problem at work right now. Sometimes complicated loops are pretty well optimized already! That said, vectorization can often make things easier to read.


There was a time when that was how you achieved throughput. IPC was janky and unreliable and often slow, and threads offered a cleaner interface for cooperative multitasking on a single runtime. That’s changed as IPC and async improved, and the dangers of threading led to more safety in threading, leading to slower threading performance.


“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.”- Donald Knuth, The Art of Computer Programming


Sadly, this no longer really applies. Programmers today do absolutely grotesque things unimaginable to Knuth of that era. One look at any modern pipeline that converts to and from JSON a dozen times, sometimes even inside the same process between modules, would make Knuth recall all of that.

Programmers now have done the impossible: written code so consistently bad that there are no hotspots, because the whole codebase is non-performant trash.


I was going to get a t-shirt printed with “Knuth Was Wrong”

We’re going to be in a world of hurt when the newest process node doesn’t save us.


Knuth was right! Profiling would indicate whether your JSON serialization/deserialization is actually a problem. Maybe it is. Maybe it isn't.


There are key parts left out of that quote that change the tone quite a bit. Here is the full one:

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." - Donald Knuth


Knuth was writing in an era when scientific programs tended to be dominated by inner loops. There was an input part, a compute part that did some number-crunching, and an output part. Only performance in the compute part mattered.

Many programs today have no inner loop. Compilers were the first important programs that didn't. Most interactive programs have an outer loop processing events, rather than an inner compute loop.

Note that "AI" programs are much more like the scientific problems of Knuth's era. All the compute is in tight loops wrangling matrices.


Well the last time the thread thing bit me was actually a case where it wasn't premature optimization; I had written the code in such a way that I thought was pretty enough, and we were hitting bottlenecks, so my genius brain thought "ok I'll make this use 20 threads and it'll go faster". It did not.


Pretty sure I’ve made exactly the same mistake. I feel like everyone who ever writes concurrent code learns that lesson at some point. It’s absolutely astonishing how much mileage one can get out of thread-per-core-fed-by-work-queues architectures.


Oh, with modern architectures this sentence is so wrong. If you don't think about grooming the hardware in the right way from the get-go, you will miss peak performance by a couple of orders of magnitude, period. See how games are developed, and I can tell you they don't just use OOP with virtual interfaces and neat indirections all over the place and then think, oh, it's OK, we'll optimize after the fact.


Many, many, many years ago, my employer needed me to write some code to read 8 COM ports in OS/2. Then based on that, do some operations.

So, me, being brand new to OS/2, immediately wanted to try out the new multi-threading features. Without knowing anything about locks and spins or deadlocks or shared state, I plunged into it just using the OS/2 documentation and a programming book I bought. Keep in mind, this was a single CPU machine, probably an Intel 486dx or something like that.

I spent the next couple of weeks debugging all sorts of things that should not have been happening. There were crashes, lockups, slowdowns, missing variables, etc. that I couldn't resolve.

After a while, I gave up and did what the OP did: just put everything in a loop and it solved everything.


I don't know what we can learn from analyzing a poorly written program aside from what not to do.

If the program was properly written there wouldn't be any issues with multithreading.

The title of the article gives the false impression that multithreading is slowing programs in the general case. In fact poorly written programs are slow and that is true for both multithreaded and singlethreaded programs.

If your program runs slow but you don't need it to run fast, don't bother.

If your program runs slow and you need it to run fast, go see what optimizations can be done.


> the false impression that multithreading is slowing programs in the general case. In fact poorly written programs are slow

From my experience, in the general case programs are poorly written, so yeah, multithreading makes them 1) more complex, more poorly written, and 2) slower in many cases.


I've had trouble with futex congestion. Worst case was under Wine. Wine has DLL files which emulate various Microsoft libc-level functions. They have their own memory allocator. It has three nested locks, some of them have spinlocks, and at least one is locked while a "realloc" call is in progress and a buffer is being recopied.

Now try a compute-bound multi-threaded program with more threads than CPUs that does Rust vector "push" operations. Push causes an array to grow, locks get set, other threads hit locks, threads go into spinlock mode, threads inside spinlocks context switch, lose control, and performance drops by two orders of magnitude. Most CPU time goes into spinlocks.

I'm not convinced that user-space spinlocks are a good idea. If you're never compute-bound, they can work, but if you run out of CPU time and a context switch occurs inside a spinlock, it's all downhill from there.

Windows itself doesn't do this; it has a different locking strategy.


Discussed at the time:

Make your program slower with threads - https://news.ycombinator.com/item?id=8711162 - Dec 2014 (47 comments)


This is the exact same problem I encountered back when I was taking the Parallel Computing course, except I was an absolute novice and had no idea about debugging or profiling. I remember it took me several days speculating and searching for an answer to this issue on Google, and I finally found an explanation on Stack Overflow, which also suggested using rand_r. It was one of the first instances where I was introduced to solving and debugging a difficult computer programming problem, and it still sticks with me to this day.


Good article; it also has references to glibc's source. I've never dived deep into the random() library function before. This is from the man pages:

"The random() function should not be used in multithreaded programs where reproducible behavior is required. Use random_r(3) for that purpose."

Perhaps the notes should be updated to state that the performance of multi-threaded programs is also affected due to the shared global resource?


This needs 2014 in the title.


Same happened to me. I used random() and then deployed to production. I could see the CPU on Grafana going nuts. That was one rollback I never forgot.


If you used Zig you wouldn't have this problem: just grab your entropy from std.crypto.random and you have yourself a thread-local RNG, properly seeded, with fork safety.


A sort isn’t necessary. Duplicates can be counted in linear time by using additional memory (e.g. with a hashmap)

However, the point of the article still stands.
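A sketch of the linear-time approach in C, counting duplicates among 32-bit values with a small open-addressing hash set (the sizing and hash here are illustrative assumptions, not the article's code):

    #include <stdint.h>
    #include <stdlib.h>

    /* Count how many of the n values have appeared before, in expected
     * O(n) time. Slots store value+1 so that 0 can mean "empty"
     * (assumes values < UINT32_MAX). */
    static size_t count_duplicates(const uint32_t *vals, size_t n) {
        size_t cap = 16;
        while (cap < 2 * n) cap *= 2;               /* load factor <= 0.5 */
        uint64_t *table = calloc(cap, sizeof *table);
        if (!table) return (size_t)-1;
        size_t dups = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t key = (uint64_t)vals[i] + 1;
            size_t h = (size_t)(vals[i] * 2654435761u) & (cap - 1);
            for (;;) {
                if (table[h] == 0) { table[h] = key; break; }  /* first sighting */
                if (table[h] == key) { dups++; break; }        /* seen before */
                h = (h + 1) & (cap - 1);                       /* linear probe */
            }
        }
        free(table);
        return dups;
    }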


"There's no semantic reason my program isn't embarrassingly parallel but oh WTF are my libraries/frameworks doing? is it all a lie?" is apparently the pretty universal first experience trying to speed things up with threads. Hopefully we're working towards a future where single threaded legacies don't routinely stab the unwary in the back.

I remember in the early 2000s doubling the performance of a CPU-bound C++ program by using two threads, and my boss being shocked that I was able to do it. He had written it off as a naive idea. I was too young and inexperienced to understand why he was surprised; to me it just seemed like he didn't understand what I was doing. In retrospect, now I feel like maybe he had the right default expectation, and I got lucky that it worked out so easily.



