Detecting Node Event Loop Blockers (ashbyhq.com)
85 points by Ben-G on March 17, 2022 | 18 comments



Hi, blocked-at author here.

Get in touch over email if you want to explore further.

Some quick comments:

- I didn't see the Node version stated. The impact from async hooks is _vastly_ different between Node versions

- I need to update the package a bit, I've got a known perf improvement to add AFAIR.

- your implementation is likely to produce false positives. The trick for preventing (some of) those false positives is the most valuable part of the lib.

Also, take a look at the event loop utilization metric introduced last year https://m.youtube.com/watch?v=WetXnEPraYM&list=PL0CdgOSSGlBa...

And for more of my diagnostics experiments see debugging-aid package https://www.npmjs.com/package/debugging-aid


Hey! We (I work at Ashby) are on latest node 16. We'll update the post to reflect that. Happy to chat over email (edit: found your email and will reach out)


We could break blocked-at into two layers so that you could provide the data collection mechanism. You could report to datadog while using the false positive filter.
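
A hypothetical sketch of what that split could look like: a timer-based detection layer accepting a pluggable reporter (this is not blocked-at's actual API, and it omits the async-hooks stack capture and false-positive filtering; all names are made up for illustration):

```javascript
// Detection layer: measures how late a recurring timer fires and hands
// anything over the threshold to a caller-supplied reporter.
function createBlockDetector(report, { thresholdMs = 100, intervalMs = 50 } = {}) {
  let last = Date.now();
  const timer = setInterval(() => {
    const now = Date.now();
    const lagMs = now - last - intervalMs;
    if (lagMs > thresholdMs) report({ lagMs, at: now });
    last = now;
  }, intervalMs);
  timer.unref(); // don't keep the process alive just for monitoring
  return timer;
}

// Reporting layer: plug in any sink, e.g. a Datadog client instead of console.
createBlockDetector((evt) => console.log(`loop blocked for ~${evt.lagMs}ms`));
```

The point of the split is that the reporter is just a function, so swapping console logging for a metrics client doesn't touch the detection code.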

Also, there's a bunch of tools you could use before you deploy blocked-at or async hooks at all.

Note it's midnight here and I'll stop responding very soon :D


This is ultimately what killed cooperative scheduling in the '90s and early 2000s. It works at first, then it keeps working, and indeed, it keeps working for a long time. If your problem stays below that threshold, you'll probably be fine. But as you scale up, you will eventually hit the problem that some things sometimes run longer than you thought, sometimes a lot longer than you thought, and they lock the whole system up. You can maybe fix the first few, as the article does here, but the problem just gets larger and larger, and eventually you run out of optimization juice if your system has to keep growing.

MacOS prior to X was cooperatively scheduled across the whole OS, and it was definitely breaking down, which is why Apple needed such a radical change. You likely aren't cooperatively scheduling at that scale, which is why it continues to work for many programs.

In some sense, the problem is that every line of code you write is essentially an event loop blocker. Sooner or later, as you roll the dice, one of them is going to end up taking longer than you thought. It's the same basic forces that make it so that nobody, no matter how experienced, really knows how their code is going to run until they put it under a profiler. There's just too much stuff going on nowadays for anyone to keep track of it all and know in advance.

But it is just one of those things you need to look at when first sitting down to decide what to implement your new system in. If you run the math and you need the nth degree of performance and memory control, don't pick Python, for instance. If the profile of tasks you need to run is going to be very irregular and consume lots of CPU, and doesn't have easy ways to break it apart (or it would be too much developer effort to keep manually breaking these tasks apart), don't pick a cooperatively-scheduled language.

Fortunately, it's a lot easier than it used to be to try to take these particular tasks out and move them to another language and another process, while not having to rewrite the entire code base to get those bits out.


> MacOS prior to X was cooperatively scheduled across the whole OS, and it was definitely breaking down, which is why Apple needed such a radical change.

Preemptive multitasking was definitely one of the headline features of Mac OS X, but I think the other major one—protected memory—was the more important problem to solve with classic MacOS. Performance on classic MacOS was better for many common workloads than OS X, for several versions of the latter. And yes, playing a video while minimizing the window was a cool demo, and that capability is table stakes for an OS now. But the crashing bomb app demo was far more compelling for most of us Mac greybeards.

That said, cooperative concurrency in JS is a very different thing than in an OS. It’s not a panacea of course, but the typical APIs available are pervasively concurrent or trivial to make concurrent. And where that’s not enough (CPU bound workloads), all mainstream JS environments now support real threads. Granted that doesn’t mean JS is an ideal solution for CPU bound workloads… but if JS is already a prerequisite, it’s worth considering scaling out with threads before more expensive process spawning and IPC.


This is why Node.js is well suited for tiny microservices where a single person (or small group of people) can have context on every line of code running in the thread and can identify and fix such blockers. When the application grows and evolves beyond that, sometimes to the complexity of an operating system, it is going to be impossible to keep it performant for these reasons.


> nth degree of performance and memory control, don't pick Python, for instance.

Or be comfortable writing the critical parts in another language like Cython, which I've done, getting a 10,000x speedup because the hot part of the code fit in the CPU cache while the rest stayed plain old Python. There are times to start with something like Go when you're absolutely sure you'll need it, but I rarely regret starting in Python. It's usually easy enough to shift a single endpoint to another language, or to use extensions to call into something faster.


You will also rarely regret starting with Go. It's not really a language you would use for fine memory control and the best performance, so I'm not sure why you mention it in response to the text you quoted. Go is more of a good balance: low memory usage and good performance while being a very productive tool.


This is true. I'm a Rust fanatic myself, but I grudgingly admit that I wouldn't start a company with Rust. I just wouldn't trust future engineers enough. Whereas with Go, I'd never write it in my spare time, and I hate the paternalism and limitedness, but those same qualities make it an excellent choice as a company language.

If there's one thing I'd say about Go, it's that it's a language you can roll out across hundreds of junior engineers writing relatively sophisticated code and trust that you'll get a very respectable balance of (a) runtime performance, (b) developer productivity, and (c) safety (memory, thread[0], type, etc).

[0] OK, not in the strict way that Rust offers. But the simplicity and verbosity make it easy to spot errors, the standard lib offers excellent primitives for concurrent programming, and `go test -race` sweeps up most of the rest.


The only non-productive part is the endless juggling of Go module versioning.


I don’t know what you mean by that. The only situation where I spent time fighting with go dependencies is when people use vanity URLs, but that has nothing to do with modules. I haven’t experienced any issue with module versioning.


This is excellent and an answer to my deleted question on stackoverflow! https://stackoverflow.com/questions/69886637/measure-time-ta...


how come it was deleted?


When working on a green field project, does Node.js provide any advantages over, let's say, Go? In medium to big size teams, I find it tricky to keep the event loop free of blockers. In this regard, Go's goroutine model makes it easier to not block the whole app due to silly mistakes.


Our nodejs + python backends are all: multiple processes -> async/await nodejs/python event loops (used to be frp/streams). It's good to solve event loop bottlenecks, but more from an individual query's latency perspective than a machine utilization perspective: processes take care of that. Go seems great for hiring ambitious CPU coders, and Rust for ones making it safe, but the stdlib in both seems not there yet, so for most teams the perceived productivity increase is an illusion: it gets wasted on building unnecessary things. (We actually went back to Python for web just so we could use Django, as the Node world is still too manual.)

Conversely, anything 'heavy' is probably too slow in Go/Rust too! We use GPUs for anything data-intensive, and the Python ecosystem is basically the best for that, similar to C/Java in earlier CPU eras. (I can see that changing in 2+ years as Arrow progresses and more investment happens.)


In a small team, having the same language, libraries, and especially tooling in the frontend and the backend can be a superpower.

Other than that, I think it comes down to established tooling and team familiarity / preference. I do think there's no reason to start a green field Node.js project without TypeScript, though.


fyi to anyone seriously considering using these: neither is going to work as expected if something blocks the loop indefinitely. In other words, you won't know how long something blocked the loop until that thing has finished blocking. Timeouts for async code or limits on loop statements are still relevant.
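
This caveat is easy to demonstrate: a timer-based detector's timer simply cannot fire while the loop is blocked, so the lag is only observable after the fact (the busy-wait below is a hypothetical blocker):

```javascript
// Schedule a 10ms timer; it cannot fire during the synchronous block below,
// so it only reports the lag once the block has already finished.
const scheduled = Date.now();
setTimeout(() => {
  const lateBy = Date.now() - scheduled - 10;
  console.log(`timer fired ~${lateBy}ms late, after the block ended`);
}, 10);

// Hypothetical blocker: busy-wait ~200ms, starving every timer and callback.
const end = Date.now() + 200;
while (Date.now() < end) {}
```

If the `while` loop above never terminated, the timer would never fire at all, which is why timeouts and loop limits remain necessary alongside these detectors.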


If something is blocking the event loop indefinitely, it's unlikely to reach production, and even if it does, a simple crash report or perf inspection will reveal it. You don't need more than node-report for that case. A permanent event loop block is the simple case.

Regular event loop blocking by synchronous processing is the middle ground.

Performance issues with resource utilization, too many promises, or broken backpressure: that's where the fun begins.

If you can run your software locally and simulate load, just use node clinic. It's the best looking one ;)



