Fibers aren’t useful for much any more (microsoft.com)
93 points by mappu on Oct 11, 2019 | 58 comments



Some of the comments refer to the paper at http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p136..., which was the original submitted URL.


There are also (at least) two relevant followups to this paper:

Response to “Fibers under the magnifying glass”: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p086...

Response to response to "Fibers under the magnifying glass": http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p152...


"Response to 'Fibers under the magnifying glass'" from the authors of boost.fiber, at

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p086....

And Response to response to "Fibers under the magnifying glass", at

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p152...


The saga of stackful coroutines vs stackless coroutines vs zero-overhead coroutines has been going on in C++ land for a while now.

Gor (Nishanov) so far seems to be ahead, as stackless coroutines are part of the standard.


This is deeply weird for the conversation; IMO this should have been two separate stories.


It is best to treat the linked paper (Fibers under the magnifying glass) as a review of some implementations of fibers and not a repudiation of fibers in general. In particular, for this paper to apply to golang, several things would need to change:

Memory footprint: the paper says a fiber's user stack is 1 MB, so fibers have a memory footprint comparable to threads. This is not true for goroutines, which typically use a 4K stack (a rough way to check this is sketched at the end of this comment).

Context switching overhead: the paper gives per-architecture numbers, but goroutines do not use the expensive switching instructions listed there. Instead, golang basically saves just the PC, SP and DX registers, significantly reducing the overhead.

Dangers of the N:M model: the dangers mentioned (corrupting memory etc.) are specific to C++ libraries and do not apply to golang.

Dangers of the 1:N model do not apply to goroutines either.

My conclusion from the paper is as follows: fibers are bound to fail as an OS feature or as a library. To make fibers work you need to do what golang does, i.e. make them part of the language, with compiler support to reduce the context-switching overhead and the memory footprint. You will, however, pay a price in higher FFI cost. That is a tradeoff which may or may not work for you, depending on the nature of your application.
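
A rough way to sanity-check the footprint point above (my own sketch, not from the paper; the exact numbers vary by Go version and by what each goroutine actually touches): spawn a pile of parked goroutines and look at how much stack memory the runtime reports.

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        const n = 100_000

        var before, after runtime.MemStats
        runtime.GC()
        runtime.ReadMemStats(&before)

        var wg sync.WaitGroup
        stop := make(chan struct{})
        wg.Add(n)
        for i := 0; i < n; i++ {
            go func() {
                defer wg.Done()
                <-stop // park until the measurement is done
            }()
        }

        runtime.GC()
        runtime.ReadMemStats(&after)

        // StackSys is the memory obtained from the OS for goroutine stacks.
        fmt.Printf("%d goroutines, roughly %d bytes of stack each\n",
            n, (after.StackSys-before.StackSys)/n)

        close(stop)
        wg.Wait()
    }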


But if coroutines can't share memory without copying, they don't allow you to scale (in parallel on one task across cores efficiently).

What, then, is the purpose of coroutines as opposed to just having more machines?

OS threads can scale (ipootace^), but then you need proper concurrent data structures and a complex memory model below to support them.

Does Go have those?


If you watch just one video on Go (even if you don't plan to ever write Go and just want to learn about it), make it this one: https://blog.golang.org/concurrency-is-not-parallelism

It explains in detail how this is possible.


Goroutines can share memory without copying.
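
Right. A minimal example of what that looks like (my own sketch): hand a pointer across a channel and both goroutines operate on the same backing memory; the channel send/receive is what gives you the happens-before ordering, so no copy is needed.

    package main

    import "fmt"

    type Report struct {
        Rows []int
        Sum  int
    }

    func main() {
        // Allocate the data once; only the pointer crosses the channel.
        r := &Report{Rows: make([]int, 1_000_000)}
        for i := range r.Rows {
            r.Rows[i] = i
        }

        done := make(chan *Report)
        go func(rep *Report) {
            for _, v := range rep.Rows {
                rep.Sum += v // mutates the shared value in place
            }
            done <- rep // send/receive establishes happens-before
        }(r)

        result := <-done
        fmt.Println("sum:", result.Sum) // reads the memory the worker wrote
    }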


I only skimmed the article, but is the argument basically that fibers suck because library/DLL authors suck and use TLS instead of passing around an explicitly caller-initialized context pointer? So you can't freely use DLLs from your fibers because they might rely on TLS that isn't fiber-aware?

If so, that's a pretty weak argument. I use coroutines/fibers quite a bit in personal projects, but learned long ago for a variety of reasons to avoid depending on third party libraries I didn't have source for - especially ones that try to do too much magic like TLS behind the scenes just to save me the trouble of supplying an instance/context pointer to every call.

Usually when I'm using fibers it's so I can have many of them, which means I'm using tiny stack sizes, which means I'm not casually calling into third party libraries I can't easily audit and control anyways. If I weren't making many, I'd just use full-blown threads.
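
For what it's worth, the "explicitly caller-initialized context pointer" style I mean is roughly this (a generic sketch, not any particular library's API; the names are made up): the caller owns the state and threads it through every call, so nothing needs to hide in thread- or fiber-local storage.

    package main

    import "fmt"

    // Ctx is the caller-owned state that would otherwise end up in TLS.
    type Ctx struct {
        Name    string
        Scratch []byte
    }

    // Hidden-state style: relies on a global (in C/C++, often a thread-local).
    var current *Ctx

    func logHidden(msg string) { fmt.Println(current.Name + ": " + msg) }

    // Explicit style: the context travels with every call, so it behaves the
    // same no matter which thread or fiber happens to be running the code.
    func logExplicit(c *Ctx, msg string) { fmt.Println(c.Name + ": " + msg) }

    func main() {
        a := &Ctx{Name: "fiber-a", Scratch: make([]byte, 64)}
        b := &Ctx{Name: "fiber-b", Scratch: make([]byte, 64)}

        current = a
        logHidden("hidden state follows whoever set the global last")

        logExplicit(a, "explicit state follows the call")
        logExplicit(b, "each fiber just passes its own pointer")
    }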


> especially ones that try to do too much magic like TLS behind the scenes just to save me the trouble of supplying an instance/context pointer to every call.

I really hate that this is the solution that Rust ended up pursuing. There are claims that you can work around TLS by just not using TLS if it is not available, but I have yet to see someone remove TLS and still be able to use a multi-threaded executor.


The real solution without TLS will come, in my understanding, after generators improve. Specifically, they cannot currently take arguments on resume.


Using the heap as the sole savior of stackless coroutines just doesn't scale. I'm working on a game engine; we can have 10-100k+ jobs running per frame, and modern monitors have refresh rates of 144/240/300 Hz, so worst case: 30 million jobs per second! There is no time for the heap - it is just too slow.

So we need to preallocate everything for all systems, but hey, if you preallocate everything for the worst case, you are out of RAM (on PS4/Xbox/etc).

What we need is a mix: scratchpad memory that lives longer than the stack, but costs as little as the stack. A tagged heap with a stack/arena allocator comes close performance-wise but not ergonomics-wise.

The ergonomics of writing code with such constraints are very painful. The stack is the most ergonomic/fastest scratchpad you can have, and as soon as you have async/await/etc in the middle of your function, you need to think about unwind/rewind of every stack variable.
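
For the curious, the "scratchpad that lives longer than the stack" idea is basically a bump/arena allocator: grab one block up front, hand out pieces by bumping an offset, and reset it all at a frame boundary. A minimal sketch of the idea (in Go only to match the other snippets in this thread; in the engine it would be C++):

    package main

    import "fmt"

    // Arena is a fixed-size bump allocator: an allocation is a bounds check
    // plus an offset bump, and freeing everything is a single Reset.
    type Arena struct {
        buf []byte
        off int
    }

    func NewArena(size int) *Arena { return &Arena{buf: make([]byte, size)} }

    // Alloc returns n bytes of scratch, or nil if the frame budget is gone.
    func (a *Arena) Alloc(n int) []byte {
        if a.off+n > len(a.buf) {
            return nil // out of scratch; the caller must handle this
        }
        p := a.buf[a.off : a.off+n]
        a.off += n
        return p
    }

    // Reset discards every allocation at once, e.g. at the end of a frame.
    func (a *Arena) Reset() { a.off = 0 }

    func main() {
        frame := NewArena(1 << 20) // 1 MiB of per-frame scratch
        for f := 0; f < 3; f++ {
            job := frame.Alloc(4096) // per-job scratch, no per-job heap work
            fmt.Printf("frame %d: handed out %d bytes, %d used\n", f, len(job), frame.off)
            frame.Reset()
        }
    }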


Ah, thanks, so is the problem cache misses?

Java had to rewrite the memory model of the whole JVM when they introduced the concurrent package.

Even C++ has a hard time making good use of concurrency and threads because of locking.

My hunch is that because of the way memory works there is no benefit for non GC languages when implementing threads and concurrent memory over many cores.


> My hunch is that because of the way memory works there is no benefit for non GC languages when implementing threads and concurrent memory over many cores.

I've been pretty happy with Rust on multi-core servers, probably because Rust never allows mutable memory to be seen from more than one place at a time.

I've been using async Rust at work, which turns out to be particularly interesting, because most of the async executors for Rust can actually move suspended async functions between cores. So you might have dozens and dozens of async routines running across an 8-CPU pool.

There was definitely a bit of a learning curve involved, but we've only ever encountered a single concurrency-related bug, which was a deadlock. (Rust protects against memory corruption, undefined behavior and data races, but not deadlocks.)


Ok, but when you say that they can move between cores, isn't that the OS moving them?

If rust does not allow concurrent memory reads, then that's a big problem in my world.

It seems to me we have a collision between "data driven" and "parallelism on the same memory" in terms of progressing on performance. In that case, since I can get parallelism with memory, execution safety and hot-deployment on a VM, the data-driven approach does not fit my server-side performance needs to the point where I'm able to give those other features up.

On the client, I'm all for C with arrays though.


Disclaimer: I've done some reading on Rust's concepts but not yet done any substantial coding. Corrections welcome :)

I'm not informed enough to comment on how async works exactly in Rust. However:

> If rust does not allow concurrent memory reads, then that's a big problem in my world.

Note that this is not the goal of the borrow checking system of Rust. You can read memory concurrently just fine (given immutable references to something), you're just not allowed to write to it while you're doing that.

Basically, references in Rust come as immutable (like `const` in C) and mutable, and you're only allowed to have multiple immutable or one mutable reference to the same thing at a time. If you have a mutable reference, you can derive multiple immutable ones from that, but the borrow checker will prevent you from accessing the mutable reference as long as one of the immutable ones is still active (which Rust manages with the concept of "lifetimes").


This paper is rather biased, in that the downsides to stackless coroutines are not mentioned, namely a more complicated control flow and associated increased difficulty of debugging.


So I read through this, and the conclusion was to not use fibers. However, most of the reasoning seems to surround issues with things like TLS, allocators, and stack memory usage in C++. There is no explicit recommendation here to not use goroutines for scalable, concurrent software as far as I can tell, just to not use fibers.


Goroutines and fibers are the same thing: M:N threading.

It is true that many of the fiber issues presented here are C++-specific. However, what a lot of the comments here are missing is that C++ issues have a way of becoming your issues whenever you use an FFI, even if you aren't using C++. Go's solution is generally to try to avoid using cgo as much as possible, because of these performance issues. That can work for the areas Go is generally used in today. But, as the article points out, that does not work for all applications. For example, I would not want to write graphics code in any system with M:N threading due to FFI cost, including Go.


Please give an example from the document that applies to Goroutines. (The best I can see is the bits about issues with split stacks, but it was resolved.) I think my reading of the document holds up.


Not only that, but on context-switching overhead, the numbers provided are not appropriate to golang, because Go's function calling convention assumes fewer registers need to be saved, thus reducing context-switching overhead.

From: https://codeburst.io/why-goroutines-are-not-lightweight-thre...

"In Go, this means only 3 registers i.e. PC, SP and DX (Data Registers) being updated during context switch rather than all registers (e.g. AVX, Floating Point, MMX)"

Perhaps a better title would be "Fibers require compiler / language support to be viable"


Section 3.6, page 8, talks about the FFI overhead of Go.


Sure, but if that is truly the only part of the document that contains reasoning to not use goroutines, I can’t imagine how one could read the conclusion as suggesting goroutines are unsuitable for scalable software. In fact, I’ve now worked at multiple companies doing exactly this in Go. With Docker it was often preferable to explicitly disable CGo. It would be abnormal in say, C#, to dock points because of C interop.

It’s also worth noting that FFI is not the only way to have Go and C++ interop. For many use cases a lightweight RPC layer between two apps will give better throughput, something that also is done in production to great effect.
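
For a concrete idea of scale, the Go side of such a layer can be tiny (a sketch using the standard net/rpc/jsonrpc package; the listen address and the Calc.Add method are made up, and the process on the other end, C++ or otherwise, just needs to speak JSON-RPC 1.0 over the socket):

    package main

    import (
        "log"
        "net"
        "net/rpc"
        "net/rpc/jsonrpc"
    )

    // Args and Reply define the wire format for one RPC method.
    type Args struct{ A, B int }
    type Reply struct{ Sum int }

    type Calc struct{}

    func (Calc) Add(args Args, reply *Reply) error {
        reply.Sum = args.A + args.B
        return nil
    }

    func main() {
        if err := rpc.Register(Calc{}); err != nil {
            log.Fatal(err)
        }
        ln, err := net.Listen("tcp", "127.0.0.1:7000")
        if err != nil {
            log.Fatal(err)
        }
        log.Println("serving Calc.Add on", ln.Addr())
        for {
            conn, err := ln.Accept()
            if err != nil {
                log.Fatal(err)
            }
            go jsonrpc.ServeConn(conn) // one goroutine per connection
        }
    }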


> It would be abnormal in say, C#, to dock points because of C interop.

Not really. WinForms is a lot of the reason for C#'s existence, and WinForms is just a wrapper around pinvoke'd Win32. You're crossing the boundary a lot.

> For many use cases a lightweight RPC layer between two apps will give better throughput, something that also is done in production to great effect.

I have a hard time believing that RPC can possibly be faster than cgo. You have the overhead of message serialization and deserialization, two message copies (into the kernel and out of the kernel), two context switches, and a trip through the OS scheduler.


> Not really. WinForms is a lot of the reason for C#'s existence, and WinForms is just a wrapper around pinvoke'd Win32. You're crossing the boundary a lot.

That is an implementation detail. I also don’t know many who consider WinForms to be particularly high performance.

(Additional note: though I have not explicitly said it prior, I believe that PInvoke actually was quite slow for a long time, at least certainly during the WinForms era. For all I know, it might still be.)

> I have a hard time believing that RPC can possibly be faster than cgo.

I can’t find a solid reference, but the issue is that Cgo is simply not ideal for heavy applications. It makes scheduling slower. If you are doing expensive work in C++, such as phoning out to the network or decently heavy computation to the point where Cgo overhead is not the concern, then you are unlikely to have much issue with the cost of RPCs. If you are doing tiny amounts of work with no IO, one must wonder why you would not just port those bits to Go.

(Example of scheduler issue: https://github.com/golang/go/issues/19574)

I continue to contend that considering this to be a show stopper is unfair, or at least not very honest.


> Not really. WinForms is a lot of the reason for C#'s existence, and WinForms is just a wrapper around pinvoke'd Win32. You're crossing the boundary a lot.

I wonder if that's why WPF does so very much on the managed side. And I wonder if using UWP XAML from C# is less efficient than WPF in some scenarios because of this FFI overhead.


According to the React Native for Windows team it is hardly noticeable.

On their benchmarks comparing XAML/C++, XAML/C#, RN and Electron, it is hardly a few percent more than C++.

It is Electron whose performance loss goes sky-high.


We have already forgotten that Sun spent a lot of time on M:N threading and abandoned it.

Linux seems to have gone the other direction, supporting thousands to tens of thousands of threads.


For a small window of time, after everybody agreed that LinuxThreads needed rewriting, one of the most promising candidates was an M:N library (from IBM I think, I forgot the name). In the end NPTL won and the rest is history.


It was called NGPT.


"Fibers (sometimes called stackful coroutines or user mode cooperatively scheduled threads) and stackless coroutines (compiler synthesized state machines) represent two distinct programming facilities with vast performance and functionality differences."

It's pretty clear that fibers cover goroutines, green threads, M:N threads, etc.; this is just the Microsoft-specific name for them.


It does, yes. However, they still do not seem to be making the claim that goroutines are a bad idea. The name conflation is unfortunate, but almost all of the problems are C++-specific, and their conclusion fails to be precise about this.


The paper states that fibers, including goroutines, are a bad idea. The part that is not specific to C++ is the FFI cost (cited as 160 ns in the paper).


We already have a thread discussing this. If this is truly a problem, then I shall ask why C’s Go interop is so bad - if it were better, we’d have access to a much nicer TLS library!


The headline is inaccurate. The paper doesn't say goroutines are inappropriate; it just says what Go does for goroutines doesn't fit what C++ needs. The author has studied Go and knows more than average about it, but is still modest enough not to make any judgement. Let us, the readers, respect that.


It seems like the determination was made based on how they interop with C++ libraries, not that they were inappropriate for every situation.


Question for Win32 experts: does anybody know how a DLL receives thread notifications? Is there any way to make an EXE get the same thing directly (even if it's undocumented)? It's a little weird for me because DLLs are loaded in user-mode -- why can't an EXE request the same notifications?


DllMain gets called with a reason code. The OS loader calls it by enumerating the loaded DLLs. It won't call anything in the EXE because that's how it was coded, and there's nothing the EXE can do about the loader specifically short of patching OS code.


I'm wondering what happens before all this -- how is the OS loader even notified about the new thread? Is it the thread entrypoint itself that tells the OS loader about the new thread's creation (and destruction)?


Both the loader and thread management are part of the OS. It's an implementation detail, but I'd expect CreateThread to do it - perhaps by delegating to the loader, perhaps by navigating the loader's list of loaded modules, whatever.

See these pages:

https://docs.microsoft.com/en-us/windows/win32/api/processth...

https://docs.microsoft.com/en-us/windows/win32/dlls/dllmain


The thing about CreateThread doing it is that then a thread created in a different manner (CreateRemoteThread from another process, RtlCreateUserThread, etc.) would cause a missed notification. I feel like it has to be the entrypoint, but not sure...


Sure, but when I say CreateThread I mean the implementation of CreateThread, not the function CreateThread.

(This feels like a weird autistic conversation, I'm going to step out now.)


I don't think people using monadic concurrency in Haskell or OCaml will agree with this.


IO in Haskell (and presumably Lwt in OCaml, though I'm less familiar with it) doesn't have much to do with fibers.

An IO value is just something that the Haskell runtime is able to invoke somehow. Haskell functions can not directly run IO values (ignoring unsafePerformIO). A fairly elegant implementation would probably simply make IO values be asynchronous operations (procedures that take an "on complete" callback that receives the result)—again, since there's no way for a Haskell function to actually run the operation, all it can do is return such an operation to the runtime to be called.


Monads don't have much to do with it, but the GHC runtime uses fibers for concurrency, no?


Monadic code can be polymorphically async-or-not in a safe way rather than just making all your yield points invisible.


This is a wildly editorialized and misleading title. I clicked through just to see how it was going to rationalize the fact that people demonstrably have been building scalable concurrent Go software, with goroutines, at truly huge scale. But of course, the paper says nothing of the sort; it makes an aside about how an earlier design of the Go runtime was less scalable than the current one, and that's it.

This is a textbook example of why people shouldn't editorialize titles.

The right title here is "Fibers Under A Microscope".


> But of course, the paper says nothing of the sort; it makes an aside about how an earlier design of the Go runtime was less scalable than the current one, and that's it.

The paper makes two references to Go: one to talk about split stacks, and one to reference the FFI (specifically, cgo). Cgo absolutely still has large overhead. That was true when the paper was written, and it's true today. It's inherent to the M:N small-stack design that FFI calls which need a big stack require switching from the small stack to a big one. You cannot get that overhead down to zero; it's a fundamental tradeoff of the design.

Go's solution here is to try to minimize the amount of FFI usage. As the paper points out, that may work for Go, but will absolutely not work for many other use cases.
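
For reference, the boundary in question is just this (a minimal cgo sketch; it needs CGO_ENABLED=1 and a C toolchain to build): every call through C.* leaves the goroutine's small stack for a system stack and tells the scheduler about it, and that per-call cost is there no matter how trivial the C function is.

    package main

    /*
    // Deliberately trivial: the cost being discussed is the Go -> C
    // transition itself, not the body of the C function.
    static int add(int a, int b) { return a + b; }
    */
    import "C"

    import "fmt"

    func main() {
        // Each C.add call switches off the goroutine stack and back,
        // which is why tight loops over cgo calls are discouraged.
        sum := C.add(C.int(2), C.int(3))
        fmt.Println("2 + 3 =", int(sum))
    }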


A weird hidden benefit is that Go ends up having a lot of fundamental libraries reimplemented in Go itself. And as a user, it can be convenient to have all your batteries in Go. Assuming someone else does the hard work, obviously.


Given the amount of subtle bugs that this has caused and continues to cause in Go, I don’t really consider this a benefit.


I just changed it and then came here to see this comment.

Submitted title was "Microsoft: fibers/goroutines inappropriate for scalable concurrent software[pdf]".

Submitters: "Please use the original title, unless it is misleading or linkbait; don't editorialize." This is in the site guidelines (https://news.ycombinator.com/newsguidelines.html).


Sorry, I agree it's not a great title. I was worried the paper's title was misleading (just sounds like textiles), so I chose a representative statement from the author's abstract + conclusion.

The link comes via https://devblogs.microsoft.com/oldnewthing/20191011-00/?p=10... which summarized the PDF as """a fantastic write-up of the history of fibers and why they suck. Of particular note is that nearly all of the original proponents of fibers subsequently abandoned them [...] fibers are basically dead""".

By restricting itself to the TIOBE top 10, the paper also misses a discussion of BEAM, which successfully offers N:M threading.


Oh, in that case let's just switch the URL to that blog post and let it make its point directly.

Changed from http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p136... above.


I had to scroll through all the comments to find this to understand why the comments don't match the article. I think you should have left it; now it's very confusing.


That's why I posted https://news.ycombinator.com/item?id=21230286 and pinned it to the top.

Experience has shown that switching to a better URL is generally better for discussion, though there can be a lag before the thread catches up.


I am far from being an Erlang expert. But is not the Erlang concurrency model also N:M threading?

If so, it is also a successful use of fibers/green threading in a highly concurrent environment.


From what I understand, Erlang has a naive memory model, so whatever a process does, it cannot share memory efficiently.

It's made to handle 1000s of 1-to-1 calls that all fit individually on one core, not one 1000-to-1000 call that needs to use all cores at the same time.

Parallelism without an "intra-CPU and inter-core" forced scope has never been hard: just spin up more machines.

Or to put it this way: "if you don't need parallelism, and memory speed between the parallel parts, enough to encounter cache-miss problems, could you just as well use separate computers?"

If the answer is yes then fibers and coroutines are meaningless.

This is the problem of this decade: if we can't solve the bottleneck of cache misses when sharing memory (in parallel on one task across cores efficiently), we have no reason to try and scale the number of cores at all.

And if that's true, we have hit peak Moore's law for transistor computers: in performance, years ago, and in energy efficiency, probably around 8 nm.

Just to illustrate the problem one last time: Naughty Dog converted its engine to fibers to allow 60 frames per second, but the controller-input-to-frame latency increased by many frames (because the multiple cores cooperating on each frame meant they had to push the frame back, since memory is slow), so the benefit was actually reduced.

You get lots of smooth bells and whistles (that look good when you don't play the game), but the only meaningful metric (how fast the character acts on your reactions, which is the definition of gameplay) actually degrades.

That said, I'm looking forward to TLoU 2 as much as the next guy, but I'm not expecting any technical improvements except visuals.



