The Go runtime scheduler's way of dealing with system calls

protomyth · on Dec 8, 2019

https://marc.info/?l=openbsd-cvs&m=157500930922882&w=2

For dynamic binaries, we continue to to permit the main program exec segment because "go" (and potentially a few other applications) have embedded system calls in the main program. Hopefully at least go gets fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous reasons, as we notice the ecosystem has many of static-syscall-in-base-binary which are dynamically linked against libraries which in turn use libc, which contains another set of syscall stubs. We've been concerned about adding even one additional syscall entry point... but go's approach tends to double the entry-point attack surface.

https://marc.info/?l=openbsd-tech&m=157488907117170&w=2

[edit for convenience of readers - read the above linked thread - I just grabbed the go part]

Unfortunately our current go build model hasn't followed solaris/macos approach yet of calling libc stubs, and uses the inappropriate "embed system calls directly" method, so for now we'll need to authorize the main program text as well. A comment in exec_elf.c explains this.

If go is adapted to call library-based system call stubs on OpenBSD as well, this problem will go away. There may be other environments creating raw system calls. I guess we'll need to find them as time goes by, and hope in time we can repair those also.

[/edit]

Reelin · on Dec 8, 2019

Related (AFAIU): https://news.ycombinator.com/item?id=21653119

> We've been concerned about adding even one additional syscall entry point

I don't understand the need for such a severe "only libc syscalls ever" approach.

What would be the security concern with allowing syscalls only from preauthorized (ie msyscall(2)) regions, making initial region authorization opt-in (instead of opt-out), allowing the program to call msyscall(2) itself, and rejecting any statically linked (ie non-ASLR'd) regions for authorization?

masklinn · on Dec 8, 2019

> I don't understand the need for such a severe "only libc syscalls ever" approach.

There's nothing severe about it. Most systems are exactly that: systems of which the kernel is only one part, syscalls are rarely if ever intended to be called directly nilly-willy.

The issue is that unlike windows unices have never enforced this.

amscanne · on Dec 8, 2019

Sorry, that generalization doesn’t hold water.

It makes sense for systems where libc is tightly coupled and coversioned with the kernel, e.g. BSDs, but Linux always relied on third-party C libraries and supported static binaries, etc.

You could argue that BSD made the mistake of intending to have a Windows-style C library compat guarantee but not enforcing it, but that was not in scope for Linux. The philosophy has always been syscall-level compat (and there are lots of famous threads with Linus re-enforcing this to others who would presume that things should be “fixed in user space”).

So it’s hardly reasonable to generalize based on some BSD concerns; Linux is WAI and represents the most common Unix-like system people use today by far.

There’s a pretty good argument that this level of compat, while the source of some problems, has also made other things much easier: consider container images that are bundled with their own system libraries. (You could certainly invent schemes to inject these libraries, but dealing with link and library level compatibility seems even more complex to deal with than system call-level compatibility.)

pcwalton · on Dec 8, 2019

Darwin/macOS has the same rules as Windows and the BSDs--syscalls are private API--and it's extremely popular due to iOS. Linux is in fact the odd one out here.

giovannibajo1 · on Dec 8, 2019

There is a difference though: libSystem on Darwin is a very thin wrapper over the kernel syscalls; on the contrary, libc is a library that was designed for C, then standardized in POSIX, and has several layer of abstraction over kernel syscalls including many bad defaults that are universally recognized as wrong today (eg: libc’s created file descriptors will all inherit by default).

loeg · on Dec 8, 2019

Go isn't obligated to use any libc APIs or abstractions other than those providing syscalls.

You're incorrect or maybe just misleading about libc created file descriptors inheriting, as stated. Either way, it is unrelated to using libc for syscalls vs bare machine traps.

giovannibajo1 · on Dec 9, 2019

I think it's mandated by POSIX standard; but even if I'm wrong, there's still the problem that libc doesn't allow to do that atomically, for instance. In general, it's an old interface that doesn't fully expose the full power of all modern syscalls.

loeg · on Dec 9, 2019

You're mistaken.

https://news.ycombinator.com/item?id=21740035

https://news.ycombinator.com/item?id=21740498

tedunangst · on Dec 9, 2019

Most libc you'd care to use are going to include functions like dup3. If they don't, it's probably because the system call isn't available either.

cesarb · on Dec 8, 2019

> (eg: libc’s created file descriptors will all inherit by default)

Isn't that the kernel default? Even if you use system calls directly, file descriptors still inherit by default.

loeg · on Dec 9, 2019

Yeah, libc's syscall wrappers just do what you tell them. If you don't pass O_CLOEXEC to the kernel syscalls, you get the inherit behavior. Libc's syscall wrappers don't change this in any way.

To the extent that Go's default for file descriptors today is !inherit (I'm unfamiliar, but if so, it's a good choice), the Go runtime must already add O_CLOEXEC to bare syscalls. There's no reason to believe it incapable of adding the flag to libc syscalls instead.

giovannibajo1 · on Dec 9, 2019

You can't do that atomically with libc. There's a short window in which the file descriptor will potentially be inherited, if another thread forks.

jlokier · on Dec 9, 2019

That's incorrect.

You are thinking of the older way, where fcntl(fd, F_SETFD, FD_CLOEXEC) must be used after open(), leaving a short window in which the file descriptor may be inherited.

The newer way passes the O_CLOEXEC to open() and there is no fcntl() call. This is atomic with respect to inheritability: The kernel returns a non-inheritable file descriptor to libc, and libc returns it to the application.

Other syscalls that return a file descriptor have similar flags, so they are atomic too.

These flags and behaviours are exactly the same, whether done by calling through libc as most programs do, or direct kernel syscalls bypassing libc, as Go and a few other programs do.

loeg · on Dec 9, 2019

Unfortunately, you misunderstand how CLOEXEC works and how the Go runtime implements the feature you think libc lacks.

This syscall level behavior is POSIX-specified[1] since at least the 2008 edition[2]:

> O_CLOEXEC > If set, the FD_CLOEXEC flag for the new file descriptor shall be set.

What that means is, any C program or Go program that passes the O_CLOEXEC flag to open(2) on a POSIX 2008 conforming system (including Linux and the BSDs, for example), will atomically create the fd without inherit behavior. There is no "short window" and hasn't been for more than a decade. The Go runtime must use that flag to provide that property; there is no other way on these systems. Libc users are of course able to use the same flag.

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/o...

[2]: https://pubs.opengroup.org/onlinepubs/9699919799.2008edition...

Gaelan · on Dec 8, 2019

I mean, theoretically (i.e. I have no idea if anything does this), you could have underlying system file descriptors, which did not inherit, then a mapping from "libc" file descriptors onto OS ones and some code in fork() to copy over any OS file descriptors that are exposed to libc-using code.

loeg · on Dec 9, 2019

The mapping is the identity function and whether a descriptor is inherit or not is just a function of O_CLOEXEC / fcntl(FD_CLOEXEC) / some special fd types are always cloexec, such as kqueues on FreeBSD. Libc fds aren't special to the operating system in any way.

typical182 · on Dec 8, 2019

For Go on Darwin:

libSystem is now used when making syscalls on Darwin, ensuring forward-compatibility with future versions of macOS and iOS.

From https://golang.org/doc/go1.12#darwin

loeg · on Dec 8, 2019

This is mentioned obliquely in the top-level comment for this thread:

> solaris/macos approach … of calling libc stubs

barrkel · on Dec 8, 2019

libc is a C runtime library, while kernel32 is not.

justincormack · on Dec 8, 2019

The plan for containers on Solaris, after rejecting injecting libc from the host, was to have users rebuild all containers after OS upgrade. Windows has to virtualise containers with an incompatible OS version. It is definitely less convenient than Linux.

OpenBSD does have somewhat different constraints and they seem to think this will work for them.

binarycrusader · on Dec 8, 2019

The plan for containers on Solaris, after rejecting injecting libc from the host, was to have users rebuild all containers after OS upgrade.

What containers are you referring to? Because this is definitely not how Zones work on Solaris.

There are two types of Zones in Solaris 11; "Kernel Zones" which run their own independent version of Solaris and "non-global Zones" which are automatically kept at the same version as the host.

Windows has to virtualise containers with an incompatible OS version.

Not as far as I'm aware. Windows Sandboxes don't work that way nor do other technologies I'm aware of. What are you referring to?

justincormack · on Dec 15, 2019

This was Docker containers, which were under development, and I think one version shipped.

Windows containers have to run under hyper-v for incompatible kernel versions.

xenadu02 · on Dec 8, 2019

The iOS Simulator does something along these lines for macOS. When macOS is updated CoreSimulator rebuilds the dyld_sim shared caches because simulator runtimes pull in libsystem_kernel, libsystem_pthread, and libsystem_platform (which cover the core of the kernel’s ABI).

masklinn · on Dec 9, 2019

> It makes sense for systems where libc is tightly coupled and coversioned with the kernel, e.g. BSDs, but Linux always relied on third-party C libraries and supported static binaries, etc.

Linux is not a system. Linux is a kernel, linux has distributions, it is not a single coherent system where the kernel and standard library are co-developed. That's the entire point I'm making.

> So it’s hardly reasonable to generalize based on some BSD concerns

It's not "some BSD concerns", it's pretty much every non-linux unix. What's not reasonable is generalising "syscalls are a perfectly fine interface" which is almost exclusively a Linux exclusivity. Or don't claim compatibility with anything other than linux, that's also a perfectly fine choice.

> There’s a pretty good argument that this level of compat, while the source of some problems, has also made other things much easier: consider container images that are bundled with their own system libraries.

Last time I checked, Go did not run exclusively on linux. If it did, raw syscalls would indeed not be a concern (though even then they try to have their cake and eat it, as e.g. they want to do raw syscalls yet benefit from vDSO, which has been an issue in the past because their assumptions did not hold: https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/)

Reelin · on Dec 8, 2019

Perhaps "strict" would have been a better word choice than "severe".

It does seem that there's a very reasonable security concern about doing so from +w+x memory, or from non-PIE regions, or without first explicitly authorizing the calling range.

My question still stands regarding what the security concern of opt-in PIE -w+x code making direct syscalls is.

Edit: (Of course I do understand that the BSDs (unlike Linux) do not guarantee a stable syscall ABI and as such performing them directly is strictly a bad practice.)

jchw · on Dec 8, 2019

Windows did not enforce this either, and despite how bad of an idea it is, there have been software that do their own syscalls. Mostly tricky things like DRM or anticheat.

Still, I don’t think there’s anything wrong with letting an application mark part of its code safe for syscall execution, versus enforcing libc only. Seems like the exact same thing as the execute bit. Moreover, some systems genuinely have a stable syscall ABI - I think Linux would be considered one.

masklinn · on Dec 9, 2019

> Windows did not enforce this either

Windows pretty much enforces it in the sense that syscall numbers can change between minor updates, so raw syscalls breaks extremely often.

> Moreover, some systems genuinely have a stable syscall ABI - I think Linux would be considered one.

Linux is not a system. It's a kernel, with userlands you can bolt on. That's why it has a stable syscall ABI: that's the only interface Linux can have if it intends to provide an interface.

loeg · on Dec 9, 2019

Linux is basically the only system that targets a stable syscall ABI, and that's basically due to being loosely coupled with all other parts of a Linux-based operating system.

binarycrusader · on Dec 8, 2019

The issue is that unlike windows unices have never enforced this.

That's not true; Solaris and I suspect other *nixes of the past made it explicitly clear that libc was the interface for userland, not the kernel.

matheusmoreira · on Dec 9, 2019

> syscalls are rarely if ever intended to be called directly nilly-willy.

I don't think this is true, at least on Linux.

On Linux, the system call interface¹ is the documented interface to user space. Even the commonly used vDSO² is a stable interface. This is important because it means the popular C libraries are not part of the Linux kernel interface. Although glibc is often portrayed³ as some kind of Linux kernel wrapper, they are entirely separate projects. Linux manuals⁴ also make it seem like they are one and the same:

> The Linux man-pages project documents the Linux kernel and C library interfaces that are employed by user-space programs.

These same manuals also document systemd as if it was part of Linux. I went there expecting low level documentation useful for writing one's own init system and got systemd documentation instead. It's very confusing in my opinion. Why are external projects documented in the Linux manuals?

Anyway, these kernel features are used by C libraries to implement all their functions. Using C libraries is the traditional way to build a Linux user space but it is certainly not the only way. Compilers could emit these system calls directly, avoiding the need for a runtime library. A programming language virtual machine could be built directly on top of Linux system calls. It is possible to create freestanding programs that run on Linux with zero dependencies.

Incompatibilities are caused by these user space libraries, not by the system calls themselves. For example, glibc maintains a lot of thread local state and will not work correctly if the program calls clone(). A program that does not link to glibc does not have this limitation.

Although low level, Linux system calls are in many ways a simpler interface: their behavior is more precisely documented compared to POSIX; there is no need to deal with errno; there is no hidden C library functionality that's hard to understand; freestanding programs do not contain references to hundreds of hidden standard library symbols that implement obscure functionality.

The kernel itself containd a nolibc.h header[5] that it apparently uses for its own tools.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

[3]: https://en.wikipedia.org/wiki/File:Linux_kernel_System_Call_...

[4]: https://www.kernel.org/doc/man-pages/

[5]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

masklinn · on Dec 9, 2019

> I don't think this is true, at least on Linux.

My assertion pertains to systems. Linux is not a system.

> Even the commonly used vDSO² is a stable interface.

https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/

> [snip completely irrelevant everything else]

ori_b · on Dec 8, 2019

> What would be the security concern with allowing syscalls only from preauthorized (ie msyscall(2)) regions

Who preauthorizes them, and what prevents more from getting added?

Reelin · on Dec 8, 2019

Currently ls.so upon linking in libc; invoking msyscall() is only permitted once per process.

Seeing as msyscall() is itself a system call, any calls to it must themselves originate from an authorized region.

ori_b · on Dec 8, 2019

Correct. So, now, if you want to allow any program, and presumably its libraries, to register their regions of memory as being syscall-worthy you need to lift the called-once restrictions, which opens a hole for an attacker.

Reelin · on Dec 9, 2019

> you need to lift the called-once restrictions, which opens a hole for an attacker

I don't see how that's the case; my original question was essentially asking what that security hole is. Provided that syscall ability is opt-out by default and that only code subjected to ASLR is permitted to be authorized, it doesn't seem terribly risky to allow additional such regions to be registered. An exploit has to contend with ASLR either way; either by locating libc, or by locating some other authorized region within the current process.

ori_b · on Dec 9, 2019

ASLR isn't a complete solution. It's not that hard to find libc, so this is just another hurdle, not a full barrier. You're proposing weakening the barrier.

Reelin · on Dec 10, 2019

I'm not proposing anything, I'm asking for a concrete explanation of the supposed security hole. I agree that ASLR isn't a complete security solution and never implied otherwise.

AFAIU, the entire security benefit here is due to ASLR alone. If an exploit manages to track down libc, it can go right ahead and make all the system calls it wants. (Unless there's some other piece to the puzzle that I've missed? Is there something special about libc in particular?) As such, I still don't understand how the called-once restriction is supposed to meaningfully increase security - by the time you've found the msyscall() function, you've also found _all the others_ anyway.

ori_b · on Dec 11, 2019

> AFAIU, the entire security benefit here is due to ASLR alone. If an exploit manages to track down libc, it can go right ahead and make all the system calls it wants

It has to create the appropriate gadgets to generate function call sequences, and generating gadgets is hard.

ianlevesque · on Dec 8, 2019

I think the idea here is to just let the linker do it.

ori_b · on Dec 8, 2019

Which is why go is such a problem -- because the linker can't do it without hacks and guesswork, in the case of a static binary that dlopens libc.

I'm aware of the issues. I was a part of the discussions in the hut when they were being implemented.

asveikau · on Dec 9, 2019

That recent change in OpenBSD is indeed interesting, however, this doesn't have much to do with how go handles scheduling of goroutines, other than the fact that the words "go" or "syscall" appear in both places.

komuW · on Dec 8, 2019

> Unfortunately our current go build model hasn't followed solaris/macos approach yet of calling libc stubs, and uses the inappropriate "embed system calls directly" method,

Go, as of version 1.12,uses libSystem in Darwin to make syscalls.

1. https://golang.org/doc/go1.12#darwin

lonelappde · on Dec 8, 2019

Does this mean that they want to ban syscalls from everything but approved "fat client" libraries like libc? (And perhaps ban versions of libc that have bugs?) How is that implemented? I guess it's by only allowing syscalls if the calling code is in a special part of memory, and the OS can gatekeep access to that memory?

Reelin · on Dec 8, 2019

My understanding is that system calls can currently only be made from -w+x regions; attempting otherwise results in the process being killed.

The idea is to extend this protection to only allow system calls from expected address ranges, so that a successful exploit can't simply make raw calls but instead has to track down an existing authorized one (and thus contend with ASLR). To that end, the new call-once syscall msyscall(2) is added. The linker uses it to register libc.so with the kernel after randomly mapping it into the current process.

loeg · on Dec 9, 2019

> Does this mean that they want to ban syscalls from everything but approved "fat client" libraries like libc?

Yes.

> (And perhaps ban versions of libc that have bugs?)

There is only one OpenBSD libc.

jnwatson · on Dec 8, 2019

This scheduler is probably the most salient feature of Go, but is only indirectly described in the language specification.

Perhaps it is just me, but it seems all this user space rigamarole to map bits of execution onto cores points to an overall architecture “smell”. This should be performed and enabled by the OS.

You can see the seams between the OS and the go runtime tear a little whenever a library acquires an ownership lock where the thread id is recorded. In Go, computation moves freely between threads, so that lock doesn’t work (at least without special instructions to the runtime to lock that goroutine to a thread).

The whole POSIX threading model seems broken in this context.

kccqzy · on Dec 8, 2019

User-space threading is not broken. Windows even directly provides support for user-space scheduled threads[0]. The whole model isn't broken; rather, it's liberating. Once the application programmer gets rid of the idea that threads are expensive and starts creating thousands of them willy-nilly, these applications often benefit from a much simpler architecture and fewer bugs. All these complexities are pushed into the user-space scheduler. It's worth it.

[0]: https://docs.microsoft.com/en-us/windows/win32/procthread/us...

pcwalton · on Dec 8, 2019

I agree. Paul Turner at Google did a presentation at LPC in which he presented an alternative model that actually uses OS threads: https://blog.linuxplumbersconf.org/2013/ocw/system/presentat...

Unfortunately the work seems to have stalled out and never made it into the kernel. If that work actually makes it into the Linux kernel, then other languages like C++ and Rust that have more stringent runtime requirements could make uses of lightweight threading as well.

weberc2 · on Dec 8, 2019

What’s the advantage of lightweight threading features? I thought your position (based on comments elsewhere) was that kernel threads are roughly as fast as it gets? What am I misunderstanding?

pcwalton · on Dec 8, 2019

I think M:N threads mostly aren't worth the drawbacks right now, but those proposed kernel features would change the calculus significantly.

toolslive · on Dec 8, 2019

Threads are fast, when they are working for you. If not, you need to wait until the work gets scheduled... and thread context switching is slow.

gok · on Dec 8, 2019

POSIX threading is not broken, the Go scheduler just does a bunch of goofy things that aren't really supported. Moving stacks between threads breaks all kinds of things. A more idiomatic approach would be for the compiler to emit properly resumable functions, like most async/await implementations do.

duelingjello · on Dec 8, 2019

There is no one-size-fits-all approach.

LLVM IR has async/await and coroutines, but most real-world VMs and language static compilers cannot depend on such intrinsics because of their memory and execution models. For example, Pony's ORCA has unique memory barrier and execution models that wouldn't work with this approach, although it uses LLVM for compilation down to metal. This is why LLVM is a loose framework and collection of tools split into "middleware" passes, rather than a single monolith.

PS: According to its paper, ORCA is supposedly one of the fastest GCs for most use-cases. It beat Zulu's C4, Erlang BEAM and another one in a deathmatch. It's too bad it can't be extracted as a separate project or integrated into OpenJDK or LLVM without lots of work. Of course, no GC is better (I'm staring at you, Rust. :).

http://releases.llvm.org/8.0.0/docs/Coroutines.html

dboreham · on Dec 8, 2019

This is always a problem with green threads aka M:N threading.

mitchty · on Dec 8, 2019

The go runtime moves stacks between threads?

Oof that’s horrible, any pointers to the logic behind it? I’m curious the rationale.

benaadams · on Dec 9, 2019

There are two main ways to do async/concurrency where you release the thread to do other work while you are waiting.

1. stackless (async/await) where the operation becomes an inspectable object that you can choose what to do with (awaiting being suspend for completion) as taken by C#, C++, Python, JS, PHP, Swift and Rust

2. "with stack" where you pretend its not async; but this means when something else uses the thread you need to get the suspended operation's stuff off the thread; usually by not using the thread's stack at all and having it in the heap and just jumping into and out of these "off-thread" stacks; as used by Go and being looked at for Java (as Project Loom)

Interesting paper on it http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p136...

> While fibers may have looked like an attractive approach to write scalable concurrent code in the 90s, the experience of using fibers, the advances in operating systems, hardware and compiler technology (stackless coroutines), made them no longer a recommended facility.

Disadvantage of stackless is the extra boilerplate (e.g. async/await everywhere); though it also gives more control as the consumer of the operations (e.g. fanout and wait for many; or continue not waiting for the result at all)

Advantage of the "with stack" approach is it looks the same as non async code as its all hidden (goroutines aside); which is why Java is no doubt looking at doing it as there is a large body of code that would need to be rewritten so "hiding it" is easier to avoid that.

C# had/has teething issues when async/await was introduced as it kept the initial thread blocking methods; and added the async and they don't mix very well, you need really to go one way or the other when developing.

Javascript leapt at async/await as it was all async anyway, but callback based which makes for horrible code to follow; so it made everything much cleaner.

pcwalton · on Dec 8, 2019

The valid reasons for M:N threading are reducing syscalls on goroutine spawning, and avoiding the overhead of the kernel scheduler on context switch.

duelingjello · on Dec 8, 2019

I think to keep Go code directly callable from C, they have to follow the platform's C calling conventions which means the same stack layout. So for cooperative concurrency on a single thread to work, each Goroutine needs its very own stack. On Intel, that means saving stack pointers RSP and RBP (16 bytes) for each. Also, each will need memory allocated for its stack for the stack pointers to point to... another 8-16 bytes (pointer and length).

echlebek · on Dec 9, 2019

The gc compiler, used by the vast majority of Go developers, does not use the C calling convention.

https://golang.org/doc/faq#Do_Go_programs_link_with_Cpp_prog...

gok · on Dec 8, 2019

Like much of Go, it was likely done because it made it easier to recycle Plan 9 code.

enneff · on Dec 8, 2019

Not sure where you got this idea. The only significant part of the Go codebase that was inherited from Plan 9 was the C compilers, used to build the original Go compiler that was written (from scratch) in C. I think perhaps a hash table implementation was also brought over from P9. That stuff is all long gone now, though.

The idea that Go's scheduler design is somehow inherited from Plan 9 is ridiculous.

duelingjello · on Dec 8, 2019

Well, in general, POSIX threads are much more expensive (RAM) than some unit of minimal cooperative concurrency/parallelism, say Erlang "processes." The idea of using a threadpool isn't broken because an user-space "scheduler" decides which tasks to run on which threads. It might also decide how to scale or shrink the threadpool. Ultimately, only one thing can run on a processor at a given time, and that typical means a task structure containing at least two items if executing on an interpreted/p-code VM:

0. next unit of work (pointer/counter; instruction, function pointer, etc.)

1. task-local heap (pointer or structure)

2. operand stack (pointer or structure; for stack-oriented VMs only)

masklinn · on Dec 8, 2019

> Well, in general, POSIX threads are much more expensive (RAM) than some unit of minimal cooperative concurrency/parallelism, say Erlang "processes."

The vast majority of the "expense" is irrelevant as it's virtual memory and unlikely to ever be touched (and thus committed).

loeg · on Dec 9, 2019

Depends on your codebase. Userspace C libraries and programs, including libc, often store surprisingly large buffers on the stack.

For example, try setting 'ulimit -s 128' (128kB stack limit) and see how many C programs crash. Then try, say, 16. Go's default is 8 kB, raised from 4 kB in 1.2: https://golang.org/doc/go1.2#stack_size

Linux's default userspace stack limit is 8 megabytes for a reason — programs really do use it.

masklinn · on Dec 9, 2019

> for a reason

Not really, the 8MB limit was added back in '95 from a previous limit of "essentially none"[0] with a justification of

> Limit the stack by to some sane default: root can always increase this limit if needed.. 8MB seems reasonable.

Developers don't generally think about their stack size, especially for single-threaded programs[1] so the defaults need to be a sweet spot of not unnecessarily big (such that you can catch unbounded recursion) but not so small that you'd segfault more than a very small fraction of all programs.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/history/hist...

[1] which would be why e.g. OSX has a large main thread stack (8MB) and a relatively puny secondary thread stack (512k).

pcwalton · on Dec 8, 2019

Most of the memory cost of a POSIX thread is in the stack, and you can customize the stack size to be quite small. Small stacks are properly thought of as a property that GC enables, not a property that M:N threading enables.

masklinn · on Dec 8, 2019

> Most of the memory cost of a POSIX thread is in the stack, and you can customize the stack size to be quite small.

The problem there is that you need to very carefully size your stack as a mis-sizing will lead to a risky stack overflow. I'm not sure it's necessary either as allocating a "large stack" but using very little of it means most of it is never committed, and thus only costs memory mappings.

pcwalton · on Dec 8, 2019

If you have the runtime infrastructure to grow stacks, then you can use that with POSIX threads too.

masklinn · on Dec 9, 2019

Is there any systems where the C stack is growable? Do stack frames get prefixed with an explicit request for some amount of stack memory, leading to the stack possibly being moved before the funcall happens?

pcwalton · on Dec 10, 2019

We used to have growable C stacks in Rust using that technique. They worked (though were too slow for us). It could have been fixed by using stack copying like Go does.

sagichmal · on Dec 9, 2019

Is there a stack size configuration that would enable a program to spawn 1M POSIX threads?

pcwalton · on Dec 9, 2019

As I recall the minimum total user + kernel stack size is 10kB in the Linux kernel, so 1M threads is 10GB of space. It should be doable, though you will probably have to bump up kernel limits.

A million threads is an extreme case, though. No system can reliably spawn that many threads that are actually doing something interesting without a very large amount of memory. When you leave the realm of microbenchmarks you have to expect that an unknown quantity of threads will have deep call stacks at any given time, so you really need to give yourself leeway to avoid the risk of OOM.

sagichmal · on Dec 9, 2019

So a big advantage that Go's M:N goroutine model brings to the table is how cheap they are. Cheap enough that tons of concurrency-related stuff you want to do in application code, like implementing a highly-concurrent algorithm, can be done with goroutines directly, without having to think too hard about mechanical sympathy and e.g. translate logical concurrency to physical threading. Go processes commonly have 1M or even 10M active goroutines at once.

So I don't think it's fair to say POSIX threads are comparable or whatever if they don't have this property.

pcwalton · on Dec 10, 2019

Go processes do not typically have 1M or 10M active goroutines at once. The initial stack size for goroutines is 2kB, so 10M goroutines would mean 20GB just for stacks, even assuming that the goroutines never grow their stack (which cannot be assumed for anything nontrivial). The 2kB minimum stack size is on the same order of magnitude as the 10kB POSIX thread stack size.

sagichmal · on Dec 10, 2019

> Go processes do not typically have 1M or 10M active goroutines at once.

It depends on domain, but in my domain of high-RPS network servers, they absolutely do.

> 10M goroutines would mean 20GB just for stacks.

When I deploy to metal, the average host has ~512GB of RAM, and not much cotenancy.

> The 2kB minimum stack size is on the same order of magnitude as the 10kB POSIX thread stack size.

There's also a question of cost to create and destroy; it's very common for goroutines to live for O(µs). I don't know how POSIX threads compare here.

loeg · on Dec 9, 2019

The mismatch occurs because Go implements its own threading model, completely ignoring your operating system's implementation of userspace pthreads. If it then attempts to interact with programs using pthreads without taking special care, yeah, it can violate the pthreads API. Such Go programs are broken.

I'm not sure why this leads you to the conclusion that "the whole POSIX threading model seems broken."

derefr · on Dec 8, 2019

I would love to see a compare-and-contrast between the Golang scheduler and the Erlang scheduler, in the way they handle network-IO-heavy workloads. Maybe throw in the JVM scheduler, too (though its JIT would likely complicate things.)

tyingq · on Dec 8, 2019

Makes me somewhat curious how go deals with a hung NFS mount ("hard mount"). I suspect everything would stop, where a normal OS thread wouldn't hang if it weren't interacting with NFS.

siebenmann · on Dec 8, 2019

This should work fine. The goroutine making the system call that touches the NFS mount will consume an OS thread (an 'M' in Go terminology), but it will release its hold on other resources. Go uses as many OS threads as necessary to cope with running user code and doing OS system calls and so on (and starts new ones on demand).

If you had lots of goroutines do lots of things that stalled on hung NFS mounts, you would build up a lot of OS threads (all sitting in system calls) and might run into limits there. But that's inevitable in any synchronous system call that can stall.

(I'm the author of the linked-to article.)

amluto · on Dec 8, 2019

A side effect of this scheme is that a long sequence of slow-but-not-that-slow syscalls becomes extremely slow because the Go scheduler gets invoked each time.

pythux · on Dec 8, 2019

Nice read! I am not very familiar with this field of research but, could runtimes of other languages (say, Node.js or Python) benefit from such optimizations? What about libraries like libuv, I guess they must be fairly fine-tuned already? Or is this something that is specific to Go and would be hard in other contexts?

Mathnerd314 · on Dec 8, 2019

Haskell has had an I/O manager for doing asynchronous I/O for a while: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.649...

There's been some discussion about using libuv but no consensus: https://gitlab.haskell.org/ghc/ghc/issues/8400. It's available as a library: https://haskell-stdio.github.io/stdio/

I didn't re-read the papers but IIRC GHC just spawns an extra OS thread any time a possibly-blocking function is called, as it doesn't follow a strict M:N model. There's a thread pool to reduce overhead but it's probably not as efficient as Go's method.

kevingadd · on Dec 8, 2019

Go is one of the only languages that does syscalls itself (mostly because it's extremely high-risk and low-payoff), so some of its syscall-related techniques are not easily adapted to other runtimes.

pcwalton · on Dec 8, 2019

Note that even Go only does syscalls itself on Linux. On macOS and Windows it calls into libSystem and kernel32.dll respectively, as the syscall interface is not stable on those platforms.

masklinn · on Dec 8, 2019

> Note that even Go only does syscalls itself on Linux.

AFAIK Go does syscalls itself on any platform but Windows and macOS, this includes all BSDs. And even for macOS despite that having never been officially supported it took multiple breakages a few years back.

The first thread here mentions the issues that causes for openbsd.

loeg · on Dec 9, 2019

Go does (or did) bare syscalls on the BSDs as well, despite the syscall interface not being stable there.

earenndil · on Dec 8, 2019

ntdll is pretty close to stable. Technically not stable, but high-profile projects like chrome depend on it, so it's not likely to change at this point.

derefr · on Dec 8, 2019

Do JITing runtimes like the JVM or LuaJIT, just generate libc-syscall-wrapper calls in the emitted object code, then?

blattimwind · on Dec 8, 2019

Something like LuaJIT doesn't have the concept of a syscall, only calling into external C code. The same is probably(?¹) true for the JVM, since Java uses native methods to talk to the OS as well. So the JIT'ed code would call into the language runtime library, which in turn would call a syscall wrapper provided by the libc.

¹ It's possible the JVM, being the highly optimized workhorse VM it is, has specialized optimizations for I/O and does indeed skip over JNI and libc in these cases.

justincormack · on Dec 8, 2019

ljsyscall for example calls into the libc syscall() wrapper. Most ffi type APIs can’t generate assembly for calling directly.

fsfod · on Dec 8, 2019

In theory you could do it directly with the intrinsic system I built for LuaJIT[1]. It would dynamically generated the assembly for a user declared intrinsic\arbitrary machine opcode when there first called in the interpreter and the opcode is directly emitted when the code is JIT'ed. I think defining an intrinsic for a system call would just be a matter of setting the correct input and output registers.

[1] https://github.com/LuaJIT/LuaJIT/pull/116

cesarb · on Dec 8, 2019

For the JVM: AFAIK, the JIT-generated code (and the interpreter) never does system calls, either directly or indirectly through the C library. Instead, they call "native" code written in C or C++, and it's that native code which does all the system calls or equivalent.

lonelappde · on Dec 8, 2019

This is presumably because the authors of Go are also Unix implementers or close to it. It's interesting to see the philosophy extended to non-Unix deployments.

duelingjello · on Dec 8, 2019

1. Does OS thread M get pinned to run only on a particular processor P? (It seems like "yes" when default.)

2. If M blocks in a syscall too long in the optimistic case:

2.a. is M unpinned from P but continues to block until the syscall returns?

2.b. is another thread from the pool used or new thread created, and pinned to P so that P can be used for other work? (I think this depends on configuration if there are fewer, same or more threads than processors.)

2.c. is there an upper limit on outstanding blocked syscall worker threads or will it simply be the last task any extra created threads beyond the normal limit would ever process?

siebenmann · on Dec 8, 2019

An OS thread M can run on any available P. While there are some caches associated with each P, Ps are fundamentally there to insure that only so many CPUs worth of Go user code is ever running at once, so the important thing is that an M that wants to run user Go code has some P, not a particular P. Ms claim and release Ps as they go in and out of running Go user code, but I believe they don't release and then re-acquire a P as they switch between goroutines.

(I believe the actual implementation treats Ms as a sort of secondary thing. For instance, I think that the local list of runnable goroutines is attached to the P, not to the M. At one level, the M is just a context for running things on Ps.)

In the optimistic case when the system call blocks for too long, the M is unpinned from the P it was using and continues to sit in the system call (the Go runtime doesn't attempt to interrupt the system call itself). If there is another runnable goroutine and there are no free M's, the Go scheduler will create another M to run the goroutine on the now-free P. I think that the runtime directly allocates the free P to the newly created M rather than letting the new M try to contend with other things for the P, but I'm not sure.

I don't think there's any limit on the number of Ms (OS threads) that the Go runtime will create, but I haven't checked the code carefully. Idle Ms are reclaimed under some circumstances.

(I'm the author of the linked-to article.)

jancsika · on Dec 8, 2019

> For example, on modern systems the 'system call' to get the current time may not even enter the kernel (see vdso(7) on Linux).

Is there a way to check from the running process whether that is the case or not?

wyldfire · on Dec 8, 2019

Check whether it happened to do that for a given call or a portable way to predict whether it will?

dump_vdso [1] will write the vdso to stdout, you can use binutils like objdump or nm to list the symbols present.

[1] https://kernel.googlesource.com/pub/scm/linux/kernel/git/lut...

monocasa · on Dec 9, 2019

And if you wanted the same information at runtime, the base address of the VDSO is passed in as an auxv, and then passing that address into libelf would get you everything.