On the Costs of Syscalls (2021) (gms.tf)
105 points by K0nserv 9 months ago | 87 comments



This microbenchmark measures the direct cost of system calls. But flushing the TLB and trashing the caches carry an indirect cost too.

According to some benchmarks I've seen (sorry, no link), a system call in the middle of a memory-heavy inner loop can impact performance for 100 µs before performance is back to steady state. This is 100x longer than the system call alone. Of course there is work being done, but at a reduced throughput. This is in line with my practical experience from working with performance-sensitive code.

These things are difficult to benchmark reliably and draw actionable conclusions from the results.

This is not a criticism of the author of this article; the article clearly describes the methodology of the benchmarks and does not suggest any wrong conclusions from the data.


It's a bit old and doesn't include recent microarchitectural changes, but Section 2 of the FlexSC paper from 2010 (https://www.usenix.org/legacy/event/osdi10/tech/full_papers/...) has a detailed discussion of these indirect effects. I especially like how they quantify the indirect effects by measuring user code's IPC after the syscall.


I think those are the benchmarks I alluded to in the GP post.

Table 1 on page 3 is absolute gold: it quantifies the indirect costs by listing the number of cache lines and TLB entries evicted. The numbers are much larger than I remembered.

According to the table, the simplest syscall tested (stat) will evict 32 icache lines (L1), a few hundred dcache lines (L1), hundreds of L2 lines and thousands of L3 lines, and about twenty TLB entries.

After returning from said syscalls, you'll pay a cache miss for every line evicted.

Also worth noting that inside the syscall, the instructions per clock (IPC) is less than 0.5. When the CPU is happy, you generally see IPC figures around 2 to 3.
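(If you want to poke at this yourself on Linux, here's a rough sketch using perf_event_open(2) to count cycles and instructions around a syscall-heavy loop and derive IPC. Error handling is omitted and you may need to lower /proc/sys/kernel/perf_event_paranoid; this is an illustration, not the paper's methodology.)

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    /* Open one hardware counter for the calling thread, initially disabled. */
    static int open_counter(unsigned type, unsigned long long config) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.size = sizeof attr;
        attr.type = type;
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 0;      /* count the kernel side of the syscalls too */
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void) {
        int cyc = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
        int ins = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
        long long cycles = 0, instrs = 0;

        ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(ins, PERF_EVENT_IOC_ENABLE, 0);
        for (int i = 0; i < 100000; i++)
            getppid();                /* a cheap syscall, repeated */
        ioctl(cyc, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(ins, PERF_EVENT_IOC_DISABLE, 0);

        read(cyc, &cycles, sizeof cycles);
        read(ins, &instrs, sizeof instrs);
        printf("IPC ~ %.2f\n", (double)instrs / (double)cycles);
        return 0;
    }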


Yeah... FlexSC / Soares is my favorite paper from OSDI 2010. The system call batching with "looped" multi-call they mention there relates to the roughly 30-line (not actually looping) assembly language in my other comment here (https://news.ycombinator.com/item?id=39189135) and in a few ways presaged io_uring work.

Anyway, a 20-line example of a program written against said interpreter is https://github.com/c-blake/batch/blob/1201eefc92da9121405b79... but that only needs the wdcpy fake syscall not the conditional jump forward (although that could/should be added if the open can succeed but the mmap can fail and you want the close clean-up also included in the batch, etc., etc.).

I believe Cassyopia (also mentioned in Soares) hoped to be able to analyze code in user-space with compiler techniques to automagically generate such programs, but I don't know that the work ever got beyond a HotOS paper (i.e. the kinda hopes & dreams stage) and it was never clear how fancy the anticipated batches were meant to be. The Xen/VMware multi-calls Soares2010 also mentions do not seem to have inline copy/jumps, though I'd be pretty surprised if that little kernel module is the only example of it.


It really seems like the OS should just take a core for itself.


> According to some benchmarks I've seen (sorry no link), a system call in the middle of a memory heavy inner loop can impact performance for 100 us before performance is back to steady state. This is 100x longer than the system call alone.

Do you have some intuition of what causes this? I don't have much experience with this kind of work, but 100µs is enough time to do hundreds of random memory accesses. How can a single syscall do so much damage to the cache?


As the GP mentioned, it's TLB cache invalidation that can be the problem. Reloading the mappings from virtual (per-process) addresses to physical addresses after returning from the kernel can cause delays (on some architectures), even if most of the physical memory is still in the cache and valid.

https://en.wikipedia.org/wiki/Translation_lookaside_buffer

(Also, worth pointing out that they didn't claim a delay of 100µs, just that some (presumably much smaller, on the order of ns?) delays can keep showing up for up to 100µs before "steady state" is fully restored.)


As syscalls run more code and access more data, they can increase TLB pressure the same way they increase general cache pressure, but syscalls per se (outside of things like munmap) don't typically invalidate the TLB, as the mapping is not changed on a user-space/kernel-space transition, only the access rights (there were exceptions, like the brief period when people were running with a full 4GB user address space on 32-bit CPUs).


> but syscalls per se (outside of things like munmap) don't typically invalidate the TLB, as the mapping is not changed on a user-space/kernel-space transition, only the access rights

Isn't that exactly what changed for the Meltdown/Spectre mitigations? That it is invalidated now?


I think you are referring to KPTI, but a) it might not be needed on recent CPUs where Meltdown-like issues have been fixed in hardware, and b) in any case, on CPUs that can tag TLB entries with process identifiers (most of them these days), it can avoid TLB flushes most of the time.

But I don't claim any specific knowledge of these mitigations.


Is the PID different when entering kernel space but still doing work for the calling PID?


I don't think it is different, but it doesn't matter: the kernel can put whatever it wants in the register that is used to match the process-identifier tag in the TLB.


TLB flushing, plus the dcache/icache evicting hot cache lines in favor of the instructions and data needed by kernel mode.

It takes a while of normal operation until these are populated again with the hot data; until then, the system's overall throughput is reduced.


The best part of this thread was the link to the discussion on the removal of pid caching in `getpid` causing regressions in systemd https://bugzilla.redhat.com/show_bug.cgi?id=1469670

I've been on a code-deleting rampage lately, killing off many minor features that add a lot of complexity for seemingly little gain. Glad I don't have anyone interrogating me like this!


The snark here is impressive:

> in order to quickly fix this pressing matter, I've attached code that I believe will fix your issue. It uses a custom caching mechanism and fits well into the systemd ecosystem.

    pid_t systemd_getpid(void) {
        return 0;
    }


It's kind of a dick answer, imo.


I'm really not sure if it's perfectly context sensitive or too much even for systemd.

But dick answers are the specialty of the systemd development team. I wouldn't want to get into a contest against them.


That one had me laughing really hard haha


That was interesting - "Umm, this change causes major slowdowns in various bits of systemd code, which assumes that getpid() is fast. libsystemd uses getpid() to detect when processes fork". A regular userspace program doesn't need to detect when a fork() has occurred, since it's what's doing the fork().


Caching the result of getpid() means you need to update the cached value after the process forks. So fork() must be wrapped if getpid() is wrapped.
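Roughly the shape of the idea (a sketch with made-up names, not the actual glibc code):

    #include <unistd.h>
    #include <sys/syscall.h>

    static pid_t pid_cache;                 /* 0 means "not cached yet" */

    pid_t my_getpid(void) {
        if (pid_cache == 0)
            pid_cache = (pid_t)syscall(SYS_getpid);   /* one real syscall, then cached */
        return pid_cache;
    }

    pid_t my_fork(void) {
        pid_t child = fork();
        if (child == 0)
            pid_cache = 0;                  /* in the child: force a re-read next time */
        return child;
    }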


While interesting to read, the title should be "On the Costs of Linux Syscalls"


Yeah, this paragraph stands out as a bit strange:

"I ran the benchmark on a heterogeneous set of hosts, i.e. different kernels, operating systems and configurations"

In the fine print: the "different OSes" were two Linux distributions which are (if I understand correctly) not even that far apart tech-wise? (RHEL and Fedora).


And even more specifically "On the Costs of Linux Syscalls on Intel processors".

I would like to see this extended to other processor families.


There are also RTOS-capable microkernels such as seL4[0], with few but extremely fast syscalls[1]. Note times are in cycles, not ns.

0. https://sel4.systems/

1. https://sel4.systems/About/Performance/


AFAIK seL4 is similar now, as most of the weight comes from Spectre/Meltdown mitigations flushing the TLB on privilege changes.


I tested the performance of system calls on Windows 10 by repeatedly calling NtClose on an invalid handle. A system call took around 1200 CPU cycles.


Which timing mechanism did you use?


This was a while ago and I don't remember the exact details...

I had two loops with a large number of iterations. One called a trivial Win32 function (GetACP); the other made a system call (NtClose) as well as the same trivial Win32 function. I used QueryPerformanceCounter to time the loops.

From there, I estimated the number of machine cycles per iteration (using my clock speed of the time), and subtracted the "trivial" version which made no system calls.
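Something along these lines (a rough reconstruction from the description above, not the original code; NtClose comes from ntdll, so link against it):

    #include <windows.h>
    #include <winternl.h>     /* NtClose prototype */
    #include <stdio.h>

    int main(void) {
        const int N = 1000000;
        LARGE_INTEGER freq, t0, t1, t2;
        volatile UINT sink = 0;

        QueryPerformanceFrequency(&freq);

        /* Loop 1: trivial user-mode call only. */
        QueryPerformanceCounter(&t0);
        for (int i = 0; i < N; i++)
            sink += GetACP();
        QueryPerformanceCounter(&t1);

        /* Loop 2: same trivial call plus one system call on a bogus handle. */
        for (int i = 0; i < N; i++) {
            sink += GetACP();
            NtClose((HANDLE)(ULONG_PTR)0xDEADBEEF);
        }
        QueryPerformanceCounter(&t2);

        double base = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
        double with = (double)(t2.QuadPart - t1.QuadPart) / (double)freq.QuadPart;
        /* Multiply the per-iteration difference by your clock frequency
           to estimate cycles per syscall. */
        printf("extra time per syscall: %.1f ns\n", (with - base) * 1e9 / N);
        return 0;
    }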


Clever approach!


Peeking at "ntoskrnl.exe" in Ghidra, it appears that "NtVdmControl" is a true "do-nothing" system call in Windows 10 x64, and Windows 11 removed the function entirely.


The measured mitigation overhead is very interesting. Those XG machines have very fast syscalls (<100 ns) but are more than twice as slow with mitigations on, even though they are fairly recent and most Meltdown/Spectre bugs should have been patched out.


Now that's interesting. I remember years ago profiling the gettimeofday() function on various platforms. It was fine on all except our AWS VMs. Turns out it wasn't implemented as a vDSO call there and the performance was all over the place. That's been fixed since.

We use getpid() in logging and maybe it's unnecessary.


Any good resources/deeper dives into the details of what actually happens on modern computers when making these syscalls on, say, Linux? I might have a reasonable idea about what goes on when performing a mode switch or context switch, but I'd love to have a reference or a nice walkthrough.


My introduction to this was the 'Direct Execution' chapter of the book 'Operating Systems: Three Easy Pieces'.

It's a fantastic write-up on not just how system calls work, but the motivation behind why operating systems even implement them in the first place. It's not specifically about Linux, but the book is clearly heavily inspired by early *nix designs and it's still applicable.

You can read it for free here: https://pages.cs.wisc.edu/~remzi/OSTEP/cpu-mechanisms.pdf


I wrote an article about Linux system calls, albeit from the user's perspective, detailing why you might want to use them and how to do so.

https://www.matheusmoreira.com/articles/linux-system-calls

LWN has the kernel's perspective well covered by their articles on the anatomy of Linux system calls:

https://lwn.net/Articles/604287/

https://lwn.net/Articles/604515/

https://lwn.net/Articles/604406/

They are thoroughly dissected in these articles. I also use them as a reference.


I wrote a detailed walk-through of the illumos syscall handler back in 2016. It's missing the updates around KPTI (meltdown mitigation), but other than that the mechanisms should be unchanged.

https://zinascii.com/2016/the-illumos-syscall-handler.html



> It’s not a great idea to call system calls by writing your own assembly code.

> One big reason for this is that some system calls have additional code that runs in glibc before or after the system call runs.

That's not really the fault of the system calls. The problem is the C library itself. It's too stateful, it's full of global data and action at a distance. If you use certain system calls, you invalidate its countless assumptions and invariants, essentially pulling the rug from under it.

I've found programming in freestanding C with Linux system calls and no libc to be a very rewarding experience. Getting rid of libc and its legacy vastly improves C as a language. The Linux system call interface is quite clean. Don't even have to deal with the nonsense that is the errno variable.
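For a taste of what that looks like, here's a minimal freestanding sketch for x86-64 Linux (built with something like gcc -nostdlib -static; register assignments follow the x86-64 syscall ABI, and the helper name is made up):

    /* write(2) and exit(2) invoked directly, no libc anywhere. */
    static long syscall3(long nr, long a1, long a2, long a3) {
        long ret;
        __asm__ volatile ("syscall"
                          : "=a"(ret)
                          : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                          : "rcx", "r11", "memory");
        return ret;               /* negative values encode -errno; no errno global */
    }

    void _start(void) {
        static const char msg[] = "hello from raw syscalls\n";
        syscall3(1, 1, (long)msg, sizeof msg - 1);   /* __NR_write == 1  */
        syscall3(60, 0, 0, 0);                       /* __NR_exit  == 60 */
    }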


Sounds roughly equivalent to discussing how much more rewarding it is to walk 1000km than to drive it.

Yes, everybody knows you can get from A to B on foot. But for multiple reasons, and over an extended period of time, we developed other methods of doing so. That doesn't invalidate the experience of doing it on foot, but it does make the walk into a very conscious choice that probably isn't the one most people are going to make most of the time.


But the internal details of libc leaking is accidental, not essential, complexity for the problem at hand. There's the baseline level of difficulty with writing assembly language, but the libc influence is imposed on top of that.


The internal details of libc are not leaking in the case under discussion.

libc wraps system calls. If you use libc and try to make your own system calls, you're going to collide with internal details of libc.

Not using libc is fine. Not making your own system calls via asm is fine. Pick one.


If the policy of the OS is for programmers to only use libc for syscalls, perhaps that's a valid caveat to asm syscalls, but I don't think Linux is such an OS.


> I don't think Linux is such an OS.

It's not! I wrote somewhat at length about that very question in this article:

https://www.matheusmoreira.com/articles/linux-system-calls


It's not the policy of the OS.

It's the policy of (some/most/all) libc implementations. Don't like it? Find a different libc or do without libc.


It absolutely is a policy of the OS and whoever is in charge of it. Nearly all of them provide their own libraries and decree that the only supported way to interface with the system is via those libraries. So you cannot "find a different libc or do without libc" on the vast majority of operating systems out there. You literally cannot get away from the OS libraries. If you try, your software inevitably breaks when they change system call numbers and semantics. Golang found out the hard way. Here's an awesome quotation from a BSD guy:

> I have altered the ABI. Pray I do not alter it any further.

Linux is actually the odd one. It's the only one with a stable language-agnostic system call ABI. It's stable because breaking the ABI makes Linus Torvalds send out extremely angry emails to the people involved until they fix it. Because of this stability, you actually can have alternative libc implementations like musl and you can also rewrite literally everything in Rust or Lisp if you want. Only on Linux can you do this.

For some reason people try extra hard to make it look like glibc is some integral part of the kernel. Diagrams on Wikipedia showing glibc enveloping the kernel like it's some outer component. Look up Linux manuals and you somehow get glibc manuals instead. The truth is they're completely independent projects. The GNU developers have zero power to force anyone to use glibc on Linux. You can make a freestanding application using system calls directly and literally boot Linux into it. I have made it my goal to create an entire programming language and ecosystem centered around that exact concept.


You are using the term "OS" to mean something other than "kernel", which is the meaning required by this thread (and arguably entire topic) thus far.

This is demonstrated by the rest of your comment, in which you note that linux has a stable, language agnostic system call ABI, and that other libc implementations exist.

TFA was ONLY about Linux syscalls, so what other platforms do or do not is irrelevant in context.


I understand what you mean. It's just that I object on principle to this notion that using system calls is "wrong" or "something normal people aren't supposed to do" as if the libc programmers have special privileges. It's these constant warnings about and fearmongering around the system calls that made me want to use them in the first place.


The warnings etc. are for people who, more likely than not, are using a libc implementation that depends upon its own wrapping of syscalls. As such, they are entirely appropriate.

Do they apply to all possible uses of syscalls? They do not.


Are syscalls this expensive on all OSes (e.g. BSD, Redox, etc.)?

Does OpenBSD limiting how you can make a syscall help with performance at all (or hurt it)?


Anyone know why getpid/getuid still aren't implemented in the vDSO?


Because what getpid returns can change during the lifetime of a process, if it moves to a different PID namespace. I guess updating the vdso every time that happens is not worth the little gains.


> ... can change during the lifetime of a process, ...

Fun fact: this is already an issue for the vdso.

gettimeofday(2) can use the CPU's built-in cycle counters, but it needs information from the kernel to convert cycles to an actual timestamp (start and scale). This information, too, can change at runtime.

To not have userspace trampled over by the kernel, the vdso contains an open-coded spinlock that is used when accessing this information.
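Very roughly, the reader side looks something like this (a simplified sketch of the retry loop with a made-up data layout; the real vDSO code uses a sequence counter in the shared data page plus proper memory barriers):

    struct vdso_time_data {               /* hypothetical layout */
        volatile unsigned seq;            /* odd while the kernel is updating */
        unsigned long long base_cycles;   /* cycle count at the last kernel update */
        unsigned long long base_ns;       /* wall-clock ns at the last kernel update */
        unsigned mult, shift;             /* cycles -> ns conversion factors */
    };

    static unsigned long long read_ns(const struct vdso_time_data *d,
                                      unsigned long long now_cycles) {
        for (;;) {
            unsigned seq = d->seq;
            if (seq & 1)
                continue;                 /* kernel write in progress, spin */
            unsigned long long delta = now_cycles - d->base_cycles;
            unsigned long long ns = d->base_ns + ((delta * d->mult) >> d->shift);
            if (d->seq == seq)
                return ns;                /* no concurrent update, snapshot is consistent */
        }
    }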

I learnt about this while debugging a fun issue with a real-time co-kernel where the userspace thread occasionally ended up deadlocking inside gettimeofday and triggering the watchdog :-)


Does avoiding userspace being trampled by the kernel mean avoiding a context switch? Does the kernel write to userspace-accessible memory that the spinlock guards?

Unrelated, but what does "open-coded" mean? I never seem to find an obvious answer online.


> Unrelated, but what does "open-coded" mean? I never seem to find an obvious answer online.

In my understanding, open-coded means something akin to "manually inlined". Or: written inline while an acceptable alternative exists as a function.


When are you ever calling getuid in a tight loop?


GP's question is perfectly valid. It doesn't have to be called within a tight loop at all. It could be a latency-sensitive routine, for example. I can see at least a few real-life applications where it would make sense to have the syscall vDSO'd.


What's one of them?


https://bugzilla.redhat.com/show_bug.cgi?id=1469670

I would appreciate it if someone explained why it's not in the vDSO, rather than just repeating that it's not necessary as if that were a sufficient explanation.


It's not in the vDSO because it was never added. "Not in the vDSO" is the default; a vDSO function is additional complexity and maintenance, so you need to justify its addition/existence, not its non-existence.

Also, given https://ipfs.io/ipfs/QmdA5WkDNALetBn4iFeSepHjdLGJdxPBwZyY47i... (while Linus may have changed tack in the decades since), I would not expect much support for stating that getpid is performance-critical to you.


I think we're just talking past each other. I'm asking why you'd want to vDSO getuid(), not getpid(). I can imagine cases where you'd repeatedly be calling getpid().


Eh. All of these things requiring a context switch seem alien to me. So it's less of a use case and more "just make it fast", since the incremental cost should be 0?


Well, the incremental cost isn't zero.


From the linked thread:

> The most difficult issue to get right is cache invalidation.

For sure, right up there with naming things and off-by-1s.


The vDSO is designed for things that should be fast and are called often. There is absolutely no reason to call those functions often.

The only time your PID would change from a previous value obtained from that very same function is in the child, after a call to fork(), which is itself so heavy that it would dwarf your calls to getpid().


Is it a scarce resource? For what it's worth, when the caching was removed from glibc it caused a performance problem in libsystemd, which had a tight loop to detect when a process was forked. So while this may be true in many cases, there are clearly cases where it's not, and your answer doesn't explain why getpid isn't in the vDSO.


> which had a tight loop to detect when a process was forked.

This sounds so mad it needs more context. You don't need to detect when a process is forked, it's an action issued from inside the process?


Maybe the problem is that it is "lib"systemd, meaning it doesn't really have control over when the process forks; that's the application code's responsibility. I would probably also avoid this design, but that would probably make the library interface a bit unfriendly, with the application needing to pass its own PID to library functions where appropriate.


The audacity of the design, detecting a PID change by polling it in a loop, is interesting. I don't think I could have come up with something like that. Feels very "sync the web DOM by traversing it and comparing with a shadow DOM".


If only we had an API to tell us when a fork happened. We could place it into the usual library for threading stuff: pthread, and could call it, oh I don't know, maybe pthread_atfork()

https://man7.org/linux/man-pages/man3/pthread_atfork.3.html
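For instance, a library that wants a cheap cached PID could do something like this (a hypothetical sketch, not libsystemd's actual code):

    #include <pthread.h>
    #include <unistd.h>

    static pid_t cached_pid;

    static void on_fork_child(void) { cached_pid = getpid(); }   /* runs in the child */

    static void init_pid_cache(void) {
        cached_pid = getpid();
        pthread_atfork(NULL, NULL, on_fork_child);   /* (prepare, parent, child) */
    }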


It is mentioned in the glibc bug tracker entry as a replacement for the polling. systemd didn't use it because they thought it would make it impossible for a library using it to be dlclose'd, as there is no way to unregister the callback. But apparently glibc has (undocumented) magic under the hood to take care of it.


Memory is always a scarce resource. However big you make the vdso, that amount of memory, multiplied by the number of processes, is the amount wasted. In reality, it will be more, because of the additional page tables required to map it at an address range that is not commonly used.

Also, what sort of raving lunatic had a busy loop checking the current process PID? Who allowed such a person access to a compiler? That seems like the bigger problem.


I would assume good faith on the part of the systemd developers. AFAIK they're as talented as any other top-tier group of developers.

Somehow I doubt the code for getpid in the vDSO would meaningfully impact the amount of memory used on a running system, nor would I expect any meaningful impact on the number of page table entries. Do you have any supporting evidence for your claim otherwise?


Good faith, yes. But even talented people get into strange waters sometimes.


No, libsystemd did this because it's a library. They went with pthread_atfork as an alternative way to detect when the application code forked, but it's not without its own set of risks. Of course they could have required the user of the API to supply the PID everywhere to avoid this, but it's possible even that wouldn't solve the problem, and it complicates the interface; the interface of a library API is just as important as other technical details. It's a design tradeoff, and judging it as some kind of correctness issue without having a clear picture, while insulting the authors out of hand, is weird behavior.


I agree with everything you said except that "systemd" and "insulting" don't go together like peanut butter and jelly. Insults seem to be the fuel the project is run on.


>multiplied by the number of processes

Why aren't the pages shared like other dynamic libraries?


If we put the PID into the vDSO, this data page can no longer be shared among different processes (at least not in the same container).

We could use the rseq area for PID and TID.


Because you just asked to put data there that would differ per process?


That doesn't mean that code pages couldn't be shared. The data that differs per process is a single number in this case.

Having said that I agree that it's probably not worth it to put getpid() into vDSO.


A number here, a number there, before you know it, it is a lot.

If we're going to add all sorts of dross to the vdso, like getpid(), who knows what else will end up there?


> The only time your PID would change from a previous value obtained from that very same function is in the child, after a call to fork()

Or clone3, but with PID namespaces the answer to "what is the PID of a process" is a trickier question.


Library code sometimes needs to detect fork without being in control of when the application calls fork. Think things like CSPRNGs. The state of the art here is to put the CSPRNG state in a memory region that is set to zero on fork (madvise MADV_WIPEONFORK), instead of using getpid.
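A minimal sketch of that approach (hypothetical names; MADV_WIPEONFORK needs Linux >= 4.14 and an anonymous private mapping):

    #define _GNU_SOURCE
    #include <sys/mman.h>

    struct rng_state { unsigned char key[32]; int initialized; };

    static struct rng_state *alloc_rng_state(void) {
        struct rng_state *s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (s == MAP_FAILED)
            return NULL;
        /* After fork(), the child sees an all-zero region, so s->initialized == 0
           tells it to reseed instead of reusing the parent's CSPRNG state. */
        madvise(s, sizeof *s, MADV_WIPEONFORK);
        return s;
    }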


Well, would you look at that, yet another solution that doesn’t require a busy loop burning watts of energy.


Nothing busy-looped getpid(). A library used getpid() on each entry to check if the process had forked. The library was called many times.

getpid was a fairly common way to check for fork prior to mechanisms like madvise MADV_WIPEONFORK (Linux 4.14, released in 2017). systemd existed prior to 2017 and likely needed to support older kernels even after 4.14's initial release. And the getpid cache was something glibc had prior to April 2017, likely to support this kind of use case. It was even documented behavior[1].

So, anyway, what systemd was doing wasn't totally stupid. It became a lot more expensive when glibc broke it on them and then it needed to be improved. But before that it wasn't as objectionable as people in this thread are suggesting.

(Also: pthread_atfork requires linking pthreads, which is or at least historically was seen as a significant burden on applications which do not use pthreads.)

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1469670#c9


Linux can set the UID per thread; AFAIK the vDSO page is shared across threads.


Using https://github.com/c-blake/batch/blob/master/examples/closes... from the examples/ directory of https://github.com/c-blake/batch, I get:

    L2:bat/examples$ BATCH_EMUL=1 ./closes|awk '{print $3,$2,$1*$2}'|fitl -s0 -c,=,n,b -b100 1 2 3
    $1= 3.815 *$2 + 695.6826 *$3
    bootstrap-stderr-corr matrix
        0.8872    -0.7856
                  0.04174

    L2:bat/examples$ BATCH_EMUL=1 ./closes|awk '{print $3,$2,$1*$2}'|fitl -s0 -c,=,n,b -b100 1 2 3
    $1= 4.528 *$2 + 695.3265 *$3
    bootstrap-stderr-corr matrix
         1.174    -0.7640
                  0.05302

    L2:bat/examples$ a (695.3265 +- 0.05302)-(695.6826 +- 0.04174)
    -0.356 +- 0.067    # i.e. run-to-run consistent
    L2:bat/examples$ ./closes|awk '{print $3,$2,$1*$2}'|fitl -s0 -c,=,n,b -b100 1 2 3
    $1= 779.67 *$2 + 48.861 *$3
    bootstrap-stderr-corr matrix
         1.550    -0.7495
                   0.2023

    L2:bat/examples$ ./closes|awk '{print $3,$2,$1*$2}'|fitl -s0 -c,=,n,b -b100 1 2 3
    $1= 787.77 *$2 + 48.745 *$3
    bootstrap-stderr-corr matrix
         1.350    -0.6207
                   0.1635

    L2:bat/examples$ a (48.861 +- 0.2023)-(48.745 +- 0.1635)
    0.12 +- 0.26    # i.e. run-to-run consistent
    L2:bat/examples$ a (695.3265 +- 0.05302)/(48.745 +- 0.1635)
    14.265 +- 0.048    # i.e. batching makes it 14X faster!
One can go further to get a mean +- std.err of ~695.52 +- ~0.04 nanosec hot-cache time per syscall overhead, or even do point-weighting there, but it's probably more important to check fit residuals for serial autocorrelation (very doable with https://github.com/c-blake/fitl; EDIT: oh, and if you care, `a` is basically what is shown in https://github.com/SciNim/Measuremancer/pull/12, which uses standard error propagation).

The bigger point is not ever more careful measurement of the cost if you do not batch, but rather the ease, if you do, of "batchifying" any highly regular, high-latency call interface similar to syscalls, and also the "follow-on" API design, granularity-wise. The inner code of a mini assembly language letting you implement, say, open/fstat/mmap/close all in one syscall crossing is only about 30 lines of C. { Yes, yes.. it could use multi-CPU-arch and syscall auditing integration.. and sure, eBPF & io_uring alter the Linux landscape these days. } Also, @exDM69 is very correct that pure cache-hot loop numbers are only one part of the cost story (https://news.ycombinator.com/item?id=39188551).


A website from French Antarctica? Cool.



