This microbenchmark measures the direct cost of system calls. But flushing the TLB and trashing the caches carry an indirect cost too.
According to some benchmarks I've seen (sorry, no link), a system call in the middle of a memory-heavy inner loop can impact performance for 100 µs before performance is back to steady state. That's 100x longer than the system call alone. Of course work is still being done during that time, just at reduced throughput. This is in line with my practical experience from working with performance-sensitive code.
These things are difficult to benchmark reliably, and it's hard to draw actionable conclusions from the results.
This is not a criticism of the author of this article; the article clearly describes the methodology of the benchmarks and does not draw any wrong conclusions from the data.
It's a bit old and doesn't include recent microarchitectural changes, but Section 2 of the FlexSC paper from 2010 (https://www.usenix.org/legacy/event/osdi10/tech/full_papers/...) has a detailed discussion of these indirect effects. I especially like how they quantify the indirect effects by measuring user code's IPC after the syscall.
I think those are the benchmarks I alluded to in the GP post.
Table 1 on page 3 is absolute gold: it quantifies the indirect costs by listing the number of cache lines and TLB entries evicted. The numbers are much larger than I remembered.
According to the table, the simplest syscall tested (stat) will evict 32 icache lines (L1), a few hundred dcache lines (L1), hundreds of L2 lines and thousands of L3 lines, and about twenty TLB entries.
After returning from such a syscall, you'll pay a cache miss for every evicted line you touch again.
Also worth noting that inside the syscall, the instructions per clock (IPC) is less than 0.5. When the CPU is happy, you generally see IPC figures around 2 to 3.
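If you want to reproduce the flavor of that measurement on your own machine, here's a rough sketch of the idea (mine, not the paper's code): count user-mode-only instructions and cycles with perf_event_open around the same memory-heavy loop, once with and once without a syscall sprinkled in. getppid() just stands in for "some cheap syscall", and you may need /proc/sys/kernel/perf_event_paranoid to allow unprivileged counting.

    /* Sketch only: user-mode IPC of a memory-heavy loop, with and without
     * interleaved syscalls. Because exclude_kernel is set, the IPC drop you
     * see reflects the indirect effect on user code, not time in the kernel.
     * Error handling omitted for brevity. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    static int open_counter(unsigned long long config) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.size = sizeof attr;
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = config;
        attr.exclude_kernel = 1;   /* count user-mode work only */
        attr.exclude_hv = 1;
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    static double user_ipc(int with_syscalls) {
        enum { N = 1 << 22 };                      /* ~32 MiB working set */
        static long buf[N];
        int ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
        int cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);
        long long n_ins = 0, n_cyc = 0;
        long sum = 0;
        ioctl(ins, PERF_EVENT_IOC_RESET, 0);
        ioctl(cyc, PERF_EVENT_IOC_RESET, 0);
        for (unsigned i = 0; i < N; i++) {
            sum += buf[(i * 2654435761u) % N];     /* scattered loads */
            if (with_syscalls && (i & 0xffff) == 0)
                getppid();                         /* stand-in syscall */
        }
        volatile long sink = sum; (void)sink;      /* keep the loop live */
        read(ins, &n_ins, sizeof n_ins);
        read(cyc, &n_cyc, sizeof n_cyc);
        close(ins); close(cyc);
        return n_cyc ? (double)n_ins / (double)n_cyc : 0.0;
    }

    int main(void) {
        printf("user IPC without syscalls: %.2f\n", user_ipc(0));
        printf("user IPC with    syscalls: %.2f\n", user_ipc(1));
        return 0;
    }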
Yeah... FlexSC / Soares is my favorite paper from OSDI 2010. The system call batching with "looped" multi-call they mention there relates to the roughly 30-line (not actually looping) assembly language in my other comment here (https://news.ycombinator.com/item?id=39189135) and in a few ways presaged io_uring work.
Anyway, a 20-line example of a program written against said interpreter is https://github.com/c-blake/batch/blob/1201eefc92da9121405b79... but that only needs the wdcpy fake syscall, not the conditional jump forward (although that could/should be added if the open can succeed but the mmap can fail and you want the close clean-up also included in the batch, etc., etc.).
I believe Cassyopia (also mentioned in Soares) hoped to be able to analyze code in user-space with compiler techniques to automagically generate such programs, but I don't know that the work ever got beyond a HotOS paper (i.e. the kinda hopes & dreams stage), and it was never clear how fancy the anticipated batches were meant to be. The Xen/VMware multi-calls Soares 2010 also mentions do not seem to have inline copy/jumps, though I'd be pretty surprised if that little kernel module is the only example of it.
> According to some benchmarks I've seen (sorry, no link), a system call in the middle of a memory-heavy inner loop can impact performance for 100 µs before performance is back to steady state. That's 100x longer than the system call alone.
Do you have some intuition about what causes this? I don't have much experience with this kind of work, but 100 µs is enough time to do hundreds of random memory accesses. How can a single syscall do so much damage to the cache?
As the GP mentioned, it's TLB invalidation that can be the problem. Reloading the mappings from virtual (per-process) addresses to physical addresses after returning from the kernel can cause delays (on some architectures), even if most of the physical memory is still in the cache and valid.
(Also worth pointing out that they didn't claim a delay of 100 µs, just that some (presumably much smaller, on the order of ns?) delays can keep showing up for up to 100 µs before "steady state" is fully restored.)
As syscalls run more code and access more data, they can increase TLB pressure the same way they increase general cache pressure, but syscalls per se (outside of things like munmap) don't typically invalidate the TLB, as the mapping is not changed on a user-space/kernel-space transition, only the access rights (there were exceptions, like the brief period when people were running with a full 4GB user address space on 32-bit CPUs).
> but syscalls per se (outside of things like munmap) don't typically invalidate the TLB, as the mapping is not changed on a user-space/kernel-space transition, only the access rights
Isn't that exactly what changed for the Meltdown/Spectre mitigations? That it is invalidated now?
I think you are referring to KPTI, but a) it might not be needed on recent CPUs that have the Meltdown-like issues fixed, and b) in any case, on CPUs that can tag TLB entries with process identifiers (most of them these days), it can avoid TLB flushes most of the time.
But I don't claim any specific knowledge of these mitigations.
I don't think it is different, but it doesn't matter; the kernel can put whatever it wants in the register that is used to match the process-identifier tag in the TLB.
I've been on a code-deleting rampage lately, killing off many minor features that add a lot of complexity for seemingly little gain. Glad I don't have anyone interrogating me like this!
> in order to quickly fix this pressing matter, I've attached code that I believe will fix your issue. It uses a custom caching mechanism and fits well into the systemd ecosystem.
That was interesting - "Umm, this change causes major slowdowns in various bits of systemd code, which assumes that getpid() is fast. libsystemd uses getpid() to detect when processes fork".
A regular userspace program doesn't need to detect when a fork() has occurred, since it's what's doing the fork().
"I ran the benchmark on a heterogeneous set of hosts, i.e. different kernels, operating systems and configurations"
In the fine print: the "different OS:es" were two Linux distributions (RHEL and Fedora), which are (if I understand correctly) not even that far apart tech-wise?
This was a while ago and I don't remember the exact details...
I had two loops, each with a large number of iterations. One called a trivial Win32 function (GetACP); the other made a system call (NtClose) as well as the same trivial Win32 function. I used QueryPerformanceCounter to time the loops.
From there, I estimated the number of machine cycles per iteration (using my clock speed at the time), and subtracted the "trivial" version which made no system calls.
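For reference, a rough reconstruction of that setup (from memory, so treat it as a sketch rather than the original code). NtClose is resolved from ntdll at runtime and called on an invalid handle, so it does nothing beyond the kernel round trip:

    /* Sketch: time GetACP() alone vs GetACP()+NtClose(), subtract to estimate
     * the per-call syscall cost. Closing an invalid handle just fails fast
     * (though it can raise under a debugger with strict handle checking). */
    #include <windows.h>
    #include <stdio.h>

    typedef LONG (WINAPI *NtClose_t)(HANDLE);

    int main(void) {
        const int N = 5 * 1000 * 1000;
        NtClose_t NtClose_p =
            (NtClose_t)GetProcAddress(GetModuleHandleA("ntdll.dll"), "NtClose");
        LARGE_INTEGER f, t0, t1, t2;
        QueryPerformanceFrequency(&f);

        QueryPerformanceCounter(&t0);
        for (int i = 0; i < N; i++)
            GetACP();                                 /* trivial, stays in user mode */
        QueryPerformanceCounter(&t1);
        for (int i = 0; i < N; i++) {
            GetACP();
            NtClose_p((HANDLE)(ULONG_PTR)0x7FFFFFF0); /* invalid handle, fails fast */
        }
        QueryPerformanceCounter(&t2);

        double base = (double)(t1.QuadPart - t0.QuadPart) / f.QuadPart;
        double with = (double)(t2.QuadPart - t1.QuadPart) / f.QuadPart;
        printf("~%.0f ns per NtClose round trip\n", (with - base) * 1e9 / N);
        return 0;
    }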
Peeking at "ntoskrnl.exe" in Ghidra, it appears that "NtVdmControl" is a true "do-nothing" system call in Windows 10 x64, and Windows 11 removed the function entirely.
The measured mitigation overhead is very interesting. Those XG machines have very fast syscalls (<100 ns), but are more than twice as slow with mitigations on, even though they are fairly recent and most Meltdown/Spectre bugs should already have been fixed in hardware.
Now that's interesting. I remember years ago profiling the gettimeofday() function on various platforms. It was fine on all except our AWS VMs. Turns out it wasn't implemented via the vDSO there and the performance was all over the place. That's been fixed since.
We use getpid() in logging and maybe it's unnecessary.
Any good resources/deeper dives into the details of what actually happens on modern computers when making these syscalls on, say, Linux? I might have a reasonable idea of what goes on when performing a mode switch or context switch, but I’d love to have a reference or a nice walkthrough.
My introduction to this was the 'Direct Execution' chapter of the book 'Operating Systems: Three Easy Pieces'.
It's a fantastic write-up on not just how system calls work, but the motivation behind why operating systems even implement them in the first place. It's not specifically about Linux, but the book is clearly heavily inspired by early *nix designs and it's still applicable.
I wrote a detailed walk-through of the illumos syscall handler back in 2016. It's missing the updates around KPTI (meltdown mitigation), but other than that the mechanisms should be unchanged.
> It’s not a great idea to call system calls by writing your own assembly code.
> One big reason for this is that some system calls have additional code that runs in glibc before or after the system call runs.
That's not really the fault of the system calls. The problem is the C library itself. It's too stateful, it's full of global data and action at a distance. If you use certain system calls, you invalidate its countless assumptions and invariants, essentially pulling the rug from under it.
I've found programming in freestanding C with Linux system calls and no libc to be a very rewarding experience. Getting rid of libc and its legacy vastly improves C as a language. The Linux system call interface is quite clean. You don't even have to deal with the nonsense that is the errno variable.
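For the curious, here's a minimal sketch of what that looks like on x86-64 Linux (assuming gcc or clang; build with something like `gcc -nostdlib -static hello.c -o hello`, possibly adding -fno-stack-protector on hardened toolchains):

    /* Freestanding "hello" via raw syscalls, no libc. x86-64 only. */
    static long sys3(long nr, long a1, long a2, long a3) {
        long ret;
        __asm__ volatile ("syscall"
                          : "=a"(ret)
                          : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                          : "rcx", "r11", "memory");
        return ret;   /* on error this is -errno; no global errno involved */
    }

    void _start(void) {
        static const char msg[] = "hello from raw syscalls\n";
        sys3(1 /* write */, 1 /* stdout */, (long)msg, sizeof msg - 1);
        sys3(60 /* exit */, 0, 0, 0);
    }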
Sounds roughly equivalent to discussing how much more rewarding it is to walk 1000km than to drive it.
Yes, everybody knows you can get from A to B on foot. But for multiple reasons, and over an extended period of time, we developed other methods of doing so. That doesn't invalidate the experience of doing it on foot, but it does make the walk a very conscious choice that probably isn't the one most people are going to make most of the time.
But the internal details of libc leaking is accidental, not essential, complexity for the problem at hand. There's the baseline level of difficulty with writing assembly language, but the libc influence is imposed on top of that.
If the policy of the OS is for programmers to only use libc for syscalls, perhaps that's a valid caveat against asm syscalls, but I don't think Linux is such an OS.
It absolutely is a policy of the OS and whoever is in charge of it. Nearly all of them provide their own libraries and decree that the only supported way to interface with the system is via those libraries. So you cannot "find a different libc or do without libc" on the vast majority of operating systems out there. You literally cannot get away from the OS libraries. If you try, your software inevitably breaks when they change system call numbers and semantics. Golang found this out the hard way. Here's an awesome quotation from a BSD guy:
> I have altered the ABI. Pray I do not alter it any further.
Linux is actually the odd one. It's the only one with a stable language-agnostic system call ABI. It's stable because breaking the ABI makes Linus Torvalds send out extremely angry emails to the people involved until they fix it. Because of this stability, you actually can have alternative libc implementations like musl and you can also rewrite literally everything in Rust or Lisp if you want. Only on Linux can you do this.
For some reason people try extra hard to make it look like glibc is some integral part of the kernel. Diagrams on Wikipedia show glibc enveloping the kernel like it's some outer component. Look up Linux manuals and you somehow get glibc manuals instead. The truth is they're completely independent projects. The GNU developers have zero power to force anyone to use glibc on Linux. You can make a freestanding application using system calls directly and literally boot Linux into it. I have made it my goal to create an entire programming language and ecosystem centered around that exact concept.
You are using the term "OS" to mean something other than "kernel", which is the meaning required by this thread (and arguably entire topic) thus far.
This is demonstrated by the rest of your comment, in which you note that Linux has a stable, language-agnostic system call ABI, and that other libc implementations exist.
TFA was ONLY about Linux syscalls, so what other platforms do or don't do is irrelevant in context.
I understand what you mean. It's just that I object on principle to this notion that using system calls is "wrong" or "something normal people aren't supposed to do" as if the libc programmers have special privileges. It's these constant warnings about and fearmongering around the system calls that made me want to use them in the first place.
The warnings etc. are for people who, more likely than not, are using a libc implementation that depends upon its own wrapping of syscalls. As such, they are entirely appropriate.
Do they apply to all possible uses of syscalls? They do not.
Because what getpid returns can change during the lifetime of a process, if it moves to a different PID namespace. I guess updating the vdso every time that happens is not worth the little gains.
> ... can change during the lifetime of a process, ...
Fun fact: this is already an issue for the vdso.
gettimeofday(2) can use the CPU's built-in cycle counter, but it needs information from the kernel to convert cycles to an actual timestamp (a start point and a scale). This information, too, can change at runtime.
To keep the kernel from trampling over userspace while it reads this data, the vdso contains an open-coded spinlock that is used when accessing this information.
I learnt about this while debugging a fun issue with a real-time co-kernel where the userspace thread occasionally ended up deadlocking inside gettimeofday and triggering the watchdog :-)
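If it helps, the pattern looks roughly like the sketch below (illustrative names only, not the real kernel/vdso identifiers). The kernel bumps a sequence counter around updates to a data page it shares read-only with every process, and the userspace reader retries if it sees an update in progress or a change underneath it. In current Linux this is a seqcount-style retry loop rather than a classic mutual-exclusion lock, but the reader does spin, which is presumably how the co-kernel setup managed to deadlock inside gettimeofday.

    /* Illustrative only; fields and names are made up, error handling omitted. */
    struct time_data {
        volatile unsigned seq;          /* odd while the kernel is mid-update */
        unsigned long long base_cycles; /* cycle count at last kernel update */
        unsigned long long base_ns;     /* wall-clock ns at last kernel update */
        unsigned mult, shift;           /* cycles -> ns scaling */
    };

    static unsigned long long read_clock_ns(const struct time_data *d,
                                            unsigned long long (*read_cycles)(void)) {
        unsigned seq;
        unsigned long long ns;
        do {
            do {                         /* wait out an in-progress update */
                seq = d->seq;
            } while (seq & 1);
            __atomic_thread_fence(__ATOMIC_ACQUIRE);
            unsigned long long delta = read_cycles() - d->base_cycles;
            ns = d->base_ns + ((delta * d->mult) >> d->shift);
            __atomic_thread_fence(__ATOMIC_ACQUIRE);
        } while (seq != d->seq);         /* kernel updated while we read: retry */
        return ns;
    }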
Does avoiding userspace being trampled by the kernel mean avoiding a context switch? Does the kernel write to userspace-accessible memory that the spinlock guards?
Unrelated, but what does "open-coded" mean? I never seem to find an obvious answer online.
GP's question is perfectly valid. It doesn't have to be called within a tight loop at all. It could be a latency sensitive routine for example. I can see at least a few real life applications where it would make sense to have the syscall vDSO'd.
It’s not a VDSO because it was never added as one. “Not a VDSO” is the default; a VDSO is additional complexity and maintenance, so you need to justify its addition / existence, not its non-existence.
I think we're just talking past each other. I'm asking why you'd want to vDSO getuid(), not getpid(). I can imagine cases where you'd repeatedly be calling getpid().
Eh. All of these things requiring a context switch seems alien to me. So it’s less of a use case and more “just make it fast” since the incremental cost should be 0?
VDSO is designed for things that should be fast and are called often. There is absolutely no reason to call those functions often.
The only time your PID would change from a previous value obtained from that very same function is in the child, after a call to fork(), which is itself so heavy that it would dwarf your calls to getpid().
Is it a scarce resource? For what it's worth, when the caching was removed from glibc, it caused a performance problem in libsystemd, which had a tight loop to detect when a process was forked. So while this may be true in many cases, there are clearly cases where it's not, and your answer doesn't explain why getpid isn't a vdso.
Maybe the problem is that it is "lib"systemd, meaning it doesn't really have control over when the process forks; that's the application code's responsibility. I would probably also avoid this design, though that would probably make the library interface a bit unfriendly, with the application needing to pass its own pid to library functions where appropriate.
The audacity of the design, detecting a pid change by looking at it in a loop, is interesting. I don't think I could have come up with something like that. Feels very "sync the web DOM by traversing it and comparing with a shadow DOM".
If only we had an API to tell us when a fork happened. We could place it into the usual library for threading stuff: pthread, and could call it, oh I don't know, maybe pthread_atfork()
It is mentioned in the glibc bug tracker entry as a replacement for the polling. systemd didn't use it because they thought it would make it impossible for a library using it to be dlclose'd, as there is no way to unregister the callback. But apparently glibc has (undocumented) magic under the hood to take care of it.
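The replacement pattern is roughly this (a sketch, not systemd's actual code): cache the pid once and let the child fork handler invalidate the cache, instead of calling getpid() on every library entry.

    #include <pthread.h>
    #include <sys/types.h>
    #include <unistd.h>

    static pid_t cached_pid;                    /* 0 = not cached yet */

    static void pid_cache_reset(void) {         /* runs in the child after fork() */
        cached_pid = 0;
    }

    __attribute__((constructor))                /* GCC/Clang extension */
    static void pid_cache_init(void) {
        pthread_atfork(NULL, NULL, pid_cache_reset);
    }

    static pid_t cached_getpid(void) {
        if (cached_pid == 0)
            cached_pid = getpid();              /* benign race: all writers agree */
        return cached_pid;
    }

Which still has the dlclose problem mentioned above if the library can be unloaded, and it does mean linking pthreads.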
Memory is always a scarce resource. However big you make the vdso, that amount of memory, multiplied by the number of processes, is the amount wasted. In reality, it will be more, because of the additional page tables required to map it at an address range that is not commonly used.
Also, what sort of raving lunatic had a busy loop checking the current process PID? Who allowed such a person access to a compiler? That seems like the bigger problem.
I would assume good faith on the part of systemd developers. Afaik they’re as talented as any other top tier group of developers.
Somehow I doubt the code for getpid as a vdso would meaningfully impact the amount of memory used on a running system nor would I expect any meaningful impact to the number of page table entries. Do you have any supporting evidence for your claim otherwise?
No, libsystemd did this because it's a library. They went with pthread_atfork as an alternative way to detect when the application code forked, but it's not without its own set of risks. Of course they could have required the user of the API to supply the PID everywhere to avoid this, but it's possible even that wouldn't solve the problem, and it complicates the UI, and the UI of a library API is just as important as the other technical details. It's a design tradeoff, and judging it as some kind of correctness issue without having a clear picture, and insulting the authors out of hand, is weird behavior.
I agree with everything you said except that "systemd" and "insulting" don't go together like peanut butter and jelly. Insults seem to be the fuel the project is run on.
Library code sometimes needs to detect fork without being in control of when the application calls fork. Think things like CSPRNGs. The state of the art here is to put the CSPRNG state in a memory region that is set to zero on fork (madvise MADV_WIPEONFORK), instead of using getpid.
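A sketch of that approach (the names are mine, not from any particular CSPRNG): keep the state in its own anonymous mapping marked MADV_WIPEONFORK, so a child sees all-zero state and knows to reseed. Needs Linux 4.14+ and a reasonably recent glibc for getrandom().

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <sys/random.h>
    #include <stdint.h>

    struct prng_state {
        uint64_t key[4];
        uint64_t counter;
        int seeded;          /* zeroed by the kernel in the child after fork() */
    };

    static struct prng_state *prng_create(void) {
        struct prng_state *st = mmap(NULL, sizeof *st, PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (st == MAP_FAILED)
            return NULL;
        madvise(st, sizeof *st, MADV_WIPEONFORK);   /* child gets zero-filled pages */
        return st;
    }

    static void prng_ensure_seeded(struct prng_state *st) {
        if (!st->seeded) {                          /* first use, or we just forked */
            getrandom(st->key, sizeof st->key, 0);
            st->counter = 0;
            st->seeded = 1;
        }
    }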
Nothing busy looped getpid. A library used getpid on each entry to check if the process had forked. The library was called many times.
getpid was a fairly common way to check for fork prior to mechanisms like madvise WIPEONFORK (Linux 4.14, released in 2017). systemd existed prior to 2017 and likely needed to support older kernels past 4.14's initial release. And the getpid cache was something glibc had prior to April 2017, likely to support this kind of use case. It was even documented behavior[1].
So, anyway, what systemd was doing wasn't totally stupid. It became a lot more expensive when glibc broke it on them and then it needed to be improved. But before that it wasn't as objectionable as people in this thread are suggesting.
(Also: pthread_atfork requires linking pthreads, which is or at least historically was seen as a significant burden on applications which do not use pthreads.)
One can go further to get a mean +- std.err of ~695.52 +- ~0.04 nanosec hot-cache time per syscall overhead or even do point-weighting there, but it's probably more important to check fit residuals for serial autocorrelation (very do-able with https://github.com/c-blake/fitl, EDIT: oh, and if you care `a` is basically shown in https://github.com/SciNim/Measuremancer/pull/12 which uses standard error propagation).
The bigger point would not be ever more careful measurement of the cost if you don't batch, but rather how easy it is, if you do, to batch any highly regular, high-latency call interface like syscalls, and also the "follow on" API design, granularity-wise. The inner code of a mini-assembly language letting you implement, say, open/fstat/mmap/close all in one syscall crossing is only about 30 lines of C. { Yes, yes.. it could use multi-CPU-arch and syscall auditing integration.. and sure, ebpf & io_uring alter the Linux landscape these days. } Also, @exDM69 is very correct that pure hot-cache loop numbers are only one part of the cost story (https://news.ycombinator.com/item?id=39188551).
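To make the open/fstat/mmap/close example concrete, here's the everyday, unbatched version (plain POSIX, not the linked project's API): each call is its own user/kernel crossing, and the batch interpreter's job is to run the same sequence, feeding results forward and bailing out early on failure, in a single crossing.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    void *map_file(const char *path, size_t *len_out) {
        int fd = open(path, O_RDONLY);              /* crossing 1 */
        if (fd < 0)
            return NULL;
        struct stat st;
        if (fstat(fd, &st) < 0) {                   /* crossing 2 */
            close(fd);
            return NULL;
        }
        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0); /* crossing 3 */
        close(fd);                                  /* crossing 4 */
        if (p == MAP_FAILED)
            return NULL;
        *len_out = (size_t)st.st_size;
        return p;
    }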