Linux-insides: System calls in the Linux kernel, Part 2

amluto · on Aug 30, 2015

I'm in the process of rewriting basically all of the 32-bit syscall code and a decent amount of the 64-bit code. Enjoy!

Someone · on Aug 31, 2015

1. Why?

2. (How) will you keep the interface compatible?

caf · on Aug 31, 2015

As I understand it, the aim is to move as much of it as possible into C (rather than asm). Ultimately this should make it more maintainable, and less buggy.

Backwards compatibility is a given.

amluto · on Sept 2, 2015

Backwards compatibility isn't quite a given. It turns out that 32-bit programs running on 64-bit Linux and using the AT_SYSINFO fast system call mechanism (which seems to include anything built on Fedora using -m32 but seems not to include anything at all on 32-bit Debian -- go figure) that are ptraced do very strange things on AMD chips depending on what the ptracer does. I'm planning on making everything work the way it does on Intel and on native 32-bit systems [2].

I'm really, really hoping that no one hardcodes SYSCALL into 32-bit binaries. There's good reason to believe I'm right: it will fail with SIGILL on any Intel system and on all native 32-bit systems whatsoever, and it'll blow up in strange ways even on AMD systems unless the SYSCALL instruction is followed by an explicit reload of the SS register, because reasons [1].

If you're interested, v1 is here:

http://thread.gmane.org/gmane.linux.kernel/2030456

[1] It's reasonably widely known that Intel screwed up SYSRET in a way that caused basically every OS ever to have an exploitable root hole. It's less widely known that AMD screwed it up in a way that can cause the hidden part of the SS register to be corrupt. Very new Linux kernels work around this in the kernel, but most Linux kernels actually had an SS reload in the vDSO.

[2] AMD's design for the SYSCALL instruction as implemented for 32-bit code running under a 64-bit kernel is bad in a way that makes using it quite nasty. AMD's design for the SYSCALL instruction as implemented for 32-bit code running under a 32-bit kernel is so thoroughly awful that Linux has never supported it. [3] I've heard that it's also full of errata.

[3] If you're interested, think about what happens when you do SYSENTER with the TF flag set. This is already awful. Then think about what happens if you do the full 32-bit SYSCALL with TF set. Keep in mind that 32-bit systems don't have the (also rather broken) IST mechanism with which to fix up the damage that SYSCALL with TF set will cause.
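For anyone who wants to see whether their own process was handed AT_SYSINFO at all, here is a minimal sketch that reads the auxiliary vector through libc's getauxval(). It assumes a Linux system whose libc exports getauxval (glibc 2.16 or later, or musl); the helper name read_auxv is made up for this example. The two key values are the ones defined in <elf.h>.

```python
import ctypes

# Auxiliary-vector keys as defined in <elf.h>.
AT_SYSINFO = 32       # entry point of the fast system call stub (32-bit x86 processes)
AT_SYSINFO_EHDR = 33  # base address of the vDSO ELF image


def read_auxv(key):
    """Return the auxiliary-vector value for `key`, or 0 if the kernel did not supply it.

    `read_auxv` is a hypothetical helper name; it only wraps libc's getauxval().
    """
    libc = ctypes.CDLL(None)  # symbols already loaded into this process, libc included
    libc.getauxval.restype = ctypes.c_ulong
    libc.getauxval.argtypes = [ctypes.c_ulong]
    return libc.getauxval(key)


if __name__ == "__main__":
    print("AT_SYSINFO      = %#x" % read_auxv(AT_SYSINFO))
    print("AT_SYSINFO_EHDR = %#x" % read_auxv(AT_SYSINFO_EHDR))
```

A native 64-bit process should see 0 for AT_SYSINFO, since that entry is only supplied to 32-bit x86 processes; AT_SYSINFO_EHDR is normally non-zero whenever a vDSO is mapped.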

Someone · on Aug 31, 2015

Less buggy in the sense of "bugs make it into a release" or less prone to regressions that get discovered soon? I would guess the former, but would be interested in an example of the latter.

Also, I would guess that also runs the risk of making it less portable across compilers (you need non-standard compiler features to implement this in C). Is that a concern?

amluto · on Sept 2, 2015

> Also, I would guess that also runs the risk of making it less portable across compilers (you need non-standard compiler features to implement this in C). Is that a concern?

Actually, no. My code implements just enough in asm that the C part is a normal function using the normal C ABI.

There are some microoptimizations that would be possible if I were to rely on __builtin_frame_address, but GCC has some highly questionable optimizations (or arguably outright bugs) that make me quite nervous about using it.
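The split described above (a thin asm stub that saves registers, then an ordinary function that does the real work) can be modelled with a toy sketch. This is not the kernel's code: it is a Python stand-in, the SavedRegs record, the handlers, and the syscall numbers are all invented for illustration, and the register names simply follow the x86-64 Linux convention (number in rax, arguments in rdi, rsi, rdx, r10, r8, r9, result back in rax).

```python
import errno
import os
from dataclasses import dataclass


@dataclass
class SavedRegs:
    """Hypothetical stand-in for the register frame an asm entry stub would save."""
    rax: int = 0  # syscall number on entry, return value on exit
    rdi: int = 0
    rsi: int = 0
    rdx: int = 0
    r10: int = 0
    r8: int = 0
    r9: int = 0


def sys_getpid(*_unused):
    return os.getpid()


def sys_add(a, b, *_unused):
    # Made-up "syscall" that exists only so the example has an argument-taking handler.
    return a + b


# Invented numbering; real syscall numbers differ per architecture.
SYSCALL_TABLE = {0: sys_getpid, 1: sys_add}


def do_syscall(regs):
    """The 'normal function' half: look up the handler and call it like any other function."""
    handler = SYSCALL_TABLE.get(regs.rax)
    if handler is None:
        regs.rax = -errno.ENOSYS  # unknown syscall number
    else:
        regs.rax = handler(regs.rdi, regs.rsi, regs.rdx, regs.r10, regs.r8, regs.r9)
    return regs


if __name__ == "__main__":
    print(do_syscall(SavedRegs(rax=1, rdi=2, rsi=3)).rax)
```

The point of the model is that everything after the register save is plain function calls: the dispatcher receives one record and returns a value in it, so nothing in that half needs compiler-specific features.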

dnautics · on Aug 31, 2015

Does anyone know why syscall arguments are loaded into registers in Linux instead of left on the stack (as in FreeBSD, iirc)?

My guess would be performance... If that's the case, has anyone benchmarked that?

oso2k · on Aug 31, 2015

This seems a lot more complicated than what I implemented in rt0 [0].

[0] https://github.com/lpsantil/rt0

caf · on Aug 31, 2015

This is describing the kernel side of the syscall boundary, though.