If anyone's looking for an example, I used this trick a few months ago to embed a tiny helper binary[0] directly into my application[1] so I wouldn't have to ship two executables or add "hidden" behavior to the main program. It works really well (on Linux)!
I've tested in Fabrice Bellard's JSLinux with tcc (x86) and on https://replit.com/languages/c (x64). I failed to see any side effects at all. gdb's "catch syscall" doesn't show anything interesting either. Looks like TINY_ELF_PROGRAM isn't doing anything.
Even though memfd was introduced in Linux 3.17 [1], it took a few years for its other uses to become apparent enough to gain widespread adoption. In the Windows world, MemDllLoader et al. were widely used by malware to reduce forensic fingerprints, but no architecture-portable or lightweight solution existed for Linux.
Nowadays, with a combination of eBPF, AppArmor, cgroups, KVM, an NX stack, and a strict firewall, it's possible to almost entirely prevent external code from being run (after performing in-depth profiling of its intended behaviour). Sadly nobody does that, and if anything Linux on the desktop is missing most, if not all, of the process isolation features Android and iOS have.
All those low-level isolation tools are hard to use. OK, creating namespaces is easy enough, but it doesn't end there: you may also need to set up the filesystem, pivot_root, set up seccomp, make sure no privileged file descriptors are left lying around, apply various prctls, etc. etc.
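To give a flavour, here's a rough sketch in C of just the first couple of those steps (not a complete sandbox; note that even exiting cleanly needs care once strict seccomp is on):

    #define _GNU_SOURCE
    #include <linux/seccomp.h>
    #include <sched.h>
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        /* New user+mount+PID+net namespaces; CLONE_NEWUSER lets even an
           unprivileged process do this on most distros. */
        if (unshare(CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET))
            perror("unshare");

        /* ...uid_map/gid_map writes, filesystem setup, pivot_root(2),
           closing privileged fds etc. would all go here... */

        /* Irrevocably forbid gaining privileges (setuid binaries etc.). */
        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

        /* Strict seccomp: only read/write/exit/sigreturn from here on. */
        prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

        write(1, "sandboxed\n", 10);
        /* Even exiting is a trap: glibc's _exit() uses exit_group(2),
           which strict mode kills, so use the raw exit(2) syscall. */
        syscall(SYS_exit, 0);
    }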
OpenBSD's pledge+unveil are more convenient in many scenarios. And because inheritance is optional with them, you can also compose self-isolating components more easily than on Linux.
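For comparison, the whole OpenBSD version fits in a couple of calls. A minimal sketch (OpenBSD only):

    #include <err.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Expose only /etc read-only, then lock the filesystem view. */
        if (unveil("/etc", "r") == -1) err(1, "unveil");
        if (unveil(NULL, NULL) == -1) err(1, "unveil");

        /* From here on: stdio plus read-only file access, nothing else.
           Restrictions aren't inherited across exec unless you ask for
           it via the second argument (the execpromises). */
        if (pledge("stdio rpath", NULL) == -1) err(1, "pledge");

        FILE *f = fopen("/etc/hosts", "r");
        if (f) { /* ... */ fclose(f); }
        return 0;
    }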
Only because you/we don't have the keys. It's not the process isolation stuff that makes them a walled garden, it's the lack of user configurability of those protections.
You have chroot jails and SELinux and other such features on Linux; that doesn't make it a "walled garden". As the sibling comment said, it's about who controls the garden and who does or doesn't let others inside the wall.
Or, better put, this is not like making Linux a "walled garden" but being able to put walls around each app you run - which is different.
This is not missing. For example, Qubes OS has very strong process isolation, spinning up separate read-only VMs that can (optionally) be completely disposable, so it can safely be used to run actively hostile code if required.
VMs are substantially heavier than process isolation. Isolated processes can share certain resources, VMs often boot a separate instance of absolutely everything they depend on (down to the kernel or even emulated hardware).
That was an example to show it's not something missing from the Linux world. There are less extreme examples across the whole spectrum of isolation, from Firejail at one end (which uses capabilities), through a couple of container-based approaches, all the way to Qubes OS, which I'd say is the most robust from a security standpoint and is also the example I chose because I'm personally most familiar with it.
As for the side question of switching between 64- and 32-bit mode in the same process: this is classically known on Windows as "heaven's gate", and a similar technique seems possible on Linux too: https://gist.github.com/rqou/1a1834b784283add7955af430097311...
Correct, but in lots of scenarios (containers etc.) you cannot execute ptrace(); you can, however, execute mmap(), mprotect(), read(), and write(), which is all you really need. Edit: and fork().
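A minimal sketch of that primitive on x86-64 Linux: map writable memory, copy code in, then flip it to executable (you never need a W+X mapping, so an NX stack is no obstacle):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        /* x86-64 machine code for: mov eax, 42; ret */
        unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

        /* Map RW, copy the code in, then mprotect to RX. */
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;
        memcpy(p, code, sizeof code);
        if (mprotect(p, 4096, PROT_READ | PROT_EXEC)) return 1;

        int (*fn)(void) = (int (*)(void))p;
        printf("returned %d\n", fn());   /* prints 42 */
        return 0;
    }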
I haven't tried it, but I would expect `shm_open(3)` to also work in a pinch. The downside there is probably a smaller maximum process size (4MB?), but that's plenty for most uses.
On Linux, shm_open(3) just creates a regular file under the tmpfs mounted at /dev/shm. It won't work if the program doesn't have permission to create files there.
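For reference, the shm_open(3) flow looks roughly like this (older glibc needs -lrt):

    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        /* Creates /dev/shm/example on Linux, so it fails if that tmpfs
           isn't mounted or isn't writable, unlike memfd_create(2). */
        int fd = shm_open("/example", O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd == -1) err(1, "shm_open");
        shm_unlink("/example");   /* fd keeps it alive; name goes away */

        if (ftruncate(fd, 4096) == -1) err(1, "ftruncate");
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) err(1, "mmap");
        snprintf(p, 4096, "hello from shared memory");
        puts(p);
        return 0;
    }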
Sure, but isn’t that equally true for the tmpfs that memfds use? I’d expect normal userspace programs to have access to both under most conditions, including when all disk-backed file systems are read-only.
Edit: I was wrong about the names given to memfd objects, I thought they showed up under /dev somewhere but they’re purely for debugging purposes.
A memfd is a tmpfs-backed file descriptor, but it doesn't use any mounted tmpfs filesystem. It works no matter what filesystems are mounted or what access you have.
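A minimal sketch of the whole trick, using /bin/true as a stand-in for an embedded helper binary:

    #define _GNU_SOURCE
    #include <err.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char *argv[], char *envp[]) {
        /* Stand-in payload; a real embedded helper would be baked into
           the binary (objcopy, xxd -i, ...) and written out instead. */
        int src = open("/bin/true", O_RDONLY);
        if (src == -1) err(1, "open");
        struct stat st;
        if (fstat(src, &st) == -1) err(1, "fstat");

        /* Anonymous in-memory file: no tmpfs mount or write access to
           any filesystem required. */
        int fd = memfd_create("helper", MFD_CLOEXEC);
        if (fd == -1) err(1, "memfd_create");
        if (sendfile(fd, src, NULL, st.st_size) != st.st_size)
            err(1, "sendfile");

        /* Execute straight from the fd. MFD_CLOEXEC is fine for ELF
           binaries; it only breaks #! scripts, which still need the
           fd after exec. */
        char *child_argv[] = { "helper", NULL };
        fexecve(fd, child_argv, envp);
        err(1, "fexecve");   /* only reached on failure */
    }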
It's truly great for situations where APIs refuse to take anything other than files and you don't want to worry about cleanup. E.g. loading certs from memory into a Python OpenSSL context.
Just the other day I was wondering why execve is a syscall in the first place; it feels like a lot of what it's doing under the hood really belongs in userspace, as it doesn't require special privileges.
Definitely tricky. I solved it with a Python implementation by building up a big jump buffer, so that the moment I leave Python-land I copy from temporary buffers to the right addresses and then ultimately jump to the entry point of the newly loaded binary. It was tricky and took quite some debugging to get right, but it's proven rather solid now.
Naive question: what are the non-nefarious reasons for needing to do this? I get that people have used it to work around things (for good and ill), but what's the vanilla answer?
This is technically still a work-around, but in container runtimes (specifically runc and LXC) we do this to defend against a class of attacks where the container can overwrite the host runc binary (meaning the next time some container operation is done, the attacker's code is executed as root on the host)[1]. Doing this each time the container starts ensures any such attack will only overwrite its own (short-lived) copy of the binary, and it lets us do this without write access to any filesystem that allows exec.
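The rough shape of that defense, as a sketch (not the actual runc code; the ALREADY_CLONED guard variable is made up for illustration):

    #define _GNU_SOURCE
    #include <err.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    extern char **environ;

    int main(int argc, char *argv[]) {
        if (getenv("ALREADY_CLONED") == NULL) {   /* avoid a re-exec loop */
            int self = open("/proc/self/exe", O_RDONLY);
            if (self == -1) err(1, "open");
            struct stat st;
            if (fstat(self, &st) == -1) err(1, "fstat");

            int fd = memfd_create("cloned-binary",
                                  MFD_CLOEXEC | MFD_ALLOW_SEALING);
            if (fd == -1) err(1, "memfd_create");
            if (sendfile(fd, self, NULL, st.st_size) != st.st_size)
                err(1, "sendfile");

            /* Seal the copy so nothing (container included) can modify
               it, then continue from the sealed in-memory copy. */
            if (fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW |
                                       F_SEAL_WRITE | F_SEAL_SEAL) == -1)
                err(1, "fcntl");
            setenv("ALREADY_CLONED", "1", 1);
            fexecve(fd, argv, environ);
            err(1, "fexecve");
        }
        /* ...normal (now self-protected) runtime logic continues here... */
        return 0;
    }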
Unfortunately we don't use this all the time, because some Kubernetes unit tests started failing when we first added this protection (the size of the binary gets added to the memory usage of each container, which caused those tests to use more memory than before). Ironically, this exact protection would've protected us from Dirty COW and other such bugs, but it's disabled by default (instead we make a temporary read-only bind-mount that we then exec, which is slightly less safe but doesn't add ~10MB to every container's memory usage).
But the actual answer to your question is that this was not originally intended behaviour (when we mentioned to the mm and fs folks that we were doing this, they weren't happy), and there have been patches posted recently to make it something you have to explicitly opt in to.
[0]: https://github.com/impl/systemd-user-sleep/blob/666cf29871b1...
[1]: https://github.com/impl/systemd-user-sleep/blob/666cf29871b1...