If anyone's looking for an example, I used this trick a few months ago to embed a tiny helper binary[0] directly into my application[1] so I wouldn't have to ship two executables or add "hidden" behavior to the main program. It works really well (on Linux)!
I've tested in Fabrice Bellard's JSLinux with tcc (x86) and on https://replit.com/languages/c (x64). I failed to see any side effects at all. gdb's "catch syscall" doesn't show anything interesting either. Looks like TINY_ELF_PROGRAM isn't doing anything.
Even though memfd was introduced in Linux 3.17 [1], it took a few years for its other uses to become apparent enough to gain widespread adoption. In the Windows world, MemDllLoader et al. were widely used by malware to reduce forensic fingerprints, but no architecture-portable or lightweight solution existed for Linux.
Nowadays, with a combination of eBPF, AppArmor, cgroups, KVM, an NX stack, and a strict firewall, it's possible to almost entirely prevent external code from being run (after performing in-depth profiling of its intended behaviour). Sadly nobody does that, and if anything Linux on the desktop is missing most, if not all, of the process isolation features Android and iOS have.
All those low-level isolation tools are hard to use. OK, creating namespaces is easy enough, but it doesn't end there: you may also need to set up the filesystem, pivot_root, set up seccomp, make sure no privileged file descriptors are left lying around, apply various prctls, etc. etc.
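To give a flavour, here's a rough sketch in C of just the first couple of those steps (not a complete sandbox; note that even exiting cleanly needs care once strict seccomp is on):

    #define _GNU_SOURCE
    #include <linux/seccomp.h>
    #include <sched.h>
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        /* New user+mount+PID+net namespaces; CLONE_NEWUSER lets even an
           unprivileged process do this on most distros. */
        if (unshare(CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET))
            perror("unshare");

        /* ...uid_map/gid_map writes, filesystem setup, pivot_root(2),
           closing privileged fds etc. would all go here... */

        /* Irrevocably forbid gaining privileges (setuid binaries etc.). */
        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

        /* Strict seccomp: only read/write/exit/sigreturn from here on. */
        prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

        write(1, "sandboxed\n", 10);
        /* Even exiting is a trap: glibc's _exit() uses exit_group(2),
           which strict mode kills, so use the raw exit(2) syscall. */
        syscall(SYS_exit, 0);
    }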
OpenBSD's pledge+unveil are more convenient in many scenarios. And because inheritance is optional with them, you can also compose self-isolating components more easily than on Linux.
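For comparison, the whole OpenBSD version fits in a couple of calls. A minimal sketch (OpenBSD only):

    #include <err.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Expose only /etc read-only, then lock the filesystem view. */
        if (unveil("/etc", "r") == -1) err(1, "unveil");
        if (unveil(NULL, NULL) == -1) err(1, "unveil");

        /* From here on: stdio plus read-only file access, nothing else.
           Restrictions aren't inherited across exec unless you ask for
           it via the second argument (the execpromises). */
        if (pledge("stdio rpath", NULL) == -1) err(1, "pledge");

        FILE *f = fopen("/etc/hosts", "r");
        if (f) { /* ... */ fclose(f); }
        return 0;
    }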
Only because you/we don't have the keys. It's not the process isolation stuff that makes them a walled garden, it's the lack of user configurability of those protections.
You have chroot jails and SELinux and other such features on Linux; that doesn't make it a "walled garden". As the sibling comment said, it's about who controls the garden and who does or doesn't let others inside the wall.
Or, better put, this is not like making Linux a "walled garden" but being able to put walls around each app you run - which is different.
This is not missing. For example, Qubes OS has very strong process isolation, spinning up separate read-only VMs that can (optionally) be completely disposable, so it can safely be used to run actively hostile code if required.
VMs are substantially heavier than process isolation. Isolated processes can share certain resources, VMs often boot a separate instance of absolutely everything they depend on (down to the kernel or even emulated hardware).
That was an example to show it's not something missing from the Linux world. There are less extreme examples across the whole spectrum of isolation, from Firejail at one end (which uses capabilities), through a couple of container-based approaches, all the way to Qubes OS, which I'd say is the most robust from a security standpoint and is also the example I chose because I'm personally most familiar with it.
As for the side question of switching between 64- and 32-bit mode in the same process: this is classically known on Windows as "heaven's gate", and a similar technique seems possible on Linux too: https://gist.github.com/rqou/1a1834b784283add7955af430097311...
Correct, but in lots of scenarios (containers etc.) you cannot execute ptrace(); you can, however, execute mmap(), mprotect(), read(), and write(), which is all you really need. Edit: and fork().
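A minimal sketch of that primitive on x86-64 Linux: map writable memory, copy code in, then flip it to executable (you never need a W+X mapping, so an NX stack is no obstacle):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        /* x86-64 machine code for: mov eax, 42; ret */
        unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

        /* Map RW, copy the code in, then mprotect to RX. */
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;
        memcpy(p, code, sizeof code);
        if (mprotect(p, 4096, PROT_READ | PROT_EXEC)) return 1;

        int (*fn)(void) = (int (*)(void))p;
        printf("returned %d\n", fn());   /* prints 42 */
        return 0;
    }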
I haven't tried it, but I would expect `shm_open(3)` to also work in a pinch. The downside there is probably a smaller maximum process size (4MB?), but that's plenty for most uses.
On Linux, shm_open(3) just creates a regular file under the tmpfs mounted at /dev/shm. It won't work if the program doesn't have permission to create files there.
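For reference, the shm_open(3) flow looks roughly like this (older glibc needs -lrt):

    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        /* Creates /dev/shm/example on Linux, so it fails if that tmpfs
           isn't mounted or isn't writable, unlike memfd_create(2). */
        int fd = shm_open("/example", O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd == -1) err(1, "shm_open");
        shm_unlink("/example");   /* fd keeps it alive; name goes away */

        if (ftruncate(fd, 4096) == -1) err(1, "ftruncate");
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) err(1, "mmap");
        snprintf(p, 4096, "hello from shared memory");
        puts(p);
        return 0;
    }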
Sure, but isn’t that equally true for the tmpfs that memfds use? I’d expect normal userspace programs to have access to both under most conditions, including when all disk-backed file systems are read-only.
Edit: I was wrong about the names given to memfd objects, I thought they showed up under /dev somewhere but they’re purely for debugging purposes.
A memfd is a tmpfs-backed file descriptor, but it doesn't use any mounted tmpfs filesystem. It works no matter what filesystems are mounted or what access you have.
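A minimal sketch of the whole trick, using /bin/true as a stand-in for an embedded helper binary:

    #define _GNU_SOURCE
    #include <err.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char *argv[], char *envp[]) {
        /* Stand-in payload; a real embedded helper would be baked into
           the binary (objcopy, xxd -i, ...) and written out instead. */
        int src = open("/bin/true", O_RDONLY);
        if (src == -1) err(1, "open");
        struct stat st;
        if (fstat(src, &st) == -1) err(1, "fstat");

        /* Anonymous in-memory file: no tmpfs mount or write access to
           any filesystem required. */
        int fd = memfd_create("helper", MFD_CLOEXEC);
        if (fd == -1) err(1, "memfd_create");
        if (sendfile(fd, src, NULL, st.st_size) != st.st_size)
            err(1, "sendfile");

        /* Execute straight from the fd. MFD_CLOEXEC is fine for ELF
           binaries; it only breaks #! scripts, which still need the
           fd after exec. */
        char *child_argv[] = { "helper", NULL };
        fexecve(fd, child_argv, envp);
        err(1, "fexecve");   /* only reached on failure */
    }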
It's truly great for situations where APIs refuse to take anything other than files and you don't want to worry about cleanup. E.g. loading certs from memory into a Python OpenSSL context.
Just the other day I was wondering why execve is a syscall in the first place; it feels like a lot of what it's doing under the hood really belongs in userspace, as it doesn't require special privileges.
Definitely tricky. I solved it with a Python implementation by building up a big jump buffer, so that the moment I leave Python-land I copy from temporary buffers to the right addresses and then ultimately jump to the entry point of the newly loaded binary. It was tricky and took quite some debugging to get right, but it's proven rather solid now.
Naive question: what are the non-nefarious reasons for needing to do this? I get that people have used it to work around things (for good and ill), but what's the vanilla answer?
This is technically still a work-around, but in container runtimes (specifically runc and LXC) we do this to defend against a class of attacks where the container can overwrite the host runc binary (meaning the next time some container operation is done, the attacker's code is executed as root on the host)[1]. Doing this each time the container starts ensures any such attack will only overwrite its own (short-lived) copy of the binary, and it lets us do this without write access to any filesystem that allows exec.
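The rough shape of that defense, as a sketch (not the actual runc code; the ALREADY_CLONED guard variable is made up for illustration):

    #define _GNU_SOURCE
    #include <err.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    extern char **environ;

    int main(int argc, char *argv[]) {
        if (getenv("ALREADY_CLONED") == NULL) {   /* avoid a re-exec loop */
            int self = open("/proc/self/exe", O_RDONLY);
            if (self == -1) err(1, "open");
            struct stat st;
            if (fstat(self, &st) == -1) err(1, "fstat");

            int fd = memfd_create("cloned-binary",
                                  MFD_CLOEXEC | MFD_ALLOW_SEALING);
            if (fd == -1) err(1, "memfd_create");
            if (sendfile(fd, self, NULL, st.st_size) != st.st_size)
                err(1, "sendfile");

            /* Seal the copy so nothing (container included) can modify
               it, then continue from the sealed in-memory copy. */
            if (fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW |
                                       F_SEAL_WRITE | F_SEAL_SEAL) == -1)
                err(1, "fcntl");
            setenv("ALREADY_CLONED", "1", 1);
            fexecve(fd, argv, environ);
            err(1, "fexecve");
        }
        /* ...normal (now self-protected) runtime logic continues here... */
        return 0;
    }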
Unfortunately we don't use this all the time, because some Kubernetes unit tests started failing when we first added this protection (the size of the binary gets added to the memory usage of each container, which caused those tests to use more memory than before). Ironically, this exact protection would've protected us from Dirty COW and other such bugs, but it's disabled by default (instead we make a temporary read-only bind-mount that we then exec, which is slightly less safe but doesn't add ~10MB to every container's memory usage).
But the actual answer to your question is that this was not originally intended behaviour (when we mentioned to the mm and fs folks that we were doing this, they weren't happy), and there have been patches posted recently to make it something you have to explicitly opt in to.
[0]: https://github.com/impl/systemd-user-sleep/blob/666cf29871b1...
[1]: https://github.com/impl/systemd-user-sleep/blob/666cf29871b1...