When FFI function calls beat native C (nullprogram.com)
129 points by goranmoomin on April 4, 2022 | 35 comments



Can someone ELI5 what exactly happens when the external method call is "JIT"ed in a language like LuaJIT -- what does that mean?

I understand calling dlopen() and dlsym()

And I understand the idea of a PLT and its indirection

But this idea of something external to the JIT'ed program being JIT'ed I do not understand.

Does it mean it inlined the instructions of the external function into the JIT'ed code?


No, I don't think Lua (or the author) is inlining the external function.

What's being JIT'd is the Lua code (compiled into machine code), and where Lua needs to call a C function, it (apparently) just emits a `call the_external_fn` into the resulting assembly. That's a direct function call, so it's about as fast as you're going to get, and, somewhat counter-intuitively, it'll be faster than C, since nothing goes through the PLT or any other indirection. Just call the function.
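
To make that concrete, here's a rough sketch of the trick (mine, not the article's benchmark code): resolve the address at runtime with dlopen/dlsym, then emit a plain `call rel32` straight at it. It assumes x86-64 Linux, that a writable+executable page is allowed, and that the JIT page happens to land within +/-2GB of the target; the choice of `rand` as the callee is just for illustration.

  /* cc jit_direct_call.c -ldl   (older glibc needs -ldl) */
  #include <dlfcn.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void)
  {
      void *self = dlopen(NULL, RTLD_NOW);            /* global symbol scope */
      uint8_t *target = (uint8_t *)dlsym(self, "rand");

      uint8_t *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

      /* sub rsp,8 ; call rel32 ; add rsp,8 ; ret */
      int64_t rel = target - (buf + 9);               /* rel32 is relative to the end of the call */
      if (rel != (int32_t)rel) { fprintf(stderr, "target more than 2GB away\n"); return 1; }

      uint8_t code[] = { 0x48, 0x83, 0xEC, 0x08,      /* sub rsp, 8  (re-align stack) */
                         0xE8, 0, 0, 0, 0,            /* call rel32  (direct, no PLT) */
                         0x48, 0x83, 0xC4, 0x08,      /* add rsp, 8                   */
                         0xC3 };                      /* ret                          */
      memcpy(code + 5, &rel, 4);
      memcpy(buf, code, sizeof(code));

      int (*jitted)(void) = (int (*)(void))buf;
      printf("rand() via a JIT-emitted direct call: %d\n", jitted());
      return 0;
  }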


I think you need a layer of indirection there - you can't, in general, directly call a function in another library on x86-64 (it may be loaded more than 2GB away), whereas the library can call its own functions directly. So in general the library should have a speed advantage when calling itself.


Ahh got it, that makes sense -- thanks!


Note that this article is four years old. I was trying to figure out why the language I look after (Dart) looked so bad, but then I realized that the benchmark is using a completely obsolete (now removed) approach to FFI and running all the code in interpreted mode, rather than compiling it.


How does it run with the new FFI?


Is it possible to use Lua/LuaJIT in the opposite FFI direction (i.e. instead of invoking FFI functions, providing them)?

That is, upon DLL load (on Win32 they call it DllMain, I guess; I've seen it done with ctor/dtor functions on *nix), spawn the runtime and expose FFI functions?

http://www.drewtech.com/support/passthru.html

This spec is really big on FFI exposing functions. I always find it an edge case when trying to play with certain technologies (like the one in the article).


Lua (and therefore LuaJIT) can easily be driven from a C program, if that's what you mean.
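
In its simplest form, that direction looks something like this (a minimal sketch using the stock Lua C API; the script name and the `handle` function are invented for the example, and you link against liblua or libluajit as appropriate):

  #include <lua.h>
  #include <lauxlib.h>
  #include <lualib.h>
  #include <stdio.h>

  int main(void)
  {
      lua_State *L = luaL_newstate();
      luaL_openlibs(L);

      /* Load a script that defines a global function handle(n). */
      if (luaL_dofile(L, "handler.lua") != 0) {
          fprintf(stderr, "lua error: %s\n", lua_tostring(L, -1));
          return 1;
      }

      /* Call handle(41) and read the result back off the stack. */
      lua_getglobal(L, "handle");
      lua_pushinteger(L, 41);
      lua_call(L, 1, 1);
      printf("handle(41) = %lld\n", (long long)lua_tointeger(L, -1));

      lua_close(L);
      return 0;
  }

For the DLL scenario you'd create the state in DllMain / a constructor instead of main(), and each exported function would do the same push-arguments-and-lua_call dance.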

LuaJIT also allows you to pass a Lua function as a callback to a C function expecting one, subject to some limitations.

I'm not sure if either of those answers your question though.


Sorry for the late reply, lost the tab for a bit ._.

Something that might be relevant: LuaJIT can JIT-optimize FFI calls from Lua, but it can't/doesn't optimize calls into Lua made via C.

I might be able to go digging for the reference (at this point I figure it's best to just reply for now; not sure if you'll see this), but I've read that the recommended approach to this problem is to move the main `for(;;)` / `while(1)` loop into Lua and have LuaJIT repeatedly FFI-call C, because that's the path that can go the fastest.


You could export a varargs function and use this to call the appropriate Lua function, I guess.


Essentially, a JIT does static linking, just at runtime.


The better takeaway is that on ELF/PIC platforms a call to a dynamically linked function is somewhat more expensive than one would assume.

Modern CPU architectures can prefetch/speculate across an indirect jump/call, yet an indirect call in a tight loop is still comparatively expensive (the idea behind the PLT is that, in contrast to an indirect call through the GOT, it more clearly signals the intent to the CPU). On the other hand, the tight loop is certainly important here, and this effect will not be so pronounced in any kind of practical code (because of BTB pressure).

Also, this is an interesting observation for the various discussions about the overhead of late binding ("virtual" in C++), since a similar overhead is already there for almost any cross-object function call in a PIC dynamic binary.


Previously: https://news.ycombinator.com/item?id=17171252

(Linked from the article)


It’s been a while since I’ve worked with the PE format, but isn’t this the purpose of a fixup table? The executable loader can patch the machine code to make direct calls. A step further, you have LTCG, which may copy and paste the actual code and recompile it inline.

So this is a Linux or *nix specific quirk, rather than a C quirk. Apologies if my memory isn’t accurate.


This is about different tradeoffs. On a typical Unix ELF platform the idea is to preserve memory by sharing as much of the _shared_ object code as possible, while NT emphasizes the _dynamic_ linking. Notably, the approach of just giving up, unsharing, and relinking the PE/COFF image when it does not fit at its preferred load address is not the only one implemented by Windows variants: Windows 9x was designed for memory-starved environments and actually implements the PE/COFF linker in the kernel, so it can relink on a per-page basis when swapping a page in (W9x is a stupidly complex, ridiculously brilliant hack that solves weird problems in wonderful ways), and most builds of Windows CE will just error out when relinking would be required. In fact, the Windows SDK contains a tool that relinks a set of PE/COFF DLLs so as to minimize load-address conflicts (for NT/9x it is a performance optimization; for CE it is often a necessity).

On a typical ELF platform with dynamic linking you get PIC-compiled binaries with the added overhead of the GOT and PLT. One thing to keep in mind is that on many traditional Unix RISC platforms there is something similar to the GOT and PLT even in statically linked binaries (no word-sized immediates on RISC platforms…).


ELF relocations can also patch the machine code if I understand correctly.


For similar reasons, PyPy's Python implementation can outperform C.

https://www.pypy.org/posts/2011/02/pypy-faster-than-c-on-car... - JIT'ing across compilation units

https://www.pypy.org/posts/2011/08/pypy-is-faster-than-c-aga... - JIT'ing % interpolation.

(Wow, those are 11 years old. I remember when PyPy was a new project.)


Are direct calls really all that much faster than indirect calls on current x86 archs? I was under the impression that it’s more or less the same on the current generation of CPUs. Those CPUs do a decent job of branch predicting indirect calls, especially in a micro benchmark loop. The BTB generally works well.


The article has benchmark results that quantifiably establish direct calls outperforming indirect calls.


It’s not an apples to apples comparison unfortunately. He’s using custom assembly for the direct call benchmark but C code for the indirect benchmark.

The C code contains no optimization annotations either, the compiler could be inlining the indirect benchmark and/or devirtualizing the indirect call itself.
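
For what it's worth, a like-for-like pure-C version is possible: keep both call sites in C, and stop the compiler from cheating with noinline plus a volatile function pointer. A rough sketch (not the article's harness; GCC/Clang attributes assumed):

  #include <stdio.h>
  #include <time.h>

  __attribute__((noinline)) static long work(long x) { return x + 1; }

  static long (* volatile indirect)(long) = work;   /* reloaded on every call */

  static double secs(void)
  {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec + ts.tv_nsec * 1e-9;
  }

  int main(void)
  {
      enum { N = 100000000 };
      long acc = 0;

      double t0 = secs();
      for (long i = 0; i < N; i++) acc = work(acc);      /* direct:   call work       */
      double t1 = secs();
      for (long i = 0; i < N; i++) acc = indirect(acc);  /* indirect: load, call *reg */
      double t2 = secs();

      printf("direct %.3fs, indirect %.3fs (acc=%ld)\n", t1 - t0, t2 - t1, acc);
      return 0;
  }

Even this only compares direct vs. indirect calls within one object; it doesn't recreate the PLT path, but it at least rules out inlining and devirtualization.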


The benchmark quantifiably shows that direct calls are faster than indirect calls, which was your original question. Could a hypothetical C compiler transform an indirect call into a shared library into a direct call? Maybe, but that is a different question from your original one about the performance of direct vs. indirect calls on modern x86 architectures, where this benchmark shows they are not the same.


Unless he uses the same custom assembly except with an indirect call, it’s not a good comparison. We can’t be sure the increase in runtime is due to the indirect call.


All the details are right in the article as well as a link to a git repo so I'm not sure what there is to speculate about. If you have an issue with the actual benchmark you can certainly point it out, but otherwise you're basically asking us to restate the contents of the article when the article does a much better job of explaining these details.


Hmm I’m not sure you’re responding to what I’m saying. The C code is not an apples to apples comparison with the custom assembly when comparing the speed of an indirect call to a direct call. Do you deny that?


Yes I do deny that, especially since the article literally addresses this issue explicitly and takes care to avoid that, along with a git repo that you can use to verify this for yourself. If you have a specific criticism to make then you should go ahead and point it out in a non-vague manner instead of speculating.


> especially since the article literally addresses this issue explicitly and takes care to avoid that

Can you cite where in the article it addresses the fact that the assembly snippet is not an apples to apples comparison with the C code?

> If you have a specific criticism to make

Pointing out that the assembly is not an apples to apples comparison with the C code is a specific criticism.


No. That is not how shared libraries work.


> If the JIT code needed to call two different dynamic functions separated by more than 2GB, then it’s not possible for both to be direct.

Well, you can do

  MOV rax, 0x1122334455667788
  PUSH rax
  RET
in this case. Still direct, just a bit slower. Wonder if modern CPUs speculate past this construction.


RET is typically special-cased in the speculation logic. I suspect that this construction will have a significantly larger misprediction rate (on some uArchs quite possibly 100%) than a straight indirect jump.


Considering this is more an ELF/Linux thing than a C thing, it means there is room for performance improvement in FFI/shared-object-heavy processes on Linux. I wonder why nobody has cared to improve it.


The answer is in the article

> The downside to this approach is slower loading, larger binaries, and less sharing of code pages between different processes. It’s slower loading because every dynamic call site needs to be patched before the program can begin execution. The binary is larger because each of these call sites needs an entry in the relocation table. And the lack of sharing is due to the code pages being modified.

It probably could be improved by essentially doing the same thing LuaJIT is doing and inventing a JIT-style code-loading mechanism, but it would be hard to get much buy-in for that: you would need to change the ELF standard and get the likes of GCC and LLVM on board with the new paradigm.


There could also be problems on some architectures. On arm64 the branch-with-link instruction can only jump +/-128MB (https://developer.arm.com/documentation/dui0802/a/A64-Genera...), so if you want to use BL instead of BLR for performance reasons you need to make sure all the shared libraries are loaded within +/-128MB of the code calling them.

GCC has an option to reduce the overhead, -fno-plt (https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html). On x86_64 it then just does call [rip + got_offset]; this is still an indirect call, but it removes the extra hop through the PLT stub.
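
Roughly, for a toy call site like this (names invented), you can see the difference in the generated code:

  /* toy.c -- expected shape of the output for x86_64 GCC (exact syntax
   * varies by version/target):
   *   cc -O2 -fPIC -S toy.c           ->  call  external_fn@PLT
   *   cc -O2 -fPIC -fno-plt -S toy.c  ->  call  *external_fn@GOTPCREL(%rip)
   */
  void external_fn(void);   /* assumed to live in some shared library */

  int calls;                /* side effect, so the call isn't turned into a tail call */

  void caller(void)
  {
      external_fn();
      calls++;
  }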


Because nobody cares about performance. As far as I remember, on Linux every C symbol not explicitly inline or static defaults to being visible and dynamically linkable. You can tell GCC to treat every symbol as hidden by default, but at that point every third-party header would have to explicitly declare its exported symbols as visible. That whole dllexport/dllimport mess Windows binaries require? Effectively, exporting everything is the implicit default on Linux.
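
The opt-in style looks roughly like this (a sketch; the macro and symbol names are made up):

  /* lib.c -- build with: cc -O2 -fPIC -fvisibility=hidden -shared lib.c -o libdemo.so
   * Only the symbol marked "default" lands in the dynamic symbol table; the
   * rest bind locally and can be reached with direct calls, no PLT needed. */
  #define DEMO_API __attribute__((visibility("default")))

  static int twice(int x) { return 2 * x; }             /* never exported (static)        */

  int helper(int x) { return twice(x) + 1; }            /* hidden by -fvisibility=hidden  */

  DEMO_API int demo_entry(int x) { return helper(x); }  /* the one exported symbol        */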


Does anyone know what this benchmark would look like on Windows?


Yes. My JIT, which does the same thing, was 10x faster on Windows than on Linux. Apparently the ELF hash lookup beats the linear (or whatever primitive) search on COFF.



