For those wondering why luajit is faster than C/C++/Rust, that's probably because it does direct calls to the function, while C/C++/Rust go through the PLT (procedure linkage table).
I don't have luajit to validate this, but changing the C test to dlsym() the symbol first and use that in the loop instead makes the test take 727ms instead of 896ms when going through the PLT on my machine.
Edit: confirmed after installing luajit, got the same time as the modified C test.
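For reference, here is a minimal sketch of the dlsym() variant I mean. The benchmark's function is plusone() and the loop count 500M comes from the thread; the library name and the error handling are mine, not necessarily what the repo uses.

#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    /* "./libplus.so" is an illustrative name for the benchmark's shared object. */
    void *lib = dlopen("./libplus.so", RTLD_NOW);
    if (!lib) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    /* Resolve the symbol once up front... */
    int (*plusone)(int) = (int (*)(int))dlsym(lib, "plusone");
    if (!plusone) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    int x = 0;
    const int count = 500000000;
    while (x < count)
        x = plusone(x);   /* ...then call through the pointer: no PLT stub in the loop */

    printf("%d\n", x);
    dlclose(lib);
    return 0;
}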
I'm still a bit surprised that the extra jmpq is more expensive than all those extra instructions luajit is executing.
EDIT: For those curious, Lua handles all numbers as floating-point; there are no ints in Lua. That's what all the extra instructions are for. Interestingly it looks like LuaJIT is storing the loop variable x as int (in the eax register) and just converting it to floating-point for comparison against "count" each loop. Interesting bending of the rules there.
So, running under perf stat tells me the luajit version executes far more instructions (5M vs. 3M for my patched C version), but both finish in the same number of cycles (2.5M), so the luajit version retires more instructions per cycle. The unpatched C version uses 3.5M instructions and 3.5M cycles, which is half the instructions per cycle of the luajit version.
> For those curious, Lua handles all numbers as floating-point; there are no ints in Lua.
Same as in JavaScript. I'm surprised by that. On the surface, having just floats and no ints sounds like a pretty dumb idea. What am I missing? Is there a reasonable rationale for such a choice?
Well, doubles can represent all the numbers a 52 bit int can, so there's not a huge reason to not use doubles for whole numbers. It also means that new programmers don't get confused when they do 1/3 and get 1, and the language can be significantly simpler when there's only one number type. Also, when you try to store numbers bigger than 2^52, doubles degrade relatively gracefully, while ints wrap around.
There are definitely downsides too, but it's a trade-off, not an exclusively bad decision.
> Well, doubles can represent all the numbers a 52 bit int can, so there's not a huge reason to not use doubles for whole numbers.
Not exactly.
> It also means that new programmers don't get confused when they do 1/3 and get 1
Is this (actually, getting 0 from 1/3) really more surprising than adding 0.1 and 0.2 and getting 0.30000000000000004?
Floats are mostly inexact and introduce errors with almost every operation. It's something you have to constantly keep in mind when doing any kind of math with meaningful consequences. I for one think that exact types are both more useful in many practical cases and significantly simpler to use.
All floating-point integer operations with values smaller than 2^52 are exact. Most floating-point operations on small values with few significant figures are precise.
Fractions are the exception: there are fractions like 1/10 that cannot be represented precisely in base two, much like 1/3 cannot be represented precisely in base ten.
Beyond that, there's the usual issue of pushing out the least significant bits, but that's unavoidable when working in finite precision.
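A quick C illustration of both points (not from the thread, just standard IEEE 754 double behaviour):

#include <stdio.h>

int main(void) {
    double big = 9007199254740992.0;   /* 2^53 */
    printf("%.1f\n", big - 1.0);       /* 9007199254740991.0 -- whole-number math is still exact here */
    printf("%.1f\n", big + 1.0);       /* 9007199254740992.0 -- 2^53 + 1 is not representable, rounds back down */

    printf("%.20f\n", 0.1);            /* 0.10000000000000000555... -- 1/10 has no exact base-two form */
    printf("%d\n", 0.1 + 0.2 == 0.3);  /* 0 -- the rounding shows up in comparisons */
    return 0;
}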
Lua 5.3 introduced actual 64-bit integers, but those will probably never land in LuaJIT because it's deeply married to the NaN-tagging trick, which limits unboxed values to approximately 48 bits.
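For anyone unfamiliar with the trick, here's a rough sketch of NaN tagging in C. This is a generic illustration, not LuaJIT's actual layout; a real implementation also has to canonicalise genuine NaNs produced by arithmetic so they don't collide with the tag space.

#include <stdint.h>
#include <string.h>

typedef uint64_t Value;

#define QNAN_MASK    0xFFF8000000000000ULL  /* sign + exponent + quiet bit: the quiet-NaN space used for tags */
#define PAYLOAD_MASK 0x0000FFFFFFFFFFFFULL  /* only 48 bits left over for a pointer or small-int payload */

static Value box_double(double d) {
    Value v;
    memcpy(&v, &d, sizeof v);               /* ordinary doubles are stored as-is */
    return v;
}

static int is_double(Value v) {
    return (v & QNAN_MASK) != QNAN_MASK;    /* anything outside the NaN space is a plain double */
}

static Value box_ptr(void *p) {
    return QNAN_MASK | ((uintptr_t)p & PAYLOAD_MASK);
}

static void *unbox_ptr(Value v) {
    return (void *)(uintptr_t)(v & PAYLOAD_MASK);
}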
> This attribute suppresses lazy symbol binding for the function. This may make calls to the function faster, at the cost of extra program startup time if the function is not called during program startup.
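The quote doesn't name the attribute, but as an illustration, GCC has __attribute__((noplt)) (the counterpart to -fno-plt), which loads the callee's address from the GOT at the call site instead of bouncing through a lazily bound PLT stub. Assuming the benchmark's plusone() function:

/* Illustrative only: noplt on the declaration avoids the PLT for this symbol. */
__attribute__((noplt)) extern int plusone(int x);

int run(int count) {
    int x = 0;
    while (x < count)
        x = plusone(x);   /* indirect call through the GOT entry */
    return x;
}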
> I don't have luajit to validate this, but changing the C test to dlsym() the symbol first and use that in the loop instead makes the test take 727ms instead of 896ms when going through the PLT on my machine.
Isn't this what every C or C++ library that wraps another with dlopen/dlsym does?
This is what Visual C++'s __declspec(dllimport) is for. :-) On Windows, GCC also seems to handle it correctly, along with its own synonym __attribute__((dllimport)).
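Something like this, again assuming the benchmark's plusone() function (the #ifdef is just for illustration):

/* With dllimport the compiler can emit an indirect call through the import
   address table directly, instead of calling a local thunk that then jumps
   through the IAT. */
#ifdef _WIN32
__declspec(dllimport) int plusone(int x);
#else
extern int plusone(int x);
#endif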
(The Win32 example is an almost expression-for-expression translation of the C sample from MSDN called "Your First Windows Program". It re-creates all needed Win32 data types using the FFI macro language.)
The FFI type system has a richly nuanced declarative mechanism to specify data passing semantics (think: who owns what pointer, who must free what) at every level of nesting.
It has bitfields, unions, and enums (typable to any integral type). It supports alignment and packing in structs and has special integral types that encode in little- or big-endian order. It handles Unicode, ASCII, and UTF-8 string encodings, and understands null termination as well as character arrays that are not null terminated. Bitfields can be signed or unsigned, and their underlying cell type can be specified.
Pointer types can be tagged with symbols for a measure of type safety, to catch situations when a widget is passed to an API that expects a doohickey and such.
Because Wren is a bytecode-interpreted dynamic language, much of the overall time is probably spent just executing the surrounding benchmark code, i.e. in:
FFI.start()
while (x < count) {
  x = FFI.plusone(x)
}
FFI.stop()
It probably spends most of the total elapsed time on:
FFI.start()
while (x < count) {
  x = ...
}
FFI.stop()
It would be worth running a similar benchmark with just the FFI part removed and then subtracting that from the time. (And do the same thing for the other languages too, of course.)
Thanks for the link. From this HN post, I'm baffled at how any other language could beat C at C FFI. (Hopefully your article explains how. Reading it now.)
No, it is not. It still makes a call to the function in the shared object; it's just that the way this call is made is more efficient than C's.
"Inlinig" compiled assembly from a shared object is basically impossible because you can't just move machine code around arbitrary since it may access arbitrary additional code or data at relative offsets. If you move it, you'll break it.
That was my takeaway, but I'm not sure to what extent. You could then do peephole optimization afterwards. You'd lose out on the memory savings from shared objects, but that's a trade-off I'm cool with.
I think it is because of the way Go does the stack, which is unusual. Perhaps that is done to facilitate green threads. It is too bad, because Go has many properties that are good for gamedev - AOT-compiled executables, some exposure to pointers (any type can be a value or reference at your whim), low-latency GC, and good SDL2 and OpenGL bindings. But the high cost to call into those bindings is a downside.
There is also a philosophy among the Go maintainers that you shouldn't call into C unless you have to. Unfortunately you can't draw pixels efficiently (at all?) in pure Go, so...
> I think it is because of the way Go does the stack, which is unusual. Perhaps that is done to facilitate green threads.
Kinda. C's stacks are big (1-8MB by default depending on the system, IIRC), and while that's mostly virtual memory, Go still doesn't want to pay for a full C stack per goroutine. Plus, since it has a runtime, it can make different assumptions and grow the stack dynamically if necessary.
So rather than set up a C stack per goroutine, Go sets up its own stack (initially 8K, reduced to 2K in 1.4), and if it hits a stack overflow it copies the existing stack to a new, larger one (similar to hitting the capacity limit of a vector).
But C can't handle that: it expects enough stack to be there, and it has no idea where the stack ends or how to resize it (the underlying platform just faults the program on stack overflow). So you can't just jump to C code from Go code; you need an actual C stack for things to work, and that makes every C call from Go very expensive.
Rust used to do that as well, but decided to leave it behind as it went lower-level and fast C interop became more important than built-in green threads.
Erlang does something similar to Go (by default a process has ~2.6K allocated, of which ~1.8K is for the process's heap and stack) but the FFI is more involved (and the base language slower) so you can't just go "I'll just import cgo and call that library" and then everybody dies.
Really? That's not my understanding; it's much smarter than that. Goroutines steal memory from the current pthread's stack, that is, the machine stack. The problem with calling C from a goroutine is that whilst you have a real C stack right there, other goroutines expect to be able to steal memory from it, and once you call C you don't know how much stack C is going to use. So whilst C is running, goroutines cannot allocate stack memory, which means they cannot call subroutines. The only way to fix that is to give the C function being called its own stack.
The problem you're going to have is that if 10K goroutines all call PCRE you need 10K stacks, because all the calls are (potentially) concurrent.
What makes Go work is that the compiler calculates how much local memory a goroutine requires, so after a serialised bump of the stack pointer the routine cannot run out of stack. Serialising the bump between competing goroutines is extremely fast (no locks required). Deallocation is trickier; I think Go uses copy collection, i.e. it copies the stack when it runs out of address space on the stack, NOT because it's out of memory (the OS can always add to the end), but because the copying compacts the stack by not copying unused blocks. It's a stock-standard garbage collection algorithm used in a novel way.
The core of Go is very smart. Pity about the rest of the language.
> Really? That's not my understanding; it's much smarter than that. Goroutines steal memory from the current pthread's stack, that is, the machine stack.
There is no "machine stack", and yes in the details it tries to set up and memoise C stacks, but it still need to switch out the stack and copy a bunch of crap onto there, and that's expensive.
I see there is a pull request to test the Erlang BEAM (via Elixir) with seemingly similar results to Go. To me that's pretty interesting given that Erlang's BEAM is a Virtual Machine and Go is not.
Additionally, thanks for the pointer to Tup, sounds like a very interesting Make alternative for some of my larger projects.
Is there a reason Dart performs so badly here? And wouldn't that have a big impact on Google's new Flutter project that they're building an OS around? I'd assume Flutter would need to make an FFI call for each access to the Chromium graphics stack it is built on.
If we use the C performance ratio between the machines to adjust for machine-to-machine differences, SBCL would be around 1600 in the original benchmark, between D and Java.
I built out a ruby-ffi test immediately upon seeing this. I want to put a little more effort into it (I need to run some errands and spent all of 5 minutes on this), but... 2 samples/runs, 500M calls: 51050, 50853. No surprises there, really.
edit/ Ran this on my desktop (gaming rig) i7-4790k, 16GB RAM, gcc 7.3.1 20180406, Arch on Windows Subsystem for Linux. I have a machine closer in specs to the original authors that I'll test it against later today (hopefully).
This is very cool, but note that it is a microbenchmark comparing the overhead of calling plus(int, int). This is a very specific case of FFI that is easy and simple.
For my work on the Oil shell, I care more about moving strings back and forth across the VM boundary (not to mention bigger objects like records and arrays of records). There are many more design choices in that case, and I suspect the results will also look different.
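As an illustration of that extra design space (a sketch of mine, not Oil's code): even a single string-returning FFI call forces a choice about who allocates and who frees.

#include <stdlib.h>
#include <string.h>

/* The C side allocates; ownership passes to the caller, which must release
   the result with the paired free function. Other designs (caller-supplied
   buffers, interned strings, etc.) trade convenience for fewer copies. */
char *concat_copy(const char *a, const char *b) {
    size_t la = strlen(a), lb = strlen(b);
    char *out = malloc(la + lb + 1);
    if (!out) return NULL;
    memcpy(out, a, la);
    memcpy(out + la, b, lb + 1);   /* copies the trailing NUL too */
    return out;
}

void concat_free(char *s) {        /* paired release for the VM side to call */
    free(s);
}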
Personally, the overhead doesn't bother me that much; you can manage it by moving more work into C, cutting down on the number of FFI calls. What steers me away from cgo is that it complicates the build/release process and makes it harder to write cross-platform code. Go makes these things painless, so there's a strong incentive to not stray outside the Go toolchain.
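A sketch of what "moving more work into C" looks like on the C side: one FFI call that runs the whole loop instead of one call per iteration. plusone_n is an illustrative name, not part of the benchmark.

int plusone(int x) { return x + 1; }

/* The host language calls this once; the 500M-iteration loop stays in C,
   so the per-call FFI overhead is paid only once. */
int plusone_n(int x, int n) {
    for (int i = 0; i < n; i++)
        x = plusone(x);
    return x;
}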
Go does not use the C stack, so calling into C requires setting up a C stack, trampolining from the Go context to the C context, making the call and coming back.
This is why Go libraries generally don't wrap native libraries, and Go software only calls into C when it really has no other choice and/or has very "coalesced" APIs available (a single function call doing a lot of work, whereas languages like Python are happy making lots of very small calls to C).
And why Go itself does not use the platform-provided standard libraries and performs syscalls directly, breaking any time those syscalls change[0], which is not that rare, because Linux is the only platform where raw syscalls are the kernel's stable API; on most other systems the libc is the platform's API. And even on Linux it breaks from time to time[1], because Go doesn't want to link against libc but still wants to benefit from the vDSO[2].
I believe most if not all of these languages have a native C compatibility layer. Java has JNI. Zig and Rust inherently use LLVM. Go has cgo. Nim compiles to C. LuaJIT has its own FFI notation. Etc.
This would be worth mentioning on the project issue page, methinks. The author might not want to do the work but it's a very good thing to keep in mind for anybody looking at the project and making judgement calls off of it.
I believe you're misreading the numbers. They're the "elapsed time in millis", so lower is better. The only result better than C itself seems to be LuaJIT, which is quite impressive. I wonder how it achieves that.
OP poses a good question and I don't think it's a simple case of GC vs. no GC. In Nim's case it might have something to do with not using the C int types (PR already open to fix this: https://github.com/dyu/ffi-overhead/pull/6)
Higher-level languages know more about the context and the stack. You also need a good inliner.
You can do fast calls when you know exactly which registers you need to save and which you don't. With pure generic C you need to save most of them; with a more high-level language, only those really needed, or even none when you use the stack instead.
Common Lisp, for example, usually leads the FFI benchmarks.
In any case, C or higher-level language, the caller knows which registers are in use and need saving (to the extent they intersect with the ABI's clobbered register set).
In any case the caller usually doesn't know what registers the callee will use and so has to rely on the ABI.
Of course if things can be inlined its a different story, but this is about FFI, i.e., no inlining.
But several VMs/ABIs don't clobber the registers themselves (i.e. all locals are volatile), so the external C function can use all the registers it wants. Hence the call prologue doesn't need to save much. This also helps with GC pauses: there's no need to scan registers. We had great success with such an ABI (potion, a JITted Lua variant).
Native compilers with a better ABI than C also have advantages: Fortran (no need to save the return pointer), Common Lisp (multiple return values, and the FFI can be tuned much better than the usual FFIs), Haskell.
The biggest problem is still callbacks, calling functions from the FFI back into us. This is costly, and the FFI declaration needs to know about it.
And closures (local upvalues on the stack) are only properly supported with Haible's avcall.
And some newer VMs, such as PyPy or Graal, solved the expensive FFI/C-extension problem by redoing the external C part. They convert the called C function to their own format, and so can inline it or know which registers it uses.
I'm not following you. Clearly in a pure JIT environment or with custom calling conventions you can do better than a typical "blind" C call restricted by the platform ABI - but we're talking about FFI calling into a "plain" shared object, right?
Such functions obey the platform ABI, so the caller needs to save exactly every callee-clobbered register they are using.
Also, if the external C function wants to use any callee-saved registers, it needs to save them. It doesn't know that it's being called by a VM that allows FFI functions to clobber everything (if that's what you're saying).
So yeah, there is all kinds of great stuff you can do with respect to function calls within a VM, or when you are working outside the platform C ABI (e.g., if you invented your own ABI), but it isn't clear how that relates to making fast FFI calls from higher level languages.