Partially Applied Functions in C (tia.mat.br)
105 points by mmphosis on July 21, 2013 | 34 comments



Interesting hack. This is roughly equivalent to writing a dynamic code generator (JIT) for generating tiny little trampolines, except it's architecture-independent. It achieves portability by patching up some existing code (for an unknown instruction set) rather than generating code from scratch (which would require you to understand the instruction set).

It would need some changes to be truly architecture-independent; for example, it should use memcmp() instead of this idiom, which can raise SIGBUS on some platforms due to the unaligned access:

    if (*((intptr_t *)code) == look) {
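A minimal alignment-safe alternative might look like this (the names code and look follow the snippet above; the surrounding scan loop is assumed):

    /* Sketch: byte-wise comparison avoids the unaligned load entirely. */
    #include <stdint.h>
    #include <string.h>

    static int matches_placeholder(const unsigned char *code, intptr_t look)
    {
        return memcmp(code, &look, sizeof look) == 0;
    }
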
I used a similar hack once when I wanted to create an ELF object file at runtime, which is necessary to use the GDB JIT interface (http://sourceware.org/gdb/onlinedocs/gdb/JIT-Interface.html). I didn't want to implement the whole ELF format myself, but I needed this ELF file to point to some JIT code that is located at an address that is only known at runtime. I achieved this by generating the ELF object file with gcc at build time, but used a similarly distinctive constant for the address so I could patch it up at runtime.

If you wanted to do this through straight-up codegen (instead of patching) this would be really easy with DynASM, and you could use my tutorial here as a guide:

http://blog.reverberate.org/2012/12/hello-jit-world-joy-of-s...


It isn't truly arch-independent as long as it assumes PARAMETER_CONSTANT and FUNCTION_CONSTANT will be stored as direct immediates in the generated code. On some archs, for instance, 0xFEEDBEEF might be too big a constant, and the compiler would then be forced to move it into a register (or a stack slot) in parts.

Edit: and of course, you run the risk that on some archs, 0xFEEDBEEF is actually a valid encoding for some instruction. :)


For an example of CPUs that behave differently, look at RISC CPUs. 32-bit PowerPC, for example, would translate an immediate long load into a 'load short into high word and zero out low word' plus a signed immediate addition (it would load $DEAE first, then add -$4111 to get $DEADBEEF).
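
A hypothetical C illustration (not from the article) of how that split works arithmetically: addi sign-extends its 16-bit immediate, so when the low half has its top bit set, the high half gets bumped by one to compensate.

    /* Hypothetical: split a 32-bit constant the way lis/addi would. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t target = 0xDEADBEEFu;
        int32_t  lo = (int16_t)(target & 0xFFFFu);     /* -0x4111, sign-extended */
        uint32_t hi = (target - (uint32_t)lo) >> 16;   /* 0xDEAE                 */
        uint32_t rebuilt = (hi << 16) + (uint32_t)lo;  /* wraps back to target   */
        printf("hi=0x%X lo=%d rebuilt=0x%X\n",
               (unsigned)hi, (int)lo, (unsigned)rebuilt);
        return 0;
    }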

The list of problems is far longer, by the way. This code makes assumptions about pointer size (I don't think it will run on x64 with common ABIs).

There also is no guarantee that function pointers point to the memory where the function's code can be found (there could be a trampoline in-between, or a devious compiler writer could encrypt function addresses).

Neither is there a guarantee that functions defined sequentially get addresses that are laid out sequentially (there is no portable way to figure out the size of a function in bytes).

Finally, I don't think there is a guarantee that one can read a function's code (its memory could be marked 'execute only').

I guess those more familiar with the C standard will find more portability issues.


ARM (quite common these days) will sometimes do this as well, as the immediate value of an instruction has to fit inside the fixed-width instruction, and the instruction itself is only 32 bits (some of which are needed for the opcode). While it is slightly more likely you will see a PC-relative load, that requires an extra data fetch; so, instead, a "movt" (move top) instruction is used to set the upper 16 bits of a register after first setting the lower 16 bits. This requires just as much space as the load plus its separate data literal, and runs entirely out of the instruction cache.


What I saw an ARM compiler do once was put the 0xDEADBEEF into a data segment (in the 'text' area, I think) and then load it from there.

Much like it would load a string, but since it's only 4 bytes, it can load it in one step.


Not certain if you are disagreeing, so to clarify: you can either do a load like that or two moves. The load requires one instruction word and one data word, so two words. The moves require two instruction words, so also two words.

The load, however, requires a data fetch. This fetch has to come from somewhere nearby, as the load instruction cannot offset very far, and so it is going to be in the same segment as the code (called "text" for historic reasons: this has nothing to do with strings, which will be stored in the "data" segment, or sometimes even a "strings" segment), and most of the time will be directly after (or, for long functions, inside) the code for that function.

This is a trade-off: moves are fast (cycle counts of instructions are not all the same, so "single step" doesn't mean much), have better guarantees that they will always produce the same result (code changing has much heavier penalties than data changing), and avoid using up an asynchronous memory access that other parts of your code might be saturating (although I honestly am not certain whether ARM does this like other architectures do). It thereby comes down to circumstance, configuration, and the whims of the people who wrote your compiler as to whether you will get mov+movt or a single ldr pc,.X+data.


I'm not disagreeing, just mentioning

"It thereby comes down to circumstance, configuration, and the whims of the people who wrote your compiler as to whether you will get mov+movt or a single ldr pc,.X+data."

Yes, that sums it up nicely.


I'm very interested in how dynamically generated code interacts with instruction and micro-op caching on modern processors. I'm working on intersecting lists of compressed integers really quickly, and JIT-like approaches seem to be the frontier for fast decompression schemes. It seems likely you've thought about this. Any suggestions?

http://stackoverflow.com/questions/17738154/how-can-i-write-...


> I'm very interested in how dynamically generated code interacts with instruction and micro-op caching on modern processors.

My basic mental model is that whenever you write to an executable page you pay a significant overhead. Basically all of the performance caveats that apply to self-modifying code would apply here; a good starting point might be http://en.wikipedia.org/wiki/Self-modifying_code

In my application the code generation is a one-time up-front cost that is easily amortized over the subsequent execution, so I haven't had a need to explore this question in more depth.
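
For the curious, the usual way to keep that overhead one-time is the write-then-flip pattern: build the code in a plain writable mapping, then switch the page to read+execute once and never touch it again. A minimal POSIX-ish sketch, assuming mmap/mprotect and MAP_ANONYMOUS are available:

    /* Sketch: generate or patch code in a writable page, then make it executable. */
    #include <string.h>
    #include <sys/mman.h>

    void *make_executable_copy(const void *code, size_t len)
    {
        void *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED)
            return NULL;
        memcpy(page, code, len);                     /* patch/generate here */
        if (mprotect(page, len, PROT_READ | PROT_EXEC) != 0) {
            munmap(page, len);
            return NULL;
        }
        return page;                                 /* now callable (cast to a fn ptr) */
    }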

> JIT-like approaches seem to be the frontier for fast decompression schemes.

Interesting, I have been feeling lately like this might be an area where JIT-like approaches could yield a big benefit, but hadn't seen any actual work in this area. Do you have references to anyone doing any kind of work like that?


I don't know if this counts, but the Linux kernel famously uses a JIT to filter network packets:

https://github.com/torvalds/linux/blob/master/arch/x86/net/b...


I thought this is what trampolines are:

http://en.wikipedia.org/wiki/Trampoline_(computing)

http://gcc.gnu.org/onlinedocs/gcc/Nested-Functions.html

FFcall can be used to do similar things: http://www.gnu.org/software/libffcall/

It has four subprojects:

* avcall - build a C argument list incrementally and call a C function on it.

* vacall - C functions called with variable arguments

* trampoline - closures as first-class C functions

* callback - closures with variable arguments as first-class C functions

[update] I had originally used an outdated link: http://www.haible.de/bruno/packages-ffcall.html
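
A rough sketch of the callback module, from memory of the GNU ffcall docs (the exact macro and return-type details are an assumption; check the manual before relying on this):

    /* Assumed API of ffcall's callback module: alloc_callback() plus the
       vacall-style va_start_int / va_arg_int / va_return_int macros. */
    #include <callback.h>
    #include <stdio.h>

    /* The wrapped function receives the bound data plus the dynamic args. */
    static void adder(void *data, va_alist alist)
    {
        int bound = *(int *)data;
        va_start_int(alist);              /* declare an int return value */
        int x = va_arg_int(alist);        /* fetch the caller's argument  */
        va_return_int(alist, bound + x);
    }

    int main(void)
    {
        static int bound = 40;
        int (*f)(int) = (int (*)(int))alloc_callback(&adder, &bound);
        printf("%d\n", f(2));             /* prints 42 */
        return 0;
    }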



Didn't know about GNU ffcall until now, so thanks, very useful stuff.


I use TCC for this type of thing. TCC is a small, fast C99 compiler that is also a library (so it can be embedded in an application). It can compile code into memory which can then be called (in fact, the same code can be re-compiled multiple times, each time resulting in new code).

On the down side, it only supports x86, x86-64 and ARM architectures.
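
A minimal libtcc sketch (API as of the 0.9.2x era; treat exact signatures, e.g. tcc_relocate's second argument, as assumptions):

    #include <stdio.h>
    #include <libtcc.h>

    /* Compile a tiny function into memory and call it. */
    int main(void)
    {
        const char *src = "int add(int a, int b) { return a + b; }";

        TCCState *s = tcc_new();
        tcc_set_output_type(s, TCC_OUTPUT_MEMORY);
        if (tcc_compile_string(s, src) < 0)
            return 1;
        tcc_relocate(s, TCC_RELOCATE_AUTO);       /* allocate and link in memory */

        int (*add)(int, int) = (int (*)(int, int))tcc_get_symbol(s, "add");
        printf("%d\n", add(40, 2));               /* prints 42 */

        tcc_delete(s);
        return 0;
    }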



Trampolines do this.

A good example is that this happens in Objective-C all the time when you get a block IMP. The closure from the block is wrapped into a new trampoline, so you can use the block as any normal IMP but with the closure's state stored correctly.

You can actually do this really easily and in a cross platform way with libffi.

What is interesting is that if you are on a secured platform, you might not be able to mark a page as both executable and writable, so this isn't always possible. On iOS we had to do an amazing hack with a .S file and the vm_remap call on Darwin. Libffi uses a cute technique as well.
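
For reference, a sketch of the libffi closure route, using its standard closure API (the atexit tie-in is my own illustration, not from the article):

    #include <stdio.h>
    #include <stdlib.h>
    #include <ffi.h>

    /* libffi hands the bound user_data to this generic handler, so the
       generated function itself needs no arguments. */
    static void handler(ffi_cif *cif, void *ret, void **args, void *user_data)
    {
        (void)cif; (void)ret; (void)args;
        printf("context: %s\n", (const char *)user_data);
    }

    int main(void)
    {
        void (*bound_fn)(void);
        ffi_closure *closure = ffi_closure_alloc(sizeof(ffi_closure),
                                                 (void **)&bound_fn);
        static ffi_cif cif;   /* must outlive main, since atexit runs later */

        /* void (*)(void) signature, matching what atexit() expects. */
        if (closure &&
            ffi_prep_cif(&cif, FFI_DEFAULT_ABI, 0, &ffi_type_void, NULL) == FFI_OK &&
            ffi_prep_closure_loc(closure, &cif, handler,
                                 (void *)"hello from a closure", bound_fn) == FFI_OK)
            atexit(bound_fn);

        return 0;   /* handler runs at exit with its bound context */
    }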


I still wish Apple blocks were standardized in C - like C++ lambdas, but they have a real type and can be kept alive after the containing function has ended, and they come with block versions of the various C standard library functions that take a function and argument (including atexit, though not signal).


You can keep a C++11 lambda alive after the stack frame disappears using copy semantics (the = modifier), which is what blocks are doing anyway; C++11, however, gives you more control, as whether a copy or a reference is taken is a property of the lambda, not the variable (as it is when using the __block modifiers). Also, not having a nameable type is a non-issue, due to templates, auto, and std::function (which is, in a way, the standard type you would actually use to deal with a lambda). If anything, lambdas have more type, as the type of a Block argument is erased in the type signature, even though it is required for the calling convention.


> You can keep a C++11 lambda alive after the stack frame disappears using copy semantics (the = modifier), which is what blocks are doing anyway

Ah, yes. Blocks allow you to keep mutable data alive without explicitly moving it to the heap; in any case, simply having them is more important than those differences. I am not sure what you mean about types being erased.


C++11 supports mutable lambdas. The only thing blocks seem to be able to do that C++11 lambdas do not is that multiple blocks are able to share mutable data (which at least makes the decision to make __block a storage class not stupid).

As for the type erasure, a block is pragmatically an instance of NSBlock, which is represented as "?" in the type signature and is morally equivalent to an object id. There is therefore an implicit conversion from blocks of all types to id.

Meanwhile, even in C++, there is an implicit conversion from id to blocks of any type signature. As blocks are objects and objects often need to be smuggled around, this leads to messy and non-obvious type conversions in programs that use them.

From a language design perspective, this fails to take into consideration the history of Objective-C; while object type casting in Objective-C is sloppy, selectors are supposed to fairly uniquely map to a specific type signature.

The reason this is important is that if you have a random object, you want to be able to send it an arbitrary message and not be concerned you got the type signature wrong; the worst that can happen is that the object doesn't understand.

However, blocks mix together the concept of invocation with selectors both physically (you can even send it an invoke message) and conceptually (as you can think of the call as a special selection syntax) and yet break this type guarantee.

Even worse, blocks didn't even have their type encoded at all when they were first designed, so in addition to this being awkward at compile time, it was even fundamentally impossible to catch this kind of type error at runtime.

Thankfully, they fixed that (although now blocks compiled with older compilers, which I believe Apple simply counts gcc in its entirety as these days, won't have this information), but they did not also solve any of the other type issues.


> C++11 supports mutable lambdas. The only thing blocks seem to be able to do that C++11 lambdas do not is that multiple blocks are able to share mutable data

Blocks are also a bit simpler if you want to continue to mutate the same variable both inside and outside of a lambda (but then have the variable stay alive after the outer function exits).

> As for the type erasure, a block is pragmatically an instance of NSBlock, which is represented as "?" in the type signature and is morally equivalent to an object id. There is therefore an implicit conversion from blocks of all types to id.

All objects are also represented with the same character, and all function pointers are ^?. This may be unfortunate, but isn't really about blocks; in pure C, of course, they are only erased inasmuch as all types are erased.

> The reason this is important is that if you have a random object, you want to be able to send it an arbitrary message and not be concerned you got the type signature wrong; the worst that can happen is that the object doesn't understand.

How do you mean? If you have a random `id`, or possibly an object casted to the wrong object type (isn't that the equivalent scenario, since you're talking about implicit casting from id to blocks of any type signature?) and send it a message for a selector it implements but with different types, the call will go through and probably crash, same as if you cast a block to the wrong type.


> ...and send it a message for a selector it implements but with different types...

I am not certain how to best correct the misconception here. Yes: if you do this, it will crash. However, as I stated, the history of this language sets us up so that this should not happen: a selector should uniquely identify a calling signature to the best of the abilities of the programmers involved (and generally it will by accident for anything but very short names).

The reality is that this is actually not just convenient but somewhat required, as the compiler has to select a specific calling convention at compile time; so if you are using an object in a situation where the selector is even remotely ambiguous, you get a compiler diagnostic whenever the libraries you are working with leave the selector's type signature ambiguous.

Again, with blocks, this all breaks: you can't even get a warning, because the user has to cast it first to a specific calling convention, and it isn't obvious in the code, because the cast is allowed to be implicit. The chance of an overlap is high, because the conceptual selector's name is effectively nothing. Objective-C mitigates this problem elsewhere: not here.

> All objects are also represented with the same character, and all function pointers are ^?. This may be unfortunate, but isn't really about blocks; in pure C, of course, they are only erased inasmuch as all types are erased.

A) It is not relevant that the class of an object is erased, because what's important is the type of a message implementation: you don't break the type system by having an object of the wrong class, only by sending a message with the wrong calling convention. It doesn't matter at all, therefore, whether objects are all "erased": there wasn't information there to begin with.

B) It is not relevant that C function pointers have their type erased, because they aren't also objects: the whole point of objects in Objective-C is that they are dynamically typed (back to point A: their type is irrelevant), so you are expected to use them as arguments to things like withObject: and then let them implicitly cast back on the other side. Again: selectors are king.

C) It is not relevant that in pure C types are "erased" as C does not have a runtime, so I'm not even certain how one could determine the difference between the types being erased and the types not being erased. The erasure wasn't even the core problem, remember: the erasure only makes it impossible to have runtime fixes for the actual problem, which is the compile-time selector slop.


> However, as I stated, the history of this language sets us up so that this should not happen: a selector should uniquely identify a calling signature to the best of the abilities of the programmers involved (and generally it will by accident for anything but very short names).

I see what you're saying... screwing up an id cast will usually not cause low-level indeterminacy (even though it might), but screwing up a block cast always will. This seems like a pretty minor issue to me, especially as blocks usually need not be passed as ids in the first place, and Objective-C is hardly a safe language in other respects (at least historically, with retain/release).


C++14 version: https://dpaste.de/a8g9D/

You obviously can't pass this to C code, but it's a portable and extremely efficient solution if you are working purely within a C++ code-base.


I don't think "curry" is quite the right name for that function. For a function

    int f(int, int, int);
the curried version of f,

    auto f_curried = curry(f);
must be able to take multiple arguments one-at-a-time, so

    f_curried(x)(y)(z) == f(x, y, z)
The code at https://dpaste.de/a8g9D/ just partially applies the function to one set of arguments, which isn't enough -- you have to keep partially applying until you get them all.


I knew I should have just called it bind.

As a bonus, an actual implementation of curry: https://dpaste.de/qGWcq/



Sure, a good hack, upvoted.

Though, not to state the obvious, but if you find yourself in need of a context for an atexit() callback, then you are doing something wrong. Similarly, if a library function accepts just a function pointer, then either it's a global callback, or it's a mistake in the library's design.


Can someone explain to me why you would do this very hacky thing instead of:

    void *global_ptr;
    int callback(){
        realcallback(global_ptr);
    }
    int main(){
        global_ptr = whatever;
        atexit(callback);
        return 0;
    }


Let's assume that realcallback has no use other than to be a callback function. Then, this suggestion is not much better than simply having realcallback use global_ptr directly.

Now say you wanted to have one invocation of realcallback with one argument, and another invocation with another. Then it makes sense for realcallback to exist - but you would still need two callback functions (and two globals). And if you needed three different invocations, you'd need three, and so on. This is all easy enough to do, but not extensible in any data-driven fashion.

The hack generates a separate thunk for each value, but at runtime, so you have as many callbacks as you have context arguments, and without needing to write them by hand ahead of time.

(Fixing the calling code is a much better solution, where available. You virtually always need a context parameter, and when you don't, it's easy enough to ignore. "Solving" the problem with this sort of ugly hack is just making more work for yourself and for others.)
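
The "fixed" calling convention is just the usual context-pointer pattern; a hypothetical registration API (names invented here for illustration) looks like:

    /* Hypothetical callback API with an explicit context parameter. */
    typedef void (*cb_fn)(void *ctx);

    static struct { cb_fn fn; void *ctx; } callbacks[32];
    static int n_callbacks;

    int register_callback(cb_fn fn, void *ctx)
    {
        if (n_callbacks >= 32)
            return -1;
        callbacks[n_callbacks].fn = fn;
        callbacks[n_callbacks].ctx = ctx;
        n_callbacks++;
        return 0;
    }

    void run_callbacks(void)
    {
        for (int i = n_callbacks - 1; i >= 0; i--)   /* reverse order, like atexit */
            callbacks[i].fn(callbacks[i].ctx);
    }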


For multiple functions taking different arguments, you could exploit atexit's specified calling order:

    #include <stdlib.h>   /* atexit */

    data_t *global;       /* read by callback(); set by set_global_to_x/y() */

    int main(void) {
      atexit(callback);
      atexit(set_global_to_x);
      atexit(callback);
      atexit(set_global_to_y);
    }
Since functions are called in reverse order of their installation using atexit, you end up with two calls to callback; one with global set to y and one with global set to x.


You wouldn't get on HN with that :-)

It might be more useful in cases where you want to have multiple callbacks and don't know how many at compile time. I would still try to avoid it, though.

(And as an aside, I would make global_ptr static. That makes it invisible outside its compilation unit. Even more pedantically: callback should be a void function.)


I might be wrong, but the concept of partial application implies the existence of lexical closures and an environment model of evaluation, so a partially applied function becomes a closure over its argument. This is how currying is useful. This article, then, describes some nice hack, but I can't see how it is related to partial application as we know it.


Good thing he patches the function constant before the parameter constant, or there could be a denial of service attack if the function argument can be chosen (to abad1dea) by an attacker.





