A C implementation of objc_msgSend would look like: ... objc_msgSend(id self, SE...

lgg · on July 1, 2017

It actually is not the extra function call that is the big hit, since if you think about it objc_msgSend also does two calls (the call to msgSend, which at the end then tail calls the imp). The dynamic instruction count is also roughly the same.

In fact objc_msgLookup actually ends up being faster in some microbenchmarks, since it plays a lot better with modern CPU branch predictors: objc_msgSend defeats them by making every call site jump to the same dispatch function, which then makes a completely unpredictable jump to the IMP. By using msgLookup you essentially decouple the branch source from the lookup, which greatly improves predictability. Also, with a “sufficiently smart” compiler it can be a win because it allows you to do things like hoist the lookup out of loops, etc. (essentially really clever automated IMP caching tricks).
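
To make the loop-hoisting point concrete, here is a minimal sketch in plain C. It is not the real runtime: `toy_object`, `toy_class`, `toy_lookup`, and `lookup_count` are hypothetical stand-ins, and the selector is just a C string. The hoisted version is only valid if the receiver's class cannot change inside the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for the runtime; every name here is hypothetical. */
typedef struct toy_object toy_object;
typedef long (*toy_imp)(toy_object *self, const char *sel, long arg);

typedef struct toy_class {
    const char *sel; /* the one selector this toy class responds to */
    toy_imp imp;
} toy_class;

struct toy_object {
    const toy_class *isa;
    long total;
};

static long add_imp(toy_object *self, const char *sel, long arg) {
    (void)sel;
    self->total += arg;
    return self->total;
}

const toy_class counter_class = { "add:", add_imp };

unsigned lookup_count; /* counts trips through the slow lookup path */

static toy_imp toy_lookup(const toy_class *cls, const char *sel) {
    lookup_count++;
    return strcmp(cls->sel, sel) == 0 ? cls->imp : NULL;
}

/* msgSend style: the lookup is fused with every single call. */
long sum_with_send(toy_object *obj, long n) {
    long result = obj->total;
    for (long i = 1; i <= n; i++) {
        toy_imp imp = toy_lookup(obj->isa, "add:");
        assert(imp != NULL);
        result = imp(obj, "add:", i);
    }
    return result;
}

/* msgLookup style: one lookup, then direct calls through the IMP. */
long sum_with_hoisted_lookup(toy_object *obj, long n) {
    toy_imp imp = toy_lookup(obj->isa, "add:");
    assert(imp != NULL);
    long result = obj->total;
    for (long i = 1; i <= n; i++) {
        result = imp(obj, "add:", i);
    }
    return result;
}
```

Both functions compute the same sum; the difference is that the second one goes through `toy_lookup` once instead of once per iteration.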

There are also a number of minor regressions, like now you are doing some of the work on a stack frame (which might require spilling if you need a register, vs avoiding spills by using exclusively non-preserved registers in an assembly function that tail calls). In the end what kills it is that the profile of most Objective-C code is large flat sections that do not really benefit from the compiler tricks or the improved prediction, and the added call site instructions result in increased binary sizes and negative CPU i-cache impacts.

mikeash · on July 2, 2017

Interesting! Making two separate calls at the call site would have some extra overhead compared to what objc_msgSend does. The caller needs to load the self and _cmd arguments twice, for example, and stash the IMP somewhere convenient in between the two calls. If objc_msg_lookup has a standard prologue and epilogue then you'll end up running two of those each time. You'll push and pop two return addresses on the stack rather than just one.

However, I'll happily accept that these are probably pretty small costs, especially since so much of it is just register gains which probably result in cost-free renamings in the hardware. It makes sense that the i-cache impact is more important.

dfox · on July 1, 2017

Having lookup instead of send as the primitive operation also allows you to generate code like this for the call site:

    ({
      static Class last_isa = NULL;
      static IMP last_imp = NULL;
      if (object->isa != last_isa){
        last_isa = object->isa;
        last_imp = lookup(object->isa, selector);
      }
      last_imp(object, selector, arguments...);
    })

(modulo the fact that you cannot generate this by dumb string substitution without compiler extension like gcc's ({...}))
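
A runnable toy version of that call-site cache, as a sketch only: the `shape` types, `lookup_sides`, and the `slow_lookups` counter are made up for illustration, and the wrapper function `call_sides` stands in for a single call site (which is what gives the `static` cache its per-site identity). Like the pseudocode, it is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature object model: one method slot per class. */
typedef struct shape shape;
typedef int (*shape_imp)(shape *self);
typedef struct shape_class { shape_imp sides; } shape_class;
struct shape { const shape_class *isa; };

static int triangle_sides(shape *self) { (void)self; return 3; }
static int square_sides(shape *self) { (void)self; return 4; }

const shape_class triangle_class = { triangle_sides };
const shape_class square_class = { square_sides };

unsigned slow_lookups; /* counts trips through the slow path */

static shape_imp lookup_sides(const shape_class *cls) {
    slow_lookups++;
    return cls->sides;
}

/* Stands in for ONE call site with its own monomorphic inline cache. */
int call_sides(shape *object) {
    static const shape_class *last_isa = NULL;
    static shape_imp last_imp = NULL;
    if (object->isa != last_isa) {
        /* Cache miss: redo the lookup and remember the class. */
        last_imp = lookup_sides(object->isa);
        last_isa = object->isa;
    }
    assert(last_imp != NULL);
    return last_imp(object);
}
```

As long as consecutive receivers at this site share a class, only the first call pays for `lookup_sides`; a receiver of a different class evicts the cached entry.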

Smalltalk/X takes this to the extreme by compiling all sends into code like:

    {
       static struct cache cache = {.imp = &magic_global_method, .class = NULL};
       cache.imp(&cache, object, selector, arguments...);
    }

And generates something like this into every method prologue:

    if (cache && self->isa != cache->class){
      cache->class = self->isa;
      cache->imp = lookup(self, selector);
      return cache->imp(NULL, self, selector, arguments...);
    }

It looks convoluted and uses one additional word of stack space per call, but does not contain any unpredictable indirect branches in the fast path (and in fact reduces overall code size as it can be expected that there are many more sends than methods).
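
Here is a small C sketch of that scheme as I read the description above; it is not taken from Smalltalk/X itself. All names (`send_cache`, `unlinked_send`, `relink_and_call`, the `animal` types) are hypothetical, there is a single selector, and it is single-threaded only.

```c
#include <assert.h>
#include <stddef.h>

typedef struct animal animal;
typedef struct send_cache send_cache;
/* Every method takes the caller's cache as a hidden first argument. */
typedef int (*legs_imp)(send_cache *cache, animal *self);

typedef struct animal_class { legs_imp legs; } animal_class;
struct animal { const animal_class *isa; };
struct send_cache { legs_imp imp; const animal_class *cls; };

unsigned slow_lookups; /* counts real lookups */

static legs_imp lookup_legs(animal *self) {
    slow_lookups++;
    return self->isa->legs;
}

/* Shared miss path: repoint the caller's cache, then call the right
   method with a NULL cache so its prologue check is skipped. */
static int relink_and_call(send_cache *cache, animal *self) {
    cache->cls = self->isa;
    cache->imp = lookup_legs(self);
    assert(cache->imp != NULL);
    return cache->imp(NULL, self);
}

int dog_legs(send_cache *cache, animal *self) {
    /* Validity-checking prologue: was this call meant for my class? */
    if (cache && self->isa != cache->cls) return relink_and_call(cache, self);
    return 4;
}

int bird_legs(send_cache *cache, animal *self) {
    if (cache && self->isa != cache->cls) return relink_and_call(cache, self);
    return 2;
}

const animal_class dog_class = { dog_legs };
const animal_class bird_class = { bird_legs };

/* Initial target of every call site, playing the role of the
   "magic global method": it always takes the miss path. */
static int unlinked_send(send_cache *cache, animal *self) {
    return relink_and_call(cache, self);
}

/* Stands in for ONE call site. */
int call_legs(animal *object) {
    static send_cache cache = { unlinked_send, NULL };
    return cache.imp(&cache, object);
}
```

The call site itself contains no comparison at all: it just calls through its own cache slot, and the callee decides whether it was the right target.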

mikeash · on July 2, 2017

That is very cool. Can this approach be made thread safe while still being fast?

dfox · on July 2, 2017

It is safe as long as everything that can get into the cache starts with the validity checking prologue and there is only one thread. Making this thread-safe is probably non-trivial.

tom_mellior · on July 1, 2017

> There's no way to express that args... argument when calling the function pointer

Yes there is: va_list.

> no way to express forwarding an arbitrary return value

Of course there is, and lots and lots of language runtimes implemented in C use those ways. Usually it boils down to having a base type called Object or Value and passing around pointers to that. In fact, from your example it looks like the "id" type is meant to play this role.

This is not syntax checked, but the code above would be something like:

    Object *objc_msgSend(id self, SEL cmd, ...) {
        Object *(*fptr)(id, SEL, va_list) = ...lookup code...
        va_list args;
        va_start(args, cmd);  /* the va_list comes first, then the last named parameter */
        Object *result = fptr(self, cmd, args);
        va_end(args);
        return result;
    }

Yes, this can be faster in assembly, but it's not true that there is no way to express this. (Unless I'm misunderstanding something.)
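
For reference, a compilable toy version of this idea, under the assumption that every method is compiled to take a `va_list`. That assumption is a change to the method ABI and is not how real Objective-C methods are compiled. `toy_msg_send`, `toy_lookup`, and `add_method` are hypothetical names, and the return type is fixed to `long` to keep the sketch short.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* ASSUMPTION: every method has this one shape and unpacks its own
   arguments with va_arg. */
typedef long (*va_method)(void *self, const char *sel, va_list args);

static long add_method(void *self, const char *sel, va_list args) {
    (void)self;
    (void)sel;
    /* The method has to know its own argument types and pull them out. */
    long a = va_arg(args, long);
    long b = va_arg(args, long);
    return a + b;
}

static va_method toy_lookup(const char *sel) {
    return strcmp(sel, "add:to:") == 0 ? add_method : NULL;
}

long toy_msg_send(void *self, const char *sel, ...) {
    va_method fptr = toy_lookup(sel);
    assert(fptr != NULL);
    va_list args;
    va_start(args, sel); /* va_list first, then the last named parameter */
    long result = fptr(self, sel, args);
    va_end(args);
    return result;
}
```

A call looks like `toy_msg_send(NULL, "add:to:", 2L, 3L)`. Note that the caller must pass arguments with exactly the types the method will read via `va_arg`, since nothing checks them.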

mikeash · on July 1, 2017

These are ways to simulate it. Of course you can simulate it; the language is Turing-complete, after all. But it does not actually do it. You can write something similar to objc_msgSend in C, but you cannot write objc_msgSend in C.

Using varargs and passing va_list into the method would mean that your method is no longer a plain C function with the declared parameters plus two hidden parameters. It's now a different sort of beast, and has to use va_ calls to extract the values. This would require a lot more work in the method, and hurt performance.

Returning everything as an object would mean boxing and unboxing primitive values at every call, which would be horrendously inefficient.

And if you don't care about extracting every last bit of performance, it's much easier to do the lookup approach I discussed than it is to faff around with varargs and wrapping return values.
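
A rough sketch of what the lookup approach can look like in plain C, using made-up names (`generic_imp`, `toy_lookup`, `widen`, `struct range`). The lookup returns an untyped function pointer and the call site, which knows the method's real signature, casts it back before calling. This works for raw C argument and return types because the actual call is an ordinary typed C call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Untyped function pointer, only ever called after casting back. */
typedef void (*generic_imp)(void);

struct range { double lo, hi; };

/* A "method" that takes a raw double and returns a raw struct:
   self, the selector, then the declared parameter. */
static struct range widen(void *self, const char *sel, double by) {
    (void)sel;
    const struct range *r = self;
    struct range out = { r->lo - by, r->hi + by };
    return out;
}

static generic_imp toy_lookup(const char *sel) {
    return strcmp(sel, "widen:") == 0 ? (generic_imp)widen : NULL;
}

/* Stands in for a call site: look up, cast to the real prototype, call. */
struct range call_widen(struct range *self, double by) {
    typedef struct range (*widen_fn)(void *, const char *, double);
    widen_fn imp = (widen_fn)toy_lookup("widen:");
    assert(imp != NULL);
    return imp(self, "widen:", by);
}
```

The cast is the key step: calling through `generic_imp` directly would be undefined behavior, while casting back to the function's true type before the call is well-defined C.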

CodeWriter23 · on July 1, 2017

@mikeash it looks like you might have a topic for an upcoming Friday.

revelation · on July 1, 2017

va_* is not a simulation. It compiles down to the exact same stack accesses. There is no list. It is a plain C function. It is the same calling convention. No boxing.

This is plain false.

lgg · on July 1, 2017

It depends on the platform's C ABI, but no, the argument marshaling for va_args is not necessarily (or even usually) the same as normal args. In the case of iOS you can look here[1], the relevant bit being: "The iOS ABI for functions that take a variable number of arguments is entirely different from the generic version."

This actually manifests in errors if you directly call objc_msgSend, which is why in order to guarantee correct codegen you need to cast objc_msgSend to the actual prototype you want[2]:

"An exception to the casting rule described above is when you are calling the objc_msgSend function or any other similar functions in the Objective-C runtime that send messages. Although the prototype for the message functions has a variadic form, the method function that is called by the Objective-C runtime does not share the same prototype. The Objective-C runtime directly dispatches to the function that implements the method, so the calling conventions are mismatched, as described previously. Therefore you must cast the objc_msgSend function to a prototype that matches the method function being called."

1: https://developer.apple.com/library/content/documentation/Xc...

2: https://developer.apple.com/library/content/documentation/Xc...

revelation · on July 2, 2017

This is C, I'm talking C calling convention (and x64, which is the same). Caller cleans up the stack, so va_list is a zero cost abstraction.

Citing the bastard architecture of iOS isn't really making the case for "usually".

mikeash · on July 2, 2017

Requiring the caller to put all arguments on the stack isn't "zero cost." For a non-variadic call on ARM64, the first eight parameters (or more, if some are floats) will be passed in registers without ever touching the stack.

On x86-64, the caller also has to set %al to the number of vector registers used for the call, and the compilers I've seen always check %al and conditionally save those registers as part of the function prologue. Cheap, but not "zero cost."

revelation · on July 2, 2017

va_ doesn't change the calling convention. Parameters passed as registers continue to be passed as registers.

We could probably argue this some more but I suggest you simply try it with a compiler..

mikeash · on July 2, 2017

Good idea!

https://gist.github.com/mikeash/ce38d3a77b88734a9e0e9dc3f352...

You'll notice how `normal` takes all of its arguments out of registers `x0` through `x7` and places them on the stack for the call to `printf`. And you'll notice how `vararg` plays a bunch of games with the stack and never touches registers `x1` through `x7`. (It still uses `x0` because the first argument is not variadic.)

On the caller side, observe how `call_normal` places its values into `x0` through `x7` sequentially and then invokes the target function, while `call_vararg` places one value into `x0` and places everything else on the stack.

So, no, it looks to me like varargs very much change the calling convention.
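
If you want to repeat the experiment on your own platform, a minimal pair of functions along these lines is enough (this is a fresh sketch, not the contents of the gist; the function names merely mirror the ones mentioned above). Compile it to assembly, for example with `cc -O1 -S`, and compare how the two callers set up their arguments and how the two callees read them. What you see depends on the platform ABI.

```c
#include <assert.h>
#include <stdarg.h>

/* Fixed parameter list: the ABI decides where each argument lives. */
long normal(long a, long b, long c, long d) {
    return a + b + c + d;
}

/* One named parameter, then three variadic longs read with va_arg. */
long vararg(long a, ...) {
    va_list ap;
    va_start(ap, a);
    long sum = a;
    for (int i = 0; i < 3; i++) {
        sum += va_arg(ap, long);
    }
    va_end(ap);
    return sum;
}

/* The two call sites to compare in the generated assembly. */
long call_normal(void) {
    return normal(1, 2, 3, 4);
}

long call_vararg(void) {
    /* Variadic arguments are passed as long to match va_arg above. */
    return vararg(1, 2L, 3L, 4L);
}
```

Both callers return the same value; the interesting part is the argument-passing code around the calls, which is where any ABI difference for variadic functions shows up.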

mikeash · on July 1, 2017

The "exact same stack accesses" as reading arguments directly from the registers they're passed in?

revelation · on July 2, 2017

Now you're playing ignorant. Feel free to substitute stack accesses with register reads, but since we're talking "variable args" I feel you're going to run out of those quickly.

mikeash · on July 2, 2017

I'm not playing ignorant, I'm pointing out a very real difference between reading variadic arguments with va_arg and reading normal arguments with plain code. Normal arguments typically get read straight out of their corresponding registers, whereas va_arg reads from a stack entry. It is not the exact same code and it is not the same calling convention.

Please don't say things like "This is plain false" when you say things like this which are, well, just plain false.

tom_mellior · on July 1, 2017

> These are ways to simulate it.

Come on. Taking an argument list in an argument list object and passing that argument list as an argument list to a function is exactly what this was about. It's not a "simulation". It's a C feature for capturing and passing around argument lists. It actually does it.

> has to use va_ calls to extract the values. This would require a lot more work in the method

Loads from the stack at fixed offsets for every argument instead of having some of the arguments in registers and loading others from the stack at fixed offsets. Yes, that is more work.

> Returning everything as an object would mean boxing and unboxing primitive values at every call, which would be horrendously inefficient.

True, the runtimes I was thinking of box many things. But you can type-pun pointers to other things, so you don't necessarily have to box everything. I don't know enough about Objective-C's constraints, but I do note that the linked article did talk about using tagged pointers already.

mikeash · on July 2, 2017

Come on yourself. I'm talking about what the actual objc_msgSend actually does. And to make sure I'm clear, what it actually does is get called with arbitrary parameters, pass those arbitrary parameters on to an unknown function implemented to take them as standard C parameters, and then that unknown function returns an arbitrary return value back to the caller.

You cannot implement this with plain C. That's a simple fact. If your idea doesn't work for a method that, say, takes a double and returns an fd_set as raw C types then your idea doesn't do what objc_msgSend does.

Yes, you can shift the problem around and come up with a system that you can implement in plain C. I outlined one approach for that, and you've outlined another. Nothing wrong with that, but it's not solving the same problem. So feel free to elaborate on other ways that it could be done, but don't tell me I'm wrong because you've come up with a way to solve a similar but different problem.

connorcpu · on July 1, 2017

The missing bit is that objc_msgSend doesn't know how many parameters are being forwarded, and fptr is just a normal function on the other end. It isn't expecting a va_list; it expects arguments to be passed in the C ABI exactly how they're passed to objc_msgSend.

icodestuff · on July 1, 2017

fptr doesn't take a va_list, it takes the actual arguments. Also this would leave useless objc_msgSend stack frames at every other level of the stack. There's no way to force the compiler to generate a tail call from inside C. Additionally, you'd have to have callers unbox primitives when fptr returns a C type - the language specifies being a superset of C, so all the C types that the compiler otherwise supports have to work. "id" is not a supertype of int, float, etc.

And yes, it is significantly faster. Avoiding writing assembly seems like an awfully odd goal to have for a language runtime.

chrisseaton · on July 1, 2017

> fptr(self, cmd, args);

But fptr doesn't take a va_list - it takes the actual arguments.

tom_mellior · on July 1, 2017

Presumably, as an implementor of an Objective-C compiler, I would choose how to compile methods, and I might just choose to compile methods to functions taking (something memory-compatible with) va_list.

plorkyeran · on July 1, 2017

The functions which are called by objc_msgSend do not have to be methods compiled by an obj-c compiler. You can add functions compiled by a compiler with no knowledge of or support for obj-c to an obj-c class at runtime, and then call that method via objc_msgSend.

Obviously you could come up with other ways of passing arguments to obj-c methods which would make it possible to implement your message send function in pure C, but a message send function which passes the arguments as a va_list is not objc_msgSend(), and that says nothing about whether or not objc_msgSend() could be implemented for the design they did go with in pure C.

icodestuff · on July 1, 2017

But then you can't call those methods from C yourself (expecting va_list, received arguments). And what about adding methods to classes from plain C functions? Do you duplicate those functions with a va_list version? Seems like that'll add quite a bit of bloat.

tom_mellior · on July 1, 2017

<shrug> I corrected a post saying "X is impossible in C". That's a different issue from whether X is as efficient in C as in assembly language.

chrisseaton · on July 1, 2017

I think the point was that you cannot implement this particular function in C - a function that forwards arguments to other functions, given a particular standard ABI. You've changed the requirements by saying that now the other functions use a different ABI, so you've missed the point of why it's impossible to implement this function, in these circumstances, in C.

tom_mellior · on July 2, 2017

The original question in this thread was whether it was really impossible, under any circumstances, to write this in C, as part of an Objective-C implementation that you fully control. It is possible in C, if you can control the ABI. It's the people who insist on a particular pre-existing ABI who are changing the question.

Anyway, I think I've said all I'm going to say here.