ABIs tend to implicitly pass by reference anyway for structs larger than two words or so. (For example, in the AMD64 ABI, two 64-bit words is the cutoff for both arguments and return values. On 64-bit Windows, the limit is one word for both, and on 32-bit ARM, it's four for arguments but one for return values.)
What the parent is saying is that no, the struct will be loaded into registers and passed there. Here is a simpler example: https://godbolt.org/g/DcHgMr
As you can see, structs used to be on the passed stack on x86-32. Most other/newer architectures use ABIs that pass the initial struct elements in registers.
There are two orthogonal things going on. And I was actually mistaken as to certain specifics. So if only to clear up my own confusion, let me go into a somewhat unnecessary level of detail. TL;DR, though:
- To answer ericbb's concern, the compiler will make copies as necessary to preserve pass-by-value semantics; it just sometimes looks like pass-by-reference at the assembly level. (But "pass on the stack" isn't quite the right wording either; I was referring to a special case of that.)
- Only very small structs are worth passing or returning by value, with the exact limit depending on the platform (and often unnecessarily low).
Anyway.
First, as you (tom_mellior) mention, most ABIs have a list of registers used for function arguments; if there are too many arguments (or on 32-bit x86, any arguments), the remainder are placed on the stack, starting at a fixed offset from where the stack pointer points at function entry.
Second, a struct argument can be passed to a function in three different ways.
The fastest way is to pass each struct field in order as if it were a separate function argument.[1] Thus each field could go in a register or, if all the registers have been used, on the stack. As I said, this is often done only for structs under a certain size: 1 word on Windows x86-64, 2 words on ARM64 and non-Windows x86-64. But actually, some ABIs have no limit, like 32-bit ARM and some less popular architectures. (Thus I shouldn't have said ARM's limit was 4. 4 is the number of registers available for arguments in total, but that's different. The ABI doesn't change behavior for individual arguments based on the struct size. And even a huge struct can have the first few fields passed in registers.)
Another way is to pass a pointer to the struct in place of the struct itself. This is what happens on Windows x86-64 (for structs larger than 1 word) and ARM64 (for structs larger than 2 words). The semantics are still pass-by-value, so the caller generally needs to make a copy of the struct on the stack (at least if the original value is used afterward), but that's its own business; the callee just sees a pointer.
This is what I meant by "implicitly pass by reference", at least for arguments (see below about returns). At best, it's identical at the assembly level to using a real pointer argument, and thus equally efficient; at worst, it's less efficient, since the compiler may emit a copy when it could have just passed a pointer to an existing copy. This might be because the language semantics don't guarantee the existing copy won't be modified, but it can happen even if they do: struct arguments/returns aren't that popular in C and C++, so even modern compilers don't always optimize them as well as they could. This is a problem for Rust.
The third way - which I've only seen on non-Windows x86-64, and which I had confused with the previous one - is to pass the struct like excess arguments, except forced to be on the stack, skipping over any registers that may be available. Thus, if you have
void func(int x, struct foo foo, int y);
then 'x' and 'y' would be put in the first two registers, while 'foo' would go on the stack, at the aforementioned fixed offset. On the other hand, if there were a lot of preceding arguments that already filled up all the registers:
void func(int a, int b, int c, int d, int e, int f, int x, struct foo foo, int y);
...then 'foo' would have to be on the stack anyway, so the two-word limit makes no difference. The stack (starting at the fixed offset) will contain 'x', then 'foo', then 'y'.
Anyway, in this case too it's usually equally efficient to use a real pointer, though it's not identical at the assembly level. If you use a real pointer, and the pointer fits into a register, then to load a field of a struct, the function has to load from that register, which is one instruction:
mov %rax, 0(%rdi)
If you pass by value, and it's forced onto the stack, then the function has to load from the stack, which is still one instruction:
mov %rax, 8(%rsp)
Though if the pointer itself ends up on the stack, then of course it needs two instructions.
Separately, performance can also differ greatly due to the cache. In general, the top of the stack is almost certainly in L1 cache, while some random pointer might not be. But if the choice is between the callee loading the pointer versus the caller loading the same pointer to copy it to the stack, then that doesn't really matter.
...Then there's return values, which are a bit simpler, since each function has only one return value. Every ABI I know of uses uses either one or two registers for return values. (Even legacy x86 uses registers for returns, rather than the stack.) The limit for struct return values is zero, one, or two words. Larger return values are done by having the caller pass an out pointer, usually treated like an extra function argument. So here too, for large structs it's at most equally efficient to use an explicit out pointer, often more efficient.
Note that annoyingly, the limit is often smaller than you'd expect from the number of registers available. On 32-bit x86, struct return values always go by out pointer, even if the struct contains a single int. On 32-bit ARM, single-field structs are OK, but annoyingly, a struct containing two 32-bit values cannot be returned in two registers, even though two registers are used in other cases - namely, to return a single 64-bit value.
[1] Alternatively, the struct may have its original layout divided into words, and have those words passed as separate arguments. The difference comes when struct fields are smaller than machine words. If you have
struct example { uint32_t a, b; };
then on non-Windows x86-64, passing 'struct example' as an argument will pack both 'a' and 'b' into a single 64-bit register (rdi), whereas if you have two uint32_ts as standalone arguments, they will be passed in separate registers (rdi and rsi).
Instead of passing the struct in registers (rdi, rsi, rdx) or passing a pointer to the object (pass by reference), the caller places the object on the stack and the callee knows where to find it just based on the type of the function (it's at [rbp+0x10]).
On the other hand, if you delete the c field from the struct type, then the beginning of foo() looks like this:
In that case, the struct was passed in via registers rdi, rsi (similar to tom_mellior's example) and subsequently copied to the stack (at [rbp-0x10] this time).
So, in neither case would I consider it "pass by reference" because neither leads to a pointer to the x object being passed in as an argument.
It's interesting to learn that Windows does it differently and does in fact use what I'd call "pass by reference". The only ABI I'm (somewhat) familiar with is Linux x86-64.
Your point about the annoying limitations on return values is a good one too. I had the thought at one point that it'd be possible to uniformly use structs with just one field instead of typedefs but the limitations on return value calling conventions mean that the two alternatives do not lead to equivalent code, surprisingly.