I have in the past had this very romantic belief that when you pass a struct by value the compiler will lovingly select the right combination of registers to carefully craft the perfect packing of your arguments, for maximum efficiency. Unfortunately, that is not how reality works. This post is a short introduction to calling conventions with examples.
Here is the code we are going to look at. We’ve got structs, we’ve got functions that take those structs as parameters, and we’ve got functions calling those other functions. The only noteworthy thing is that the outer functions (g1, g4) also do some work after calling the inner functions (they increment x), so the compiler actually has to emit a call to the inner functions (instead of just jumping to them):
struct F1 {
    float x0;
};
struct F4 {
    float x0;
    float x1;
    float x2;
    float x3;
};
void f1(F1);
void f4(F4);
int g1(int x) { f1({}); x++; return x; }
int g4(int x) { f4({}); x++; return x; }
Let’s look at the codegen for both MSVC (19.43) and Clang (20.1.0) using Godbolt (link), both with full optimizations (-O3 for Clang, /O2 for MSVC).
The F1 struct
Before we can look at passing the F4 struct, let’s understand the case of the smaller struct. For g1 on MSVC we find this:
$T1 = 48
int g1(int) PROC
    push rbx
    sub rsp, 32
    xor eax, eax
    mov ebx, ecx
    mov ecx, eax
    mov DWORD PTR $T1[rsp], eax
    call void f1(F1)
    lea eax, DWORD PTR [rbx+1]
    add rsp, 32
    pop rbx
    ret 0
int g1(int) ENDP
Clang produces this somewhat shorter version:
g1(int):
    push rbx
    mov ebx, edi
    xorps xmm0, xmm0
    call f1(F1)@PLT
    inc ebx
    mov eax, ebx
    pop rbx
    ret
“Aha! Clang good, MSVC bad”, you might think, but it is not that simple. What we see here is not so much a difference in the quality of the compilers as a difference in what we are targeting: MSVC assumes we are compiling for x64 Windows, while Clang assumes that this is probably some x64 Linux target. It turns out that how you call a function and pass parameters to it differs between the two: the platforms have different ABIs with different “calling conventions” (the rules for how you call a function). So just because you can write some sequence of valid machine instructions does not mean that it satisfies all of the assumptions and requirements of the underlying operating system.
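Which convention you get is decided by the compilation target, not by the source code. As a minimal sketch, the following reports which x64 convention a translation unit is being compiled against. The function name default_calling_convention is made up for this illustration, and the macro checks are an assumption that covers the usual MSVC, Clang, and GCC setups rather than every toolchain.

```cpp
#include <string>

// Hypothetical helper (not part of any library): names the default x64 calling
// convention of the current compilation target, based on predefined macros.
inline std::string default_calling_convention() {
#if defined(_WIN64) && (defined(_M_X64) || defined(__x86_64__))
    // 64-bit Windows on x64 (MSVC, clang-cl, MinGW).
    return "Windows x64";
#elif defined(__x86_64__)
    // Linux, macOS, BSD and friends on x64.
    return "System V AMD64";
#else
    // Anything that is not x64 is outside the scope of this post.
    return "other";
#endif
}
```

Compiling the same file with MSVC and with Clang on Linux yields two different answers, which is exactly the difference visible in the two listings above.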
What is MSVC doing?
Let’s look a little bit more closely at the code produced by MSVC:
- MSVC starts off by pushing rbx to the stack. At the end of the function, rbx is popped off the stack again, so we are effectively “saving” the original value of rbx to the stack and restoring it at the end of the function.
- Then we allocate 32 bytes of stack space (sub rsp, 32), which we release again at the end of the function (add rsp, 32). Note that we don’t touch much of the stack space we allocate.
- Then we set eax to zero (xor eax, eax), move ecx into ebx (mov ebx, ecx), and put eax into ecx (mov ecx, eax). This sequence makes sense if you understand the calling convention: Windows x64 mandates that the first integer argument to a function sits in rcx, of which ecx is the lower 32 bits. So the initial use of ecx is to hold the parameter x, which we put into ebx. Then we put zero into ecx before calling f1, because we call f1 with a zero-initialized struct.
- Then we also write zero to the stack (mov DWORD PTR $T1[rsp], eax). The syntax is a bit funky but essentially writes to rsp + 48, which is outside of the stack space we just allocated. Weird!
- After the call, we add one to rbx and store the result in eax (lea eax, DWORD PTR [rbx+1]). It is very common to use lea (“Load Effective Address”) to perform computation, and while it might look like we are reading from memory and using a pointer (square brackets! DWORD PTR!), none of this is happening. That’s just there because the original intention for lea is to compute offsets from a pointer. All of this makes sense once you understand that the calling convention prescribes that integer return values be in eax, and g1 returns x + 1.
This leaves three questions:
- What’s up with rbx? If you look closely, g1 puts its argument into ebx before calling f1, and then once f1 is done we can just assume that rbx (the 64-bit version of ebx) still contains the value we moved into it before the call. That only works because the calling convention guarantees that rbx must be the same after a function call. This also explains why g1 pushes rbx to the stack and pops it off again at the end: g1 must abide by the ABI as well and must guarantee to callers that on function exit rbx is unmodified. In Windows x64 ABI terminology, “rbx is a non-volatile register”. There are also volatile registers, where all bets are off after a function call.
- Why do we allocate 32 bytes of stack space and then barely use it? On Windows x64, any function you call may assume that before it has been called, the caller has allocated enough stack space for it to be able to write all of its arguments to the stack without allocating new stack space. The minimum amount that is ever allocated is 32 bytes, which corresponds to 4 arguments passed in 64-bit registers.
- Why do we write to the stack outside of the stack allocation? As just explained, every function is guaranteed to have enough space on the stack right before it is called to store at least four 8-byte values (so it can save its arguments). The particular slot this write is targeting is that of the first argument. So at least we know where we are writing to. I do not have a good answer for why this write is happening, and I would attribute it to bad codegen. Notably, this write disappears when you do f1({1}) instead, or when you make F1 contain an integer instead of a float.
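The offset arithmetic behind that rsp + 48 can be written down explicitly. This is only a sketch under the layout described above: the helper home_slot_offset and the constant names are invented for this illustration, and all offsets are relative to rsp after the prologue of g1 has run.

```cpp
// Bytes that sit between g1's own stack allocation and its caller's home slots.
constexpr int kReturnAddressBytes = 8;  // pushed by the call instruction that entered g1
constexpr int kHomeSlotBytes = 8;       // one 64-bit home slot per register argument

// Hypothetical helper: offset from rsp to the caller-provided home slot of
// register argument `index` (0 to 3), given the function's own stack usage.
constexpr int home_slot_offset(int local_bytes, int pushed_bytes, int index) {
    return local_bytes + pushed_bytes + kReturnAddressBytes + index * kHomeSlotBytes;
}

// g1 allocates 32 bytes (sub rsp, 32) and pushes rbx (8 bytes).
static_assert(home_slot_offset(32, 8, 0) == 48, "home slot of g1's first argument, $T1 = 48");

// The minimum space a caller reserves: four 64-bit register arguments.
static_assert(4 * kHomeSlotBytes == 32, "matches sub rsp, 32");
```

So the 48 is not arbitrary: it skips g1's 32 bytes, the saved rbx, and the return address, and lands in the first of the four slots that g1's own caller had to reserve.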
What we have not yet touched upon is that the struct F1 contains a float, but MSVC decided to put it into an integer register (rcx). While the Windows x64 ABI clearly states that you need to pass floating point arguments in XMM registers, this does not apply here: we pass a struct containing a float, not the float itself, and that struct just happens to fit into a register.
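To make the “struct that happens to fit into a register” rule concrete, here is a small sketch. It is my simplified reading of the Windows x64 rule for structs passed by value, not a full implementation of the ABI, and both function names are made up for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct F1 { float x0; };
struct F4 { float x0, x1, x2, x3; };

// Simplified rule (assumption: plain structs passed by value): sizes of 1, 2,
// 4, or 8 bytes travel in a register as if they were an integer of that size.
// Every other size is passed as a pointer to a copy made by the caller.
constexpr bool win64_struct_fits_register(std::size_t size) {
    return size == 1 || size == 2 || size == 4 || size == 8;
}

// Hypothetical helper: the value an integer register would carry for a given
// F1, which is simply the raw bit pattern of the float inside it.
inline std::uint32_t f1_register_bits(F1 value) {
    static_assert(sizeof(F1) == sizeof(std::uint32_t), "F1 is a single 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

For a zero-initialized F1 the bit pattern is 0, which is why a plain xor eax, eax followed by mov ecx, eax is enough to set up the argument.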
Relevant reading: the MSDN pages on x64 calling conventions and on stack usage.
What is clang doing?
We will move a bit quicker here. This code is compiled for Linux x64, and Linux x64 uses the System V ABI. By coincidence, rbx is also non-volatile there, which is why it is used in the same way as on MSVC. The argument for f1 is passed in XMM0. The rules for determining how each argument is passed are more complicated on System V, but in this case they result in using an SSE register.
The System V ABI is documented in the PDF in this repository (it took me multiple unsuccessful attempts to understand it; it is less straightforward than the Windows x64 ABI).
The F4 struct
Now for the larger struct. Here is what MSVC is doing:
int g4(int) PROC
    push rbx
    sub rsp, 48
    mov ebx, ecx
    xorps xmm0, xmm0
    lea rcx, QWORD PTR $T1[rsp]
    movdqa XMMWORD PTR $T1[rsp], xmm0
    call void f4(F4)
    lea eax, DWORD PTR [rbx+1]
    add rsp, 48
    pop rbx
    ret 0
int g4(int) ENDP
Most of this is identical to the F1 case. The difference is that it clears XMM0 (xorps xmm0, xmm0), writes it to the stack, and loads the address of the value we wrote to the stack into rcx. It is impossible to tell from this specific callsite alone, but this is what is happening: the calling convention does not allow you to pass types larger than 64 bits in a register. Our type is 128 bits, which would fit into XMM0, but by convention we can’t use that. We instead have to put the argument onto the stack (that’s where the 16 extra bytes of stack allocation come from), load its address into rcx, the register for the first argument, and thereby pass a pointer to the function.
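Expressed back in C++, the MSVC version behaves roughly as if the code had been written with an explicit pointer. The names f4_lowered and g4_lowered are hypothetical, and unlike the real f4 the stand-in callee returns a value so that there is something to observe. Treat this as a sketch of the effect, not of what the compiler does internally.

```cpp
struct F4 { float x0, x1, x2, x3; };

// Hypothetical callee as it effectively looks on Windows x64: it receives the
// address of a copy made by the caller, not the 16 bytes themselves.
inline float f4_lowered(const F4* arg) {
    return arg->x0 + arg->x1 + arg->x2 + arg->x3;
}

// Hypothetical caller mirroring the steps in MSVC's g4.
inline int g4_lowered(int x) {
    alignas(16) F4 tmp{};  // xorps + movdqa: a zeroed 16-byte temporary on the stack
    f4_lowered(&tmp);      // lea rcx, $T1[rsp]: the pointer is the first argument
    return x + 1;          // lea eax, [rbx+1]
}
```

The temporary is what the additional 16 bytes in sub rsp, 48 pay for.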
Clang has a different take on this:
g4(int):
    push rbx
    mov ebx, edi
    xorps xmm0, xmm0
    xorps xmm1, xmm1
    call f4(F4)@PLT
    inc ebx
    mov eax, ebx
    pop rbx
    ret
This is almost identical to the F1 case, except that we are now using two registers to pass our struct. This is surprising: our struct fully fits into a single XMM register. Why do we use two, of all things? Well, the System V ABI works on eight-byte chunks when considering structs. This struct is 16 bytes total, so it consists of two chunks, and each of them is passed in its own SSE register.
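The eight-byte chunking can be sketched for the narrow case of a struct that contains nothing but floats. This is a deliberately simplified model: the function name is invented, and it assumes that enough XMM argument registers are still free, which the real classification algorithm also has to check.

```cpp
#include <cstddef>

static_assert(sizeof(float) == 4, "the sketch assumes 32-bit floats");

// Hypothetical helper: how many XMM registers carry a struct of `float_count`
// floats on System V, or 0 if the struct is passed in memory instead.
constexpr int sysv_xmm_registers_for_floats(std::size_t float_count) {
    const std::size_t size = float_count * sizeof(float);
    if (size > 16) {
        return 0;  // more than two eight-byte chunks: passed via the stack
    }
    return static_cast<int>((size + 7) / 8);  // one SSE register per eight-byte chunk
}

static_assert(sysv_xmm_registers_for_floats(1) == 1, "F1 uses xmm0");
static_assert(sysv_xmm_registers_for_floats(4) == 2, "F4 uses xmm0 and xmm1");
```

Under this model a fifth float would push the struct past 16 bytes and out of registers entirely.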
Closing thoughts
As you can see, neither calling convention ends up putting the four floats into a single register, even though an x64 machine is always going to have 128-bit SSE registers available. Structs with integers in them would not be able to use XMM registers anyway and would end up going through the stack or being passed in 64-bit integer registers (if <= 128 bits on System V, and if <= 64 bits on Windows).
What could we do differently?
- For this specific case on System V, you can use __m128 directly. System V special-cases __m128.
- For this specific case on Windows x64, you can use __m128 and mark your function as __vectorcall. This is an alternative calling convention that is available on x64, which will pass __m128 in registers. Without it, the value will go via the stack. (In pre-x64 days, lots of different conventions existed, but now it’s just two on Windows.)
- Inline the function aggressively.
- Ignore the calling convention and write the assembly out manually.
The last point should be approached with a lot of caution. It is clear that every function call is a data-passing bottleneck and you can probably do better in every single case, but that is probably only worth it in hotspots. You could also apply it everywhere at scale, but then you are paying for it. For example, the Go compiler on Windows generates code that is not compatible with the Windows x64 calling convention. To my knowledge, this was done to be able to efficiently implement their version of coroutines, which are a central feature of the language. The cost of that is that typical stack-unwinding code stops working for Go code, and entire classes of tools stop working (e.g. anything based on Event Tracing for Windows (ETW)). The C# runtime CoreCLR also side-steps the default ABI (on ARM64) for commonly used write barriers.
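Coming back to the first two suggestions in the list, here is a sketch of what the __m128 variant could look like. The macro VEC_CALL and the function sum4 are made up for this example, and the code assumes an x86 or x64 target where xmmintrin.h is available.

```cpp
#include <xmmintrin.h>  // SSE intrinsics, x86/x64 targets only

// __vectorcall is only understood by MSVC-compatible compilers. Elsewhere the
// macro expands to nothing, because System V passes __m128 in a register anyway.
#if defined(_MSC_VER)
#define VEC_CALL __vectorcall
#else
#define VEC_CALL
#endif

// Hypothetical example function: takes all four floats as one __m128 argument
// and returns the sum of its lanes.
inline float VEC_CALL sum4(__m128 v) {
    float lanes[4];
    _mm_storeu_ps(lanes, v);  // unaligned store, so lanes needs no special alignment
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

A call such as sum4(_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f)) then hands over the whole vector as a single argument. Whether it really ends up in one XMM register is best confirmed by looking at the generated assembly for your compiler and target.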