Passing stuff by value and ABIs | Sebastian Schöner

Hey! You can find me on Mastodon and Bluesky!

I have in the past had this very romantic belief that when you pass a struct by value the compiler will lovingly select the right combination of registers to carefully craft the perfect packing of your arguments, for maximum efficiency. Unfortunately, that is not how reality works. This post is a short introduction to calling conventions with examples.

Here is the code we are going to look at. We’ve got structs, we’ve got functions that take those structs as parameters, and we’ve got functions calling those other functions. The only noteworthy thing is that the outer functions (g1, g4) also do some work that does not exist in the inner functions, so the compiler actually has to emit a call to the inner functions (instead of just jumping to them):

struct F1 {
    float x0;
};

struct F4 {
    float x0;
    float x1;
    float x2;
    float x3;
};

void f1(F1);
void f4(F4);

int g1(int x) { f1({}); x++; return x; }
int g4(int x) { f4({}); x++; return x; }

Let’s look at the codegen for both MSVC (19.43) and Clang (20.1.0) using Godbolt (link), both with full optimizations (-O2, /O3).

The F1 struct

Before we can look at passing the F4 struct, let’s understand the case of the smaller struct. For g1 on MSVC we find this:

$T1 = 48
int g1(int) PROC
    push    rbx
    sub     rsp, 32
    xor     eax, eax
    mov     ebx, ecx
    mov     ecx, eax
    mov     DWORD PTR $T1[rsp], eax
    call    void f1(F1)
    lea     eax, DWORD PTR [rbx+1]
    add     rsp, 32
    pop     rbx
    ret     0
int g1(int) ENDP

Clang produces this somewhat shorter version:

g1(int):
    push    rbx
    mov     ebx, edi
    xorps   xmm0, xmm0
    call    f1(F1)@PLT
    inc     ebx
    mov     eax, ebx
    pop     rbx
    ret

“Aha! Clang good, MSVC bad”, but no, it is not that simple. What we see here is not so much a difference in the quality of the compiler but the difference in what we are targeting: MSVC assumes we are compiling for x64 Windows, and clang assumes that this is probably some x64 Linux target, and it turns out that how you call a function and pass parameters to them is different between them: The platforms have a different ABI with different “calling conventions” (how you call a function). So just because you can write some sequence of valid machine instructions does not mean that this satisfies all of the assumptions and requirements of the underlying operating system.

What is MSVC doing?

Let’s look a little bit more closely at the code produced by MSVC:

MSVC starts off by pushing rbx to the stack. At the end of the function, rbx is popped off the stack again, so we are effectively “saving” the original value of rbx to the stack and restore it at the end of the function.
Then we allocate 32 bytes of stack space (sub rsp, 32), which we release again at the end of the function (add rsp, 32). Note that we don’t touch much of the stack space we allocate.
Then we set eax to zero (xor eax, eax), move ecx into ebx (mov ebx, ecx), and put eax into ecx. This sequence makes sense if you understand the calling convention: Windows x64 mandates that the first argument to a function sits in ecx. So the initial use of ecx is to hold the parameter x, which we put into ebx. Then we put zero into ecx before calling f1, because we call f1 with a zero-initialized struct.
Then we also write zero to the stack mov DWORD PTR $T1[rsp], eax. The syntax is a bit funky but essentially writes to rsp + 48, which is outside of the stack space we just allocated. Weird!
After the call, we add one to rbx and store the result in eax (lea eax, DWORD PTR [rbx+1]). It is very common to use lea (“Load Effective Address”) to perform computation, and while it might look like we are reading from memory and use a pointer (square brackets! DWORD PTR!), none of this is happening. That’s just there because the original intention for lea is to compute offsets from a pointer. All of this makes sense once you understand that the calling convention prescribes that integer return values be in eax, and g1 returns x + 1.

This leaves three questions:

What’s up with rbx? If you look closely, g1 puts its argument into ebx before calling f1, and then once f1 is done we can just assume that rbx (the 64bit version of ebx) still contains the value we moved into it before the call. That only works because the calling convention guarantees that a rbx must be the same after a function call. This also explains why g1 pushes rbx to the stack and pops it off again at the end: g1 must abide by the ABI as well and must guarantee to callers that on function exit rbx is unmodified. In Windows x64 ABI terminology “rbx is a non-volatile register”. There are also volatile registers, where all bets are off after a function call.
Why do we allocate 32 bytes of stack space? then write just a little bit of zero to it? On Windows x64, any function you call may assume that before it has been called, the caller has allocated enough stack space for it to be able to write all of its arguments to the stack without allocating new stack space. The minimum amount that is ever allocated is 32 bytes, which corresponds to 4 arguments passed in 64bit registers.
Why do we write to the stack outside of the stack allocation? As just explained, every function is guaranteed to have enough space on the stack right before it is called to store at least four 8 byte values (so it can save its arguments). The particular slot this write is targeting is that of the first argument. So at least we know where we are writing to. I do not have a good answer for why this write is happening, and I would attribute it to bad codegen. Notably this write disappears when you do f1({1}) instead, or when you make F1 contain an integer instead of a float.

What we have not yet touched upon is that the struct F1 contains a float, but MSVC decided to put it into an integer register (rcx). While the Windows x64 ABI clearly states that you need to pass floating point arguments in XMM registers, this does not apply here: We pass a struct containing a float, not the float itself, and that struct just happens to fit into a register.

Relevant reading are the MSDN pages on x64 calling conventions and on stack usage.

What is clang doing?

We will move a bit quicker here. This code is compiled for Linux x64, and Linux x64 uses the SystemV ABI. By coincidence, rbx is also non-volatile there, which is why it is used in the same way as on MSVC. The argument for f1 is passed in XMM0. The rules for determining which arguments are passed and how are more complicated on System V but in this case result in using an SSE register.

The System V ABI is documented in the PDF on this repository (it took me multiple unsuccessful attempts to understand it, it is less straight-forward than the Windows x64 ABI).

The F4 struct

Now for the larger struct. Here is what MSVC is doing:

int g4(int) PROC
    push    rbx
    sub     rsp, 48
    mov     ebx, ecx
    xorps   xmm0, xmm0
    lea     rcx, QWORD PTR $T1[rsp]
    movdqa  XMMWORD PTR $T1[rsp], xmm0
    call    void f4(F4)
    lea     eax, DWORD PTR [rbx+1]
    add     rsp, 48
    pop     rbx
    ret     0
int g4(int) ENDP

Most of this is identical to F1. The difference is that it is clearing XMM0 (xorps xmm0, xmm0), then writes it to the stack, and loads the address of the value we wrote to the stack into rcx. It is impossible to tell from this specific callsite alone, but this is what is happening: The calling convention does not allow you to pass types larger than 64bit in a register. Our type is 128bit, which would fit into XMM0, but by convention we can’t use that. We instead have to put the argument onto the stack (that’s where the 16 extra bytes of stack allocation come from) and then load its address into the first argument rcx and pass the pointer to the function.

Clang has a different take on this:

g4(int):
    push    rbx
    mov     ebx, edi
    xorps   xmm0, xmm0
    xorps   xmm1, xmm1
    call    f4(F4)@PLT
    inc     ebx
    mov     eax, ebx
    pop     rbx
    ret

This is almost identical to the F1 case, except that we are now using two registers to pass our struct. This is surprising: Our struct fully fits into a single XMM register. Why do we use two, of all things? Well, the System V ABI works on eight byte chunks when considering structs. This struct is 16 bytes total, and both of the chunks can be passed in an SSE register.

Closing thoughts

As you can see, neither calling convention ends up putting the four floats into a single register, even though a x64 machine is always going to have 128 bit SSE registers available. Structs with integers in them would not be able to use XMM registers anyway and would end up going through the stack or split across 64bit registers (if <= 128 bits on System V, and if <= 64 bits on Windows).

What could we do different?

For this specific case on System V, you can use __m128 directly. System V special-cases __m128.
For this specific case on Windows x64, you can use __m128 and mark your function as __vectorcall. This is an alternative calling convention that is available on x64, which will pass __m128 in registers. Otherwise it will go via the stack. (In pre-x64 days, lots of different conventions existed, but now it’s just two on Windows.)
Inline the function aggressively.
Ignore the calling convention and write the assembly out manually.

The last point should be approached with a lot of caution. It is clear that every function call is a data-passing bottleneck and you can probably do better in every single case, but that is probably only worth it in hotspots – applied everywhere at scale, but then you are paying for it. For example, the Go compiler on Windows generates code that is not compatible with the Windows x64 calling convention. To my knowledge, this was done to be able to efficiently implement their version of coroutines, which are a central feature of the language. The cost of that is that typical stack-unwinding code stops working for Go-code, and entirely classes of tools stop working (e.g. anything based on Event-Tracing for Windows (ETW)). The C# runtime CoreCLR is also side-stepping the default ABI (on ARM64) for commonly used writes barriers.