Hey! You can find me on Mastodon and Bluesky!

Continuing from last week’s post on what x86 assembly the C# compiler produces for loops, this post will take a quick look at how said compiler deals with return value optimizations for structs. As before, I am using SharpLab to look at the x86-32 assembly output of the JIT compiler (in Release mode). As a side note, for all I know SharpLab is still using the legacy .NET jit compiler, not RyuJIT.

Return Value Optimization

The return value optimization is originally a term from the C++ community and refers to the situation where a potentially large return value from a function should not be copied to the callsite, but rather directly constructed there. Let’s illuminate this with a quick example from C#:

struct LargeStruct { int a, b, c, d; }
public static LargeStruct Make() {
  return new LargeStruct();
}

public static void Use() {
  // We make a potential copy of the temporary value produced by
  // Make on the rhs here and store that copy on the lhs 
  LargeStruct ls = Make();
}

LargeStruct is, well, a large struct. Its size is much too large to be passed in registers, so there needs to be another way to return the constructed instance to the caller Use of Make. Usually, this is implemented as follows (morally):

public static void Make(out LargeStruct ls) {
  ls = new LargeStruct()
}

public static void Use() {
  LargeStruct ls;
  Use(out ls);
}

This is essentially what return value optimization is about: Instead of constructing an object, returning it, and copying it to the final target location, the object should be constructed in the final location to begin with. In C++, this is a rather important optimization because copying of objects could trigger relatively expensive operations like memory allocations (e.g., copying a vector will copy the whole array it is backed by). In C#, it is certainly less crucial but could still lead to more efficient code¹.

RVO for structs of different sizes

Let’s look at the ASM produced for different struct sizes and whether the JIT compiler actually performs RVO. You can follow the code here.

The structs considered have the form

struct Xn { public int a_1, /* ... */ a_n; }

with X0 denoting a member-less struct (which for aliasing reasons has size 1, not 0). The function under consideration is

public static Xn Mn() {
  return new Xn();
}

For n = 0, the output is already … adventurous:

C.M0()
  L0000: push eax              ; this is better thought of as sub ESP, 4
  L0001: xor eax, eax          ; set EAX to zero
  L0003: mov [esp], eax        ; put it on the stack
  L0006: lea eax, [esp]        ; load the address of what we just put onto the stack
  L0009: mov byte [eax], 0x0   ; set the lowest byte to zero
  L000c: movsx eax, byte [esp] ; ...and move it back into EAX.
  L0010: pop ecx               ; free stack space
  L0011: ret

Because the struct still fits in a register, the compiler opts to return it in EAX, the default register for returning non-floating point values. Since the size of the struct is technically just a single byte, the compiler also makes sure to only set the lowest byte to zero. This is the sane part of these few instructions. Everything else is absolute insanity: The code above is, for all intents and purposes, equivalent to xor eax, eax; ret (given that ECX is volatile, the caller should not care about its value after the function call at all, hence we can ignore it).

When the RVO is applied manually, everything works out just as you would imagine for n = 0:

public static void M0(ref X0 x) {
  x = new X0();
}

C.M0(X0 ByRef)
  L0000: mov byte [ecx], 0x0 // note that the first argument is passed in ECX
  L0003: ret

For n = 1, the generated assembly code for M1 is as expected:

C.M1()
  L0000: xor eax, eax
  L0002: ret

For 2 <= n <= 4, the generated code looks like this (here for n = 4):

C.M4()
  L0000: push edi           ; EDI and ESI need to be saved
  L0001: push esi
  L0002: xor eax, eax
  L0004: xor edx, edx
  L0006: xor esi, esi
  L0008: xor edi, edi
  L000a: mov [ecx], eax     ; Sets the struct to zero
  L000c: mov [ecx+0x4], edx
  L000f: mov [ecx+0x8], esi
  L0012: mov [ecx+0xc], edi
  L0015: pop esi
  L0016: pop edi
  L0017: ret

For these values, the compiler applies RVO as you would hope: The function receives a reference as its first argument (in ECX) and the method constructs the value there immediately. Still, the way it is using the registers is suboptimal, to say the least.

Manually switching to by-reference makes the code much more compact:

C.R4(X4 ByRef)
  L0000: push edi
  L0001: mov edi, ecx
  L0003: xorps xmm0, xmm0     ; this is essentially XOR xmm0, xmm0
  L0006: movq [edi], xmm0     ; move a quadword = 64bits, thus zeroing
  L000a: movq [edi+0x8], xmm0 ; all bits of the struct
  L000f: pop edi
  L0010: ret

This is using an 64bit SSE to quickly set the struct to all zeroes. Neat! (I am not sure why it chose to move ECX to EDI beforehand; it is unnecessary here.)

On to n = 5. It is here where things break down:

C.M5()
  L0000: push edi
  L0001: push esi
  L0002: sub esp, 0x14        ; (1): allocate space for the struct on the stack
                              ;      and set it to zero (see next comment)
  L0005: mov esi, ecx
  L0007: lea edi, [esp]
  L000a: mov ecx, 0x5
  L000f: xor eax, eax
  L0011: rep stosd            ; set all 5 fields to zero, equivalent to
                              ; mov [edi], eax
                              ; add edi, 0x4
                              ; dec edx
  L0013: mov ecx, esi         
  L0015: mov edx, ecx         ; (2): set the same (!) memory to zero AGAIN
  L0017: lea edi, [esp]       ;      using SSE registers+stosd
  L001a: xor eax, eax
  L001c: xorps xmm0, xmm0     
  L001f: movq [edi], xmm0
  L0023: movq [edi+0x8], xmm0
  L0028: add edi, 0x10
  L002b: stosd
  L002c: mov edi, edx
  L002e: lea esi, [esp]       ; (3): copy zeroed out memory to output location
  L0031: movq xmm0, [esi]     ; copy fields 1, 2
  L0035: movq [edi], xmm0
  L0039: movq xmm0, [esi+0x8] ; copy fields 3, 4
  L003e: movq [edi+0x8], xmm0
  L0043: add esi, 0x10
  L0046: add edi, 0x10
  L0049: movsd                ; copy last field
  L004a: add esp, 0x14
  L004d: pop esi
  L004e: pop edi
  L004f: ret

Ok, this definitely isn’t using RVO anymore. First (1), the struct-instance is created on the local stack space and set to zero. Then (2) it is set to zero again. This happens most likely because memory is initialized to all zeroes by default in .NET, and the default constructor of the struct doesn’t do any more than setting the memory to zero again (if you call a non-default ctor and look at the ASM output, you will see the zero initialization followed by a ctor call). This redundancy is a bit silly, because struct ctors in C# are required to explicitly set all fields of the struct anyway². Finally (3), the initialized instance is moved to the output location in the caller’s stack frame. Unfortunately, all of this also happens when M5 is inlined (see function F5 in the SharpLab entry linked above). I find it funny that the compiler first uses a rep stosd (repeat store string double word) to set everything to zero, then does a clever thing using SSE registers only to fallback to stosd again for the last field.

For comparison, here is the by-reference version of this call:

C.R5(X5 ByRef)
  L0000: push edi
  L0001: mov edi, ecx
  L0003: xor eax, eax
  L0005: xorps xmm0, xmm0
  L0008: movq [edi], xmm0
  L000c: movq [edi+0x8], xmm0
  L0011: add edi, 0x10
  L0014: stosd
  L0015: pop edi
  L0016: ret

This produces the following more pleasant result³:

public static void M5User() {
  X5 x;
  R5(out x);
}

C.M5User()
  L0000: push edi
  L0001: sub esp, 0x14
  L0004: lea edi, [esp]
  L0007: mov ecx, 0x5
  L000c: xor eax, eax
  L000e: rep stosd
  L0010: lea ecx, [esp]
  L0013: call C.R5(X5 ByRef)
  L0018: add esp, 0x14
  L001b: pop edi
  L001c: ret

For n > 5, the situation doesn’t really change all that much. From n > 7, the SSE registers aren’t used for moving anymore (but the initialization is still using them), and from n = 11 and beyond moving and initialization is implemented using stosd/movsd exclusively.

So, what to make from all of this? Thankfully, the compiler is optimized for the common cases (1 <= n <= 4). Larger structs are for some reason not constructed in place but copied around unnecessarily, possibly because the compiler fails to eliminate the memory initialization before the constructor call. Empty structs are probably used some times, but I won’t blame anyone for not considering that case to be important enough to be optimized. What I found surprising is that for n >= 5 even inlining does not eliminate the copying that RVO should prevent.

Please note that these results do not apply to the outdated Mono C# compiler used in Unity; that one is another thing entirely. I have looked at some of the assembly produced by that and it is not pretty either; I might write about that next week :)

Whether the speed-up you could get from applying RVO to C# code is relevant is of course an entirely different matter. ↩
C# is of course not the only language that is compiled to MSIL, so the argument that C# requires you to set all fields in a struct ctor doesn’t really apply, unfortunately, because other languages could still do all sorts of things. ↩
In this case, inlining of M5 was specifically prevented. In SharpLab.io, this is accomplished by writing the caller below the callee. ↩

C# compilers, RVO, and struct initialization

A Case Study

Return Value Optimization

RVO for structs of different sizes