Continuing from last week’s post on what x86 assembly the C# compiler produces for loops, this post will take a quick look at how said compiler deals with return value optimizations for structs. As before, I am using SharpLab to look at the x86-32 assembly output of the JIT compiler (in Release mode). As a side note, for all I know SharpLab is still using the legacy .NET jit compiler, not RyuJIT.
Return Value Optimization
The return value optimization is originally a term from the C++ community and refers to the situation where a potentially large return value from a function should not be copied to the callsite, but rather directly constructed there. Let’s illuminate this with a quick example from C#:
struct LargeStruct { int a, b, c, d; }
public static LargeStruct Make() {
return new LargeStruct();
}
public static void Use() {
// We make a potential copy of the temporary value produced by
// Make on the rhs here and store that copy on the lhs
LargeStruct ls = Make();
}
LargeStruct
is, well, a large struct. Its size is much too large to be passed in registers, so there needs to be another way to return the constructed instance to the caller Use
of Make
. Usually, this is implemented as follows (morally):
public static void Make(out LargeStruct ls) {
ls = new LargeStruct()
}
public static void Use() {
LargeStruct ls;
Use(out ls);
}
This is essentially what return value optimization is about: Instead of constructing an object, returning it, and copying it to the final target location, the object should be constructed in the final location to begin with. In C++, this is a rather important optimization because copying of objects could trigger relatively expensive operations like memory allocations (e.g., copying a vector will copy the whole array it is backed by). In C#, it is certainly less crucial but could still lead to more efficient code1.
RVO for structs of different sizes
Let’s look at the ASM produced for different struct sizes and whether the JIT compiler actually performs RVO. You can follow the code here.
The structs considered have the form
struct Xn { public int a_1, /* ... */ a_n; }
with X0
denoting a member-less struct (which for aliasing reasons has size 1, not 0). The function under consideration is
public static Xn Mn() {
return new Xn();
}
For n = 0
, the output is already … adventurous:
C.M0()
L0000: push eax ; this is better thought of as sub ESP, 4
L0001: xor eax, eax ; set EAX to zero
L0003: mov [esp], eax ; put it on the stack
L0006: lea eax, [esp] ; load the address of what we just put onto the stack
L0009: mov byte [eax], 0x0 ; set the lowest byte to zero
L000c: movsx eax, byte [esp] ; ...and move it back into EAX.
L0010: pop ecx ; free stack space
L0011: ret
Because the struct still fits in a register, the compiler opts to return it in EAX, the default register for returning non-floating point values. Since the size of the struct is technically just a single byte, the compiler also makes sure to only set the lowest byte to zero. This is the sane part of these few instructions. Everything else is absolute insanity: The code above is, for all intents and purposes, equivalent to xor eax, eax; ret
(given that ECX is volatile, the caller should not care about its value after the function call at all, hence we can ignore it).
When the RVO is applied manually, everything works out just as you would imagine for n = 0
:
public static void M0(ref X0 x) {
x = new X0();
}
C.M0(X0 ByRef)
L0000: mov byte [ecx], 0x0 // note that the first argument is passed in ECX
L0003: ret
For n = 1
, the generated assembly code for M1
is as expected:
C.M1()
L0000: xor eax, eax
L0002: ret
For 2 <= n <= 4
, the generated code looks like this (here for n = 4
):
C.M4()
L0000: push edi ; EDI and ESI need to be saved
L0001: push esi
L0002: xor eax, eax
L0004: xor edx, edx
L0006: xor esi, esi
L0008: xor edi, edi
L000a: mov [ecx], eax ; Sets the struct to zero
L000c: mov [ecx+0x4], edx
L000f: mov [ecx+0x8], esi
L0012: mov [ecx+0xc], edi
L0015: pop esi
L0016: pop edi
L0017: ret
For these values, the compiler applies RVO as you would hope: The function receives a reference as its first argument (in ECX) and the method constructs the value there immediately. Still, the way it is using the registers is suboptimal, to say the least.
Manually switching to by-reference makes the code much more compact:
C.R4(X4 ByRef)
L0000: push edi
L0001: mov edi, ecx
L0003: xorps xmm0, xmm0 ; this is essentially XOR xmm0, xmm0
L0006: movq [edi], xmm0 ; move a quadword = 64bits, thus zeroing
L000a: movq [edi+0x8], xmm0 ; all bits of the struct
L000f: pop edi
L0010: ret
This is using an 64bit SSE to quickly set the struct to all zeroes. Neat! (I am not sure why it chose to move ECX to EDI beforehand; it is unnecessary here.)
On to n = 5
. It is here where things break down:
C.M5()
L0000: push edi
L0001: push esi
L0002: sub esp, 0x14 ; (1): allocate space for the struct on the stack
; and set it to zero (see next comment)
L0005: mov esi, ecx
L0007: lea edi, [esp]
L000a: mov ecx, 0x5
L000f: xor eax, eax
L0011: rep stosd ; set all 5 fields to zero, equivalent to
; mov [edi], eax
; add edi, 0x4
; dec edx
L0013: mov ecx, esi
L0015: mov edx, ecx ; (2): set the same (!) memory to zero AGAIN
L0017: lea edi, [esp] ; using SSE registers+stosd
L001a: xor eax, eax
L001c: xorps xmm0, xmm0
L001f: movq [edi], xmm0
L0023: movq [edi+0x8], xmm0
L0028: add edi, 0x10
L002b: stosd
L002c: mov edi, edx
L002e: lea esi, [esp] ; (3): copy zeroed out memory to output location
L0031: movq xmm0, [esi] ; copy fields 1, 2
L0035: movq [edi], xmm0
L0039: movq xmm0, [esi+0x8] ; copy fields 3, 4
L003e: movq [edi+0x8], xmm0
L0043: add esi, 0x10
L0046: add edi, 0x10
L0049: movsd ; copy last field
L004a: add esp, 0x14
L004d: pop esi
L004e: pop edi
L004f: ret
Ok, this definitely isn’t using RVO anymore. First (1), the struct-instance is created on the local stack space and set to zero. Then (2) it is set to zero again. This happens most likely because memory is initialized to all zeroes by default in .NET, and the default constructor of the struct doesn’t do any more than setting the memory to zero again (if you call a non-default ctor and look at the ASM output, you will see the zero initialization followed by a ctor call). This redundancy is a bit silly, because struct ctors in C# are required to explicitly set all fields of the struct anyway2. Finally (3), the initialized instance is moved to the output location in the caller’s stack frame. Unfortunately, all of this also happens when M5
is inlined (see function F5
in the SharpLab entry linked above).
I find it funny that the compiler first uses a rep stosd
(repeat store string double word) to set everything to zero, then does a clever thing using SSE registers only to fallback to stosd
again for the last field.
For comparison, here is the by-reference version of this call:
C.R5(X5 ByRef)
L0000: push edi
L0001: mov edi, ecx
L0003: xor eax, eax
L0005: xorps xmm0, xmm0
L0008: movq [edi], xmm0
L000c: movq [edi+0x8], xmm0
L0011: add edi, 0x10
L0014: stosd
L0015: pop edi
L0016: ret
This produces the following more pleasant result3:
public static void M5User() {
X5 x;
R5(out x);
}
C.M5User()
L0000: push edi
L0001: sub esp, 0x14
L0004: lea edi, [esp]
L0007: mov ecx, 0x5
L000c: xor eax, eax
L000e: rep stosd
L0010: lea ecx, [esp]
L0013: call C.R5(X5 ByRef)
L0018: add esp, 0x14
L001b: pop edi
L001c: ret
For n > 5
, the situation doesn’t really change all that much. From n > 7
, the SSE registers aren’t used for moving anymore (but the initialization is still using them), and from n = 11
and beyond moving and initialization is implemented using stosd/movsd
exclusively.
So, what to make from all of this? Thankfully, the compiler is optimized for the common cases (1 <= n <= 4
). Larger structs are for some reason not constructed in place but copied around unnecessarily, possibly because the compiler fails to eliminate the memory initialization before the constructor call. Empty structs are probably used some times, but I won’t blame anyone for not considering that case to be important enough to be optimized. What I found surprising is that for n >= 5
even inlining does not eliminate the copying that RVO should prevent.
Please note that these results do not apply to the outdated Mono C# compiler used in Unity; that one is another thing entirely. I have looked at some of the assembly produced by that and it is not pretty either; I might write about that next week :)
-
Whether the speed-up you could get from applying RVO to C# code is relevant is of course an entirely different matter. ↩
-
C# is of course not the only language that is compiled to MSIL, so the argument that C# requires you to set all fields in a struct ctor doesn’t really apply, unfortunately, because other languages could still do all sorts of things. ↩
-
In this case, inlining of
M5
was specifically prevented. In SharpLab.io, this is accomplished by writing the caller below the callee. ↩