I was convinced I had a post somewhere where I compare Mono’s codegen for C# as used in Unity between Release and Debug. I was looking for it, so I could send it to a friend, but did not find it. Let’s fill that gap.
As we have established before, Mono is not exactly optimized for competitive codegen, because Mono’s main job is compatibility. An earlier post shows how Mono in particular struggles with value types, even when set to Release. Using ASM Explorer, a package that I created a long time ago, you can look at the generated assembly quite easily.
When you make a build with Unity, you can use either Mono or IL2CPP, the latter of which translates the IL into C++. However, the Unity editor always runs on Mono, and it is very common to run it in Debug (for example because you want to debug your C# code in the editor). In Debug, Unity puts all C# code into Debug, not just specific pieces of code.
For today, we are going to look at this function:
public static class Adder
{
    public static int Add(int a, int b)
    {
        return a + b;
    }
}
Let us take a look at the generated code in Release mode:
000001846cde4070 48 83 ec 18 sub rsp, 0x18
000001846cde4074 48 89 0c 24 mov [rsp], rcx
000001846cde4078 48 89 54 24 08 mov [rsp+0x8], rdx
000001846cde407d 48 8b c1 mov rax, rcx
000001846cde4080 03 44 24 08 add eax, [rsp+0x8]
000001846cde4084 48 83 c4 18 add rsp, 0x18
000001846cde4088 c3 ret
This is already not great. The arguments “spill” on the stack, which might be a consequence of compiling IL opcodes one-by-one (IL is entirely stack based). But the core of the function is at least visible: it is adding two numbers.
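To make the "stack based" part concrete, here is a small C++ sketch of how a naive evaluator would walk through the IL for Add. I am assuming the IL is the usual ldarg.0, ldarg.1, add, ret sequence. The Op enum and the run function are made up for illustration and say nothing about how Mono’s JIT is actually implemented.

```cpp
#include <cassert>
#include <vector>

// Hypothetical mini instruction set modelling the four IL opcodes of Add.
enum class Op { LdArg0, LdArg1, Add, Ret };

// Evaluates the opcodes one by one against an explicit evaluation stack.
int run(const std::vector<Op>& code, int arg0, int arg1) {
    std::vector<int> stack;  // stands in for the IL evaluation stack
    for (Op op : code) {
        switch (op) {
            case Op::LdArg0: stack.push_back(arg0); break;  // push first argument
            case Op::LdArg1: stack.push_back(arg1); break;  // push second argument
            case Op::Add: {
                int rhs = stack.back(); stack.pop_back();
                int lhs = stack.back(); stack.pop_back();
                stack.push_back(lhs + rhs);                 // pop two, push the sum
                break;
            }
            case Op::Ret:
                return stack.back();                        // result is on top
        }
    }
    return 0;  // not reached for well-formed input
}

int add_via_il(int a, int b) {
    // Assumed IL for Add: ldarg.0, ldarg.1, add, ret
    static const std::vector<Op> code = {Op::LdArg0, Op::LdArg1, Op::Add, Op::Ret};
    return run(code, a, b);
}
```

Every value travels through memory on its way to the addition. A compiler that translates each opcode in isolation and does not clean up afterwards could plausibly end up with stores like the ones in the listing above.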
Here is the code when compiling this same code in Debug. Warning, explicit content ahead:
00000185f4a46b40 48 83 ec 28 sub rsp, 0x28
00000185f4a46b44 4c 89 3c 24 mov [rsp], r15
00000185f4a46b48 48 89 4c 24 18 mov [rsp+0x18], rcx
00000185f4a46b4d 48 89 54 24 20 mov [rsp+0x20], rdx
00000185f4a46b52 49 bb d8 54 18 6d fb 7f 00 00 mov r11, 0x7ffb6d1854d8
00000185f4a46b5c 4c 89 5c 24 10 mov [rsp+0x10], r11 ; write breakpoint trampoline
00000185f4a46b61 49 bb 30 54 18 6d fb 7f 00 00 mov r11, 0x7ffb6d185430
00000185f4a46b6b 4c 89 5c 24 08 mov [rsp+0x8], r11 ; write singlestep trampoline
00000185f4a46b70 45 33 ff xor r15d, r15d
00000185f4a46b73 41 bb 00 00 00 00 mov r11d, 0x0
00000185f4a46b79 4d 85 db test r11, r11
00000185f4a46b7c 74 08 jz 0xf4a46b86
00000185f4a46b7e 4c 8b 5c 24 08 mov r11, [rsp+0x8] ; read singlestep trampoline
00000185f4a46b83 41 ff 13 call qword [r11] ; check for singlestep
00000185f4a46b86 90 nop
00000185f4a46b87 4c 8b 5c 24 10 mov r11, [rsp+0x10] ; read breakpoint trampoline
00000185f4a46b8c 4d 8b 1b mov r11, [r11]
00000185f4a46b8f 4d 85 db test r11, r11
00000185f4a46b92 74 03 jz 0xf4a46b97
00000185f4a46b94 41 ff d3 call r11
00000185f4a46b97 41 bb 00 00 00 00 mov r11d, 0x0
00000185f4a46b9d 4d 85 db test r11, r11
00000185f4a46ba0 74 08 jz 0xf4a46baa
00000185f4a46ba2 4c 8b 5c 24 08 mov r11, [rsp+0x8] ; read singlestep trampoline
00000185f4a46ba7 41 ff 13 call qword [r11] ; check for singlestep
00000185f4a46baa 90 nop
00000185f4a46bab 41 bb 00 00 00 00 mov r11d, 0x0
00000185f4a46bb1 4d 85 db test r11, r11
00000185f4a46bb4 74 08 jz 0xf4a46bbe
00000185f4a46bb6 4c 8b 5c 24 08 mov r11, [rsp+0x8] ; read singlestep trampoline
00000185f4a46bbb 41 ff 13 call qword [r11] ; check for singlestep
00000185f4a46bbe 90 nop
00000185f4a46bbf 48 63 44 24 18 movsxd rax, dword [rsp+0x18]
00000185f4a46bc4 48 63 4c 24 20 movsxd rcx, dword [rsp+0x20]
00000185f4a46bc9 03 c1 add eax, ecx
00000185f4a46bcb 4c 8b f8 mov r15, rax
00000185f4a46bce 41 bb 00 00 00 00 mov r11d, 0x0
00000185f4a46bd4 4d 85 db test r11, r11
00000185f4a46bd7 74 08 jz 0xf4a46be1
00000185f4a46bd9 4c 8b 5c 24 08 mov r11, [rsp+0x8] ; read singlestep trampoline
00000185f4a46bde 41 ff 13 call qword [r11] ; check for singlestep
00000185f4a46be1 90 nop
00000185f4a46be2 49 8b c7 mov rax, r15
00000185f4a46be5 41 bb 00 00 00 00 mov r11d, 0x0
00000185f4a46beb 4d 85 db test r11, r11
00000185f4a46bee 74 08 jz 0xf4a46bf8
00000185f4a46bf0 4c 8b 5c 24 08 mov r11, [rsp+0x8] ; read singlestep trampoline
00000185f4a46bf5 41 ff 13 call qword [r11] ; check for singlestep
00000185f4a46bf8 90 nop
00000185f4a46bf9 4c 8b 3c 24 mov r15, [rsp]
00000185f4a46bfd 48 83 c4 28 add rsp, 0x28
00000185f4a46c01 c3 ret
Wow, what happened here? Can you still find the actual work this function is performing?
Here is what is happening, as far as I understand it: In contrast to a native debugger, you do not debug Mono by carefully placing int 3 breakpoints. No, it requires a little more cooperation. At every sequence point in the function (so basically between every IL opcode, and at function start and end), we check whether there is a breakpoint there by checking R11 (test r11, r11) and then invoke special debug behavior through a function pointer. At the very start of the function, those two function pointers are set up. Then during execution the debugger would presumably patch the 4 zero bytes in the mov r11d, 0x0 instruction preceding the R11 check to indicate that it should break. There is an additional check for single-stepping. The labelling above and in ASM Explorer might be wrong: it may be mixing up the single-stepping trampoline with the breakpoint trampoline. But in the big picture, this is irrelevant.
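Here is a rough C++ sketch of the shape of that Debug listing. All names are hypothetical and the patched immediates are modelled as a plain array of flags; it mirrors the structure of the disassembly (five per-sequence-point checks plus one check that reads a pointer from memory), not Mono’s actual implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for whatever the debugger trampoline does.
static int g_hook_calls = 0;
static void debugger_hook() { ++g_hook_calls; }

// One flag per sequence point. In the listing this is the immediate of
// `mov r11d, 0x0`, which the debugger would presumably patch in place.
static std::array<bool, 5> g_break_at = {};

// A pointer that is read from memory and only called when it is non-null.
static void (*g_global_hook)() = nullptr;

static void sequence_point(std::size_t index) {
    if (g_break_at[index]) {  // test r11, r11 / jz
        debugger_hook();      // call through the trampoline
    }
}

int add_debug(int a, int b) {
    sequence_point(0);                   // function entry
    if (g_global_hook) g_global_hook();  // the single check that dereferences a pointer
    sequence_point(1);
    sequence_point(2);
    int result = a + b;                  // the actual work
    sequence_point(3);
    sequence_point(4);                   // just before returning
    return result;
}
```

With all flags cleared, add_debug(2, 3) simply returns 5 and no hook is called, but every call still pays for six branches that are never taken.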
Most of this debug code above will never execute, but that does not mean it is free: it takes active effort to ignore the code, there are tons of new jumps that will mess with branch prediction, etc. Clearly, even if “number of instructions” is a very bad proxy for performance, we can agree that this generated code is not great. Regardless, you can measure this impact: Release builds are just much faster across the board.
For completeness, let’s look at an optimized IL2CPP build as well. The code looks like this:
IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR int32_t Adder_Add_mCA6F2287A5D89D3050A3932750CB8CC867E0A172 (int32_t ___0_a, int32_t ___1_b, const RuntimeMethod* method)
{
    {
        int32_t L_0 = ___0_a;
        int32_t L_1 = ___1_b;
        return ((int32_t)il2cpp_codegen_add(L_0, L_1));
    }
}
with
template<typename T, typename U>
inline typename pick_bigger<T, U>::type il2cpp_codegen_add(T left, U right)
{
    return left + right;
}
and the compiled output is the very sensible
lea eax, [rcx, rdx]
retn
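The pick_bigger helper is not shown above. The sketch below is my guess at what such a trait could look like, assuming it simply selects whichever of the two types is larger; the real IL2CPP definition may well differ.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical reimplementation: picks whichever type has the larger size,
// preferring T when both are the same size.
template <typename T, typename U>
struct pick_bigger {
    typedef typename std::conditional<(sizeof(T) >= sizeof(U)), T, U>::type type;
};

// Same shape as the helper quoted above, built on the hypothetical trait.
template <typename T, typename U>
inline typename pick_bigger<T, U>::type il2cpp_codegen_add(T left, U right) {
    return left + right;
}

// For two int32_t operands the result type stays int32_t.
static_assert(std::is_same<pick_bigger<int32_t, int32_t>::type, int32_t>::value,
              "same-size operands keep the first type");
// Mixing int32_t and int64_t widens to int64_t.
static_assert(std::is_same<pick_bigger<int32_t, int64_t>::type, int64_t>::value,
              "mixed operands pick the wider type");
```

Under that assumption the template collapses to a plain int32_t addition for Add, which fits the two-instruction output above.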
For completeness’ sake, let’s also look at what modern dotnet on CoreCLR does to this and what MSVC does without optimizations. I am using SharpLab to look at the results. In Release, you get this with CoreCLR:
Adder.Add(Int32, Int32)
L0000: lea eax, [rcx+rdx]
L0003: ret
Debug is already quite a bit more verbose. Note that we again seem to be checking for some global flag.
Adder.Add(Int32, Int32)
L0000: push rbp
L0001: sub rsp, 0x30
L0005: lea rbp, [rsp+0x30]
L000a: xor eax, eax
L000c: mov [rbp-4], eax
L000f: mov [rbp+0x10], ecx
L0012: mov [rbp+0x18], edx
L0015: mov rax, 0x7ffd2f90c258
L001f: cmp dword ptr [rax], 0
L0022: je short L0029
L0024: call 0x00007ffd7f70a200
L0029: nop
L002a: mov eax, [rbp+0x10]
L002d: add eax, [rbp+0x18]
L0030: mov [rbp-4], eax
L0033: nop
L0034: mov eax, [rbp-4]
L0037: add rsp, 0x30
L003b: pop rbp
L003c: ret
(Addendum: My friend Alexandre Mutel points out that in addition to the better codegen, CoreCLR is also capable of compiling only your code in Debug, while all of the rest of the C# code in the application can enjoy full optimizations.)
Now, finally, MSVC with /Od:
int Add(int,int) PROC ; Add
mov DWORD PTR [rsp+16], edx
mov DWORD PTR [rsp+8], ecx
mov eax, DWORD PTR b$[rsp]
mov ecx, DWORD PTR a$[rsp]
add ecx, eax
mov eax, ecx
ret 0
In conclusion: For this particular trivial example, the optimized Release version produced by Mono is about the same as the un-optimized code generated by MSVC. The debug version of CoreCLR is still relying on checking some globals, but requires fewer such checks than Mono.