C# compilers and looping on arrays

The other day I was optimizing a bit of C# code and at some point I started to use SharpLab to take a look at the assembly that would be the result of compiling the code. Before we get to the meat of this post, note that all experiments were conducted with the default compiler on version 2.3.2 and the Release configuration, though none of the results changed when switching to the current master branch. You can follow the code for the examples here.

At some point, I entered this piece of code:

public class C {
  private int[] _array;
  public int SumForEach() {
    int sum = 0;
    // _array is an array of integers that is local to the class
    foreach (var x in _array) {
      sum += x;
    }
    return sum;
  }
}

It produces the following assembly output:

C.SumForEach()
  L0000: push ebp                 // prolog, register saving
  L0001: mov ebp, esp
  L0003: push edi
  L0004: push esi
  L0005: xor esi, esi             // sum variable
  L0007: mov ecx, [ecx+0x4]       // ecx is the pointer to the array object
  L000a: xor edx, edx             // i = 0
  L000c: mov edi, [ecx+0x4]       // edi is the length
  L000f: test edi, edi
  L0011: jle L001e                // skip if array is empty
  L0013: mov eax, [ecx+edx*4+0x8] // get next element
  L0017: add esi, eax             // do summation
  L0019: inc edx                  // i++
  L001a: cmp edi, edx
  L001c: jg L0013                 // jump back as long as length > i
  L001e: mov eax, esi             // move sum to output register
  L0020: pop esi                  // epilog
  L0021: pop edi
  L0022: pop ebp
  L0023: ret

This is what you would probably expect – just go over the array, sum the contents, return them. Let’s try another method of iteration:

public int SumFor() {
  int sum = 0;
  for (int i = 0; i < _array.Length; i++) {
      sum += _array[i];
  }
  return sum;
}

Its translation is slightly different:

C.SumFor()
  L0000: push ebp                 // prolog
  L0001: mov ebp, esp
  L0003: push edi
  L0004: push esi
  L0005: xor edi, edi             // sum = 0
  L0007: xor esi, esi             // i = 0
  L0009: mov edx, [ecx+0x4]       // get array
  L000c: mov ecx, [edx+0x4]       // get length
  L000f: test ecx, ecx
  L0011: jle L0023
  L0013: mov eax, [edx+0x4]       // get length again
  L0016: cmp esi, eax             
  L0018: jae L0029                // check that i is in bounds
  L001a: add edi, [edx+esi*4+0x8] 
  L001e: inc esi                  // i++
  L001f: cmp ecx, esi
  L0021: jg L0016                 // jump back as long as length > i
  L0023: mov eax, edi             // move sum to output register
  L0025: pop esi                  // epilog
  L0026: pop edi
  L0027: pop ebp
  L0028: ret
  L0029: call 0x739f1aa0          // raise out of bounds exception
  L002e: int3                     // trap to debugger

Notably, it contains code to raise an exception. In C#, this would be the same as:

int sum = 0;
int length = _array.Length;
for (int i = 0; i < length; i++) {
  if (i >= _array.Length) {
    throw new ArrayIndexOutOfBoundsException();
  }
  sum += _array[i];
}
return sum;

This means that it has an extra out-of-bounds check in each iteration. I am not saying that this additional branch will have a huge performance impact, because it will be correctly predicted almost all of the time, but it is curious that it turns up at all! In this specific instance, I would have expected that the compiler gets it right immediately, because the array cannot possibly be changed in this method (except for from different threads, but that shouldn’t be a concern since _array is not volatile). Also, making the array member readonly just does not help – it still fails to eliminate the out-of-bounds checks.

Finally, I tried a slight variation using a local variable:

public int SumForLocal() {
  int[] arr = _array;
  int sum = 0;
  for (int i = 0; i < arr.Length; i++) {
      sum += arr[i];
  }
  return sum;
}

This compiles to the shortest version yet:

C.SumForLocal()
  L0000: push ebp
  L0001: mov ebp, esp
  L0003: push esi                 // look Mum, no EDI needed!
  L0004: mov esi, [ecx+0x4]       // get array
  L0007: xor ecx, ecx             // sum = 0
  L0009: xor edx, edx             // i = 0
  L000b: mov eax, [esi+0x4]       // length is constant
  L000e: test eax, eax            
  L0010: jle L001b                // skip if array is empty
  L0012: add ecx, [esi+edx*4+0x8] // sum += arr[i]
  L0016: inc edx                  // i++
  L0017: cmp eax, edx
  L0019: jg L0012
  L001b: mov eax, ecx
  L001d: pop esi
  L001e: pop ebp
  L001f: ret

Finally, the compiler is able to eliminate the superfluous out-of-bounds checks. The resulting code could be shortened by exchanging the roles played by ECX and EAX such that the sum is already stored in EAX and better register placement to completely avoid the stack (whether this would be any faster would need to be profiled, obviously). I was aware that a common optimization in C# is to replace field accesses with cached local variables, but it didn’t occur to me that this is all the more important for arrays as it seems!

At first, I was convinced that these failures must be artifacts of the way SharpLab is using the compiler (e.g. not requesting optimizations etc.) but I checked all of this in Visual Studio and found the same problems. Similar problems are also present in the Mono version used by Unity (2017.3). In fact, I was very surprised that some (to humans) quite obvious if contrived cases are not properly optimized: Try changing the return type of all the example methods to void and remove the return statements. If you guessed that the compiler will now in every case produce an empty loop and just increase its loop counter, you guessed right!

In all honesty, I do not think that much of this makes a huge difference most of the time or that it is often a problem that the compiler does not optimize away empty loops, but it makes me wonder what else it is missing (copy elision is something I’d like to look into). To be fair, JIT compilers have a realtime requirement that most other compilers do not have, but shouldn’t all of the problems above be fixable on an IL level?