Part of the series: Performance Hacking .NET
How many times have you written a foreach
statement? Probably too many times to count. How many times have you worried about the performance of that statement? Probably never.
That. Changes. Now!
Behold these benchmarks:
[MemoryDiagnoser] public class EnumerationBenchmarks { private List<int> _List; private IEnumerable<int> _IEnumerable; [Params(10)] public int NumberOfItems { get; set; } [Params(1000, 5000)] public int NumberOfEnumerations { get; set; } [GlobalSetup] public void GlobalSetup() { _List = new List<int>(NumberOfItems); _IEnumerable = _List; for (int i = 0; i < NumberOfItems; i++) { _List.Add(i); } } [Benchmark] public long ForLoop() { long Total = 0; for (int c = 0; c < NumberOfEnumerations; c++) { for (int i = 0; i < _List.Count; i++) { Total += _List[i]; } } return Total; } [Benchmark] public long EnumerateList() { long Total = 0; for (int c = 0; c < NumberOfEnumerations; c++) { foreach (int i in _List) { Total += i; } } return Total; } [Benchmark] public long EnumerateIEnumerable() { long Total = 0; for (int c = 0; c < NumberOfEnumerations; c++) { foreach (int i in _IEnumerable) { Total += i; } } return Total; } }
See anything in there that would scream performance issue? Look carefully, I dare you.
Here’s the results:
Method | NumberOfItems | NumberOfEnumerations | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|
ForLoop | 10 | 1000 | 10.37 us | 0.136 us | 0.121 us | – | – | – | – |
EnumerateList | 10 | 1000 | 20.51 us | 0.174 us | 0.154 us | – | – | – | – |
EnumerateIEnumerable | 10 | 1000 | 88.02 us | 0.815 us | 0.762 us | 4.7607 | – | – | 40001 B |
ForLoop | 10 | 5000 | 52.62 us | 1.063 us | 1.138 us | – | – | – | – |
EnumerateList | 10 | 5000 | 115.18 us | 0.939 us | 0.833 us | – | – | – | 1 B |
EnumerateIEnumerable | 10 | 5000 | 449.49 us | 8.793 us | 13.689 us | 23.4375 | – | – | 200000 B |
[Record scratch sound effect here.] Hang on, what the heck?! That’s a big difference. Shocking, even.
Allow me to explain.
Struct vs Class
The for
looping pattern has existed for a long time (since the dawn of programming?), and for good reason. It is really good at its job. When you write one, you are in complete control of the code being executed. It’s zero allocation, you can exit the loop at any time, and you can even jump around (if you were so inclined). But we C# developers, we don’t like a lot of ceremony. There’s just too much to type in a for
loop. There’s other thorny things too like you need to know the length of what you want to loop over, and need a way to get each item by its index. To avoid these pesky challenges, and to reduce the dreaded ceremony, we have a cheat feature… the foreach
loop.
I said in this series that .NET is a productivity platform, and the foreach
is a great example of that. It’s easy, and you don’t have to worry about some of the details. But there are things going on under the hood to power that magic.
Let’s rewrite the two benchmarks using the foreach
loops into what the compiler will see:
[Benchmark] public long EnumerateList() { long Total = 0; for (int c = 0; c < NumberOfEnumerations; c++) { List<int>.Enumerator enumerator = _List.GetEnumerator(); while (enumerator.MoveNext()) { Total += enumerator.Current; } } return Total; } [Benchmark] public long EnumerateIEnumerable() { long Total = 0; for (int c = 0; c < NumberOfEnumerations; c++) { IEnumerator<int> enumerator = _IEnumerable.GetEnumerator(); while (enumerator.MoveNext()) { Total += enumerator.Current; } } return Total; }
See the issue? It’s easier to see now, but still very hidden.
List<T>
exposes three (!) different ways to get an enumerator:
public Enumerator GetEnumerator() => new Enumerator(this); IEnumerator<T> IEnumerable<T>.GetEnumerator() => new Enumerator(this); IEnumerator IEnumerable.GetEnumerator() => new Enumerator(this);
That Enumerator
being returned is implemented as a struct
inside of .NET:
public struct Enumerator : IEnumerator<T>, IEnumerator { private readonly List<T> _list; private int _index; private readonly int _version; [AllowNull, MaybeNull] private T _current; internal Enumerator(List<T> list) { _list = list; _index = 0; _version = list._version; _current = default; }
See it now?
It’s a tricky one.
When we do foreach
on a List<T>
we use that struct
List<T>.Enumerator
. Because it is a struct
, it lives on the stack, no allocation necessary. When we do a foreach
on IList<T>
or IEnumerable<T>
, we end up with an IEnumerator<T>
. IEnumerator<T>
is a reference type. What that means is struct List<T>.Enumerator
has to be boxed onto the heap so we can refer to it by reference. That’s why we are paying for an allocation.
[GetEnumerator
is actually pretty magical. Check out Eric Lippert’s classic post about how pattern matching is used to find the method foreach
will bind to.]
If we look at the fields of List<T>.Enumerator
we see it has a pointer to the List<T>
being enumerated, that’s going to be 4 or 8 bytes depending on 32 bit or 64 bit processor. It has int _index
and int _version
that’s 8 bytes. And it has T _current
which will depend on the type of the items contained in the list. Ours is a list of int
, so another 4 bytes. On a 64 bit processor, we’re looking at 20 bytes being allocated?
This allocation, my friends, is a tragedy. No one would ever implement an allocation when they could do a simple for
loop! But we do, because foreach
is easy.
You are probably paying this penalty all over the place without even knowing it. It’s common practice to pass around IEnumerable<T>
or IList<T>
instead of List<T>
, for really no other reason than specifying the least-specific type opens up more avenues for future change. If we passed around the concrete types (List<T>
, Dictionary<TKey, TValue>
, etc.) everywhere our code could actually loop more efficiently.
The takeaway from this whole thing is this: If your hot-path is going to be doing a foreach
, watch out, there be dragons.
Is there anything we can do about this?
I do have an extreme solution to this problem. Your first choice should be to do a for
or foreach
against the concrete type (List<T>
). But if you can’t, because you don’t control the API, or whatever, we can use a DynamicMethod to call the underlying GetEnumerator
that returns the struct
.
private static AllocationFreeForEachDelegate BuildAllocationFreeForEachDelegate(Type enumerableType) { Type itemCallbackType = typeof(StructEnumeratorForEachDelegate<TItem, TState>); MethodInfo? getEnumeratorMethod = ResolveGetEnumeratorMethodForType(enumerableType); if (getEnumeratorMethod == null) { // Fallback to allocation mode and use IEnumerable<TItem>.GetEnumerator. // Primarily for Array.Empty and Enumerable.Empty case, but also for user types. getEnumeratorMethod = s_GenericGetEnumeratorMethod; } Type enumeratorType = getEnumeratorMethod.ReturnType; bool isDisposable = typeof(IDisposable).IsAssignableFrom(enumeratorType); DynamicMethod dynamicMethod = new DynamicMethod( nameof(AllocationFreeForEach), null, new[] { typeof(TEnumerable), typeof(TState).MakeByRefType(), itemCallbackType }, typeof(AllocationFreeForEachDelegate).Module, skipVisibility: true); ILGenerator generator = dynamicMethod.GetILGenerator(); generator.DeclareLocal(enumeratorType); Label beginLoopLabel = generator.DefineLabel(); Label processCurrentLabel = generator.DefineLabel(); Label returnLabel = generator.DefineLabel(); generator.Emit(OpCodes.Ldarg_0); generator.Emit(OpCodes.Callvirt, getEnumeratorMethod); generator.Emit(OpCodes.Stloc_0); // try if (isDisposable) { generator.BeginExceptionBlock(); } generator.Emit(OpCodes.Br_S, beginLoopLabel); generator.MarkLabel(processCurrentLabel); generator.Emit(OpCodes.Ldarg_2); generator.Emit(OpCodes.Ldarg_1); generator.Emit(OpCodes.Ldloca_S, 0); generator.Emit(OpCodes.Constrained, enumeratorType); generator.Emit(OpCodes.Callvirt, s_GeneircCurrentGetMethod); generator.Emit(OpCodes.Callvirt, itemCallbackType.GetMethod("Invoke")); generator.Emit(OpCodes.Brtrue_S, beginLoopLabel); if (isDisposable) generator.Emit(OpCodes.Leave_S, returnLabel); else generator.Emit(OpCodes.Br_S, returnLabel); generator.MarkLabel(beginLoopLabel); generator.Emit(OpCodes.Ldloca_S, 0); generator.Emit(OpCodes.Constrained, enumeratorType); generator.Emit(OpCodes.Callvirt, s_MoveNextMethod); generator.Emit(OpCodes.Brtrue_S, processCurrentLabel); if (isDisposable) { generator.Emit(OpCodes.Leave_S, returnLabel); // finally generator.BeginFinallyBlock(); { generator.Emit(OpCodes.Ldloca_S, 0); generator.Emit(OpCodes.Constrained, enumeratorType); generator.Emit(OpCodes.Callvirt, s_DisposeMethod); } generator.EndExceptionBlock(); } generator.MarkLabel(returnLabel); generator.Emit(OpCodes.Ret); return (AllocationFreeForEachDelegate)dynamicMethod.CreateDelegate(typeof(AllocationFreeForEachDelegate)); }
That will allow you to do your loop like this:
[Benchmark] public long EnumerateStructEnumerable() { long Total = 0; for (int c = 0; c < NumberOfEnumerations; c++) { ListStructEnumerator<int, long>.AllocationFreeForEach( _IEnumerable, ref Total, s_EnumerateChildListOfStructsAsStructEnumerableForEachRef); } return Total; } private static readonly StructEnumeratorForEachDelegate<int, long> s_EnumerateChildListOfStructsAsStructEnumerableForEachRef = EnumerateChildListOfStructsAsStructEnumerableForEach; private static bool EnumerateChildListOfStructsAsStructEnumerableForEach(ref long total, int value) { total += value; return true; }
I have the code up on GitHub for this if you are interested. I haven’t pushed a NuGet yet for it because I’m not sure about the usefulness, generally speaking. There are cases where it is useful, absolutely, but is it widely useful? And is the API easy enough? Not sure yet!