Enumerator performance surprises

Part of the series: Performance Hacking .NET

How many times have you written a foreach statement? Probably too many times to count. How many times have you worried about the performance of that statement? Probably never.

That. Changes. Now!

Behold these benchmarks:

[MemoryDiagnoser]
public class EnumerationBenchmarks
{
	private List<int> _List;

	private IEnumerable<int> _IEnumerable;

	[Params(10)]
	public int NumberOfItems { get; set; }

	[Params(1000, 5000)]
	public int NumberOfEnumerations { get; set; }

	[GlobalSetup]
	public void GlobalSetup()
	{
		_List = new List<int>(NumberOfItems);
		_IEnumerable = _List;
		for (int i = 0; i < NumberOfItems; i++)
		{
			_List.Add(i);
		}
	}

	[Benchmark]
	public long ForLoop()
	{
		long Total = 0;

		for (int c = 0; c < NumberOfEnumerations; c++)
		{
			for (int i = 0; i < _List.Count; i++)
			{
				Total += _List[i];
			}
		}

		return Total;
	}

	[Benchmark]
	public long EnumerateList()
	{
		long Total = 0;

		for (int c = 0; c < NumberOfEnumerations; c++)
		{
			foreach (int i in _List)
			{
				Total += i;
			}
		}

		return Total;
	}

	[Benchmark]
	public long EnumerateIEnumerable()
	{
		long Total = 0;

		for (int c = 0; c < NumberOfEnumerations; c++)
		{
			foreach (int i in _IEnumerable)
			{
				Total += i;
			}
		}

		return Total;
	}
}

See anything in there that would scream performance issue? Look carefully, I dare you.

Here’s the results:

Method	NumberOfItems	NumberOfEnumerations	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
ForLoop	10	1000	10.37 us	0.136 us	0.121 us	–	–	–	–
EnumerateList	10	1000	20.51 us	0.174 us	0.154 us	–	–	–	–
EnumerateIEnumerable	10	1000	88.02 us	0.815 us	0.762 us	4.7607	–	–	40001 B
ForLoop	10	5000	52.62 us	1.063 us	1.138 us	–	–	–	–
EnumerateList	10	5000	115.18 us	0.939 us	0.833 us	–	–	–	1 B
EnumerateIEnumerable	10	5000	449.49 us	8.793 us	13.689 us	23.4375	–	–	200000 B

[Record scratch sound effect here.] Hang on, what the heck?! That’s a big difference. Shocking, even.

Allow me to explain.

Struct vs Class

The for looping pattern has existed for a long time (since the dawn of programming?), and for good reason. It is really good at its job. When you write one, you are in complete control of the code being executed. It’s zero allocation, you can exit the loop at any time, and you can even jump around (if you were so inclined). But we C# developers, we don’t like a lot of ceremony. There’s just too much to type in a for loop. There’s other thorny things too like you need to know the length of what you want to loop over, and need a way to get each item by its index. To avoid these pesky challenges, and to reduce the dreaded ceremony, we have a ~~cheat~~ feature… the foreach loop.

I said in this series that .NET is a productivity platform, and the foreach is a great example of that. It’s easy, and you don’t have to worry about some of the details. But there are things going on under the hood to power that magic.

Let’s rewrite the two benchmarks using the foreach loops into what the compiler will see:

	[Benchmark]
	public long EnumerateList()
	{
		long Total = 0;

		for (int c = 0; c < NumberOfEnumerations; c++)
		{
			List<int>.Enumerator enumerator = _List.GetEnumerator();
			while (enumerator.MoveNext())
			{
				Total += enumerator.Current;
			}
		}

		return Total;
	}

	[Benchmark]
	public long EnumerateIEnumerable()
	{
		long Total = 0;

		for (int c = 0; c < NumberOfEnumerations; c++)
		{
			IEnumerator<int> enumerator = _IEnumerable.GetEnumerator();
			while (enumerator.MoveNext())
			{
				Total += enumerator.Current;
			}
		}

		return Total;
	}

See the issue? It’s easier to see now, but still very hidden.

List<T> exposes three (!) different ways to get an enumerator:

        public Enumerator GetEnumerator()
            => new Enumerator(this);

        IEnumerator<T> IEnumerable<T>.GetEnumerator()
            => new Enumerator(this);

        IEnumerator IEnumerable.GetEnumerator()
            => new Enumerator(this);

That Enumerator being returned is implemented as a struct inside of .NET:

public struct Enumerator : IEnumerator<T>, IEnumerator
{
	private readonly List<T> _list;
	private int _index;
	private readonly int _version;
	[AllowNull, MaybeNull] private T _current;

	internal Enumerator(List<T> list)
	{
		_list = list;
		_index = 0;
		_version = list._version;
		_current = default;
	}

See it now?

It’s a tricky one.

When we do foreach on a List<T> we use that struct List<T>.Enumerator. Because it is a struct, it lives on the stack, no allocation necessary. When we do a foreach on IList<T> or IEnumerable<T>, we end up with an IEnumerator<T>. IEnumerator<T> is a reference type. What that means is struct List<T>.Enumerator has to be boxed onto the heap so we can refer to it by reference. That’s why we are paying for an allocation.

[GetEnumerator is actually pretty magical. Check out Eric Lippert’s classic post about how pattern matching is used to find the method foreach will bind to.]

If we look at the fields of List<T>.Enumerator we see it has a pointer to the List<T> being enumerated, that’s going to be 4 or 8 bytes depending on 32 bit or 64 bit processor. It has int _index and int _version that’s 8 bytes. And it has T _current which will depend on the type of the items contained in the list. Ours is a list of int, so another 4 bytes. On a 64 bit processor, we’re looking at 20 bytes being allocated?

This allocation, my friends, is a tragedy. No one would ever implement an allocation when they could do a simple for loop! But we do, because foreach is easy.

You are probably paying this penalty all over the place without even knowing it. It’s common practice to pass around IEnumerable<T> or IList<T> instead of List<T>, for really no other reason than specifying the least-specific type opens up more avenues for future change. If we passed around the concrete types (List<T>, Dictionary<TKey, TValue>, etc.) everywhere our code could actually loop more efficiently.

The takeaway from this whole thing is this: If your hot-path is going to be doing a foreach, watch out, there be dragons.

Is there anything we can do about this?

I do have an extreme solution to this problem. Your first choice should be to do a for or foreach against the concrete type (List<T>). But if you can’t, because you don’t control the API, or whatever, we can use a DynamicMethod to call the underlying GetEnumerator that returns the struct.

private static AllocationFreeForEachDelegate BuildAllocationFreeForEachDelegate(Type enumerableType)
{
	Type itemCallbackType = typeof(StructEnumeratorForEachDelegate<TItem, TState>);

	MethodInfo? getEnumeratorMethod = ResolveGetEnumeratorMethodForType(enumerableType);
	if (getEnumeratorMethod == null)
	{
		// Fallback to allocation mode and use IEnumerable<TItem>.GetEnumerator.
		// Primarily for Array.Empty and Enumerable.Empty case, but also for user types.
		getEnumeratorMethod = s_GenericGetEnumeratorMethod;
	}

	Type enumeratorType = getEnumeratorMethod.ReturnType;
	bool isDisposable = typeof(IDisposable).IsAssignableFrom(enumeratorType);

	DynamicMethod dynamicMethod = new DynamicMethod(
		nameof(AllocationFreeForEach),
		null,
		new[] { typeof(TEnumerable), typeof(TState).MakeByRefType(), itemCallbackType },
		typeof(AllocationFreeForEachDelegate).Module,
		skipVisibility: true);

	ILGenerator generator = dynamicMethod.GetILGenerator();

	generator.DeclareLocal(enumeratorType);

	Label beginLoopLabel = generator.DefineLabel();
	Label processCurrentLabel = generator.DefineLabel();
	Label returnLabel = generator.DefineLabel();

	generator.Emit(OpCodes.Ldarg_0);
	generator.Emit(OpCodes.Callvirt, getEnumeratorMethod);
	generator.Emit(OpCodes.Stloc_0);

	// try
	if (isDisposable)
	{
		generator.BeginExceptionBlock();
	}

	generator.Emit(OpCodes.Br_S, beginLoopLabel);

	generator.MarkLabel(processCurrentLabel);

	generator.Emit(OpCodes.Ldarg_2);
	generator.Emit(OpCodes.Ldarg_1);
	generator.Emit(OpCodes.Ldloca_S, 0);
	generator.Emit(OpCodes.Constrained, enumeratorType);
	generator.Emit(OpCodes.Callvirt, s_GeneircCurrentGetMethod);

	generator.Emit(OpCodes.Callvirt, itemCallbackType.GetMethod("Invoke"));

	generator.Emit(OpCodes.Brtrue_S, beginLoopLabel);

	if (isDisposable)
		generator.Emit(OpCodes.Leave_S, returnLabel);
	else
		generator.Emit(OpCodes.Br_S, returnLabel);

	generator.MarkLabel(beginLoopLabel);

	generator.Emit(OpCodes.Ldloca_S, 0);
	generator.Emit(OpCodes.Constrained, enumeratorType);
	generator.Emit(OpCodes.Callvirt, s_MoveNextMethod);

	generator.Emit(OpCodes.Brtrue_S, processCurrentLabel);

	if (isDisposable)
	{
		generator.Emit(OpCodes.Leave_S, returnLabel);

		// finally
		generator.BeginFinallyBlock();
		{
			generator.Emit(OpCodes.Ldloca_S, 0);
			generator.Emit(OpCodes.Constrained, enumeratorType);
			generator.Emit(OpCodes.Callvirt, s_DisposeMethod);
		}

		generator.EndExceptionBlock();
	}

	generator.MarkLabel(returnLabel);

	generator.Emit(OpCodes.Ret);

	return (AllocationFreeForEachDelegate)dynamicMethod.CreateDelegate(typeof(AllocationFreeForEachDelegate));
}

That will allow you to do your loop like this:

[Benchmark]
public long EnumerateStructEnumerable()
{
	long Total = 0;

	for (int c = 0; c < NumberOfEnumerations; c++)
	{
		ListStructEnumerator<int, long>.AllocationFreeForEach(
			_IEnumerable,
			ref Total,
			s_EnumerateChildListOfStructsAsStructEnumerableForEachRef);
	}

	return Total;
}

private static readonly StructEnumeratorForEachDelegate<int, long> s_EnumerateChildListOfStructsAsStructEnumerableForEachRef = EnumerateChildListOfStructsAsStructEnumerableForEach;

private static bool EnumerateChildListOfStructsAsStructEnumerableForEach(ref long total, int value)
{
	total += value;

	return true;
}

I have the code up on GitHub for this if you are interested. I haven’t pushed a NuGet yet for it because I’m not sure about the usefulness, generally speaking. There are cases where it is useful, absolutely, but is it widely useful? And is the API easy enough? Not sure yet!

Struct vs Class

Is there anything we can do about this?

Leave a Reply Cancel reply