Enumerator performance surprises

Part of the series: Performance Hacking .NET

How many times have you written a foreach statement? Probably too many times to count. How many times have you worried about the performance of that statement? Probably never.

That. Changes. Now!

Behold these benchmarks:

[MemoryDiagnoser]
public class EnumerationBenchmarks
{
	private List<int> _List;

	private IEnumerable<int> _IEnumerable;

	[Params(10)]
	public int NumberOfItems { get; set; }

	[Params(1000, 5000)]
	public int NumberOfEnumerations { get; set; }

	[GlobalSetup]
	public void GlobalSetup()
	{
		_List = new List<int>(NumberOfItems);
		_IEnumerable = _List;
		for (int i = 0; i < NumberOfItems; i++)
		{
			_List.Add(i);
		}
	}

	[Benchmark]
	public long ForLoop()
	{
		long Total = 0;

		for (int c = 0; c < NumberOfEnumerations; c++)
		{
			for (int i = 0; i < _List.Count; i++)
			{
				Total += _List[i];
			}
		}

		return Total;
	}

	[Benchmark]
	public long EnumerateList()
	{
		long Total = 0;

		for (int c = 0; c < NumberOfEnumerations; c++)
		{
			foreach (int i in _List)
			{
				Total += i;
			}
		}

		return Total;
	}

	[Benchmark]
	public long EnumerateIEnumerable()
	{
		long Total = 0;

		for (int c = 0; c < NumberOfEnumerations; c++)
		{
			foreach (int i in _IEnumerable)
			{
				Total += i;
			}
		}

		return Total;
	}
}

See anything in there that would scream performance issue? Look carefully, I dare you.

Here’s the results:

MethodNumberOfItemsNumberOfEnumerationsMeanErrorStdDevGen 0Gen 1Gen 2Allocated
ForLoop10100010.37 us0.136 us0.121 us
EnumerateList10100020.51 us0.174 us0.154 us
EnumerateIEnumerable10100088.02 us0.815 us0.762 us4.760740001 B
ForLoop10500052.62 us1.063 us1.138 us
EnumerateList105000115.18 us0.939 us0.833 us1 B
EnumerateIEnumerable105000449.49 us8.793 us13.689 us23.4375200000 B

[Record scratch sound effect here.] Hang on, what the heck?! That’s a big difference. Shocking, even.

Allow me to explain.

Struct vs Class

The for looping pattern has existed for a long time (since the dawn of programming?), and for good reason. It is really good at its job. When you write one, you are in complete control of the code being executed. It’s zero allocation, you can exit the loop at any time, and you can even jump around (if you were so inclined). But we C# developers, we don’t like a lot of ceremony. There’s just too much to type in a for loop. There’s other thorny things too like you need to know the length of what you want to loop over, and need a way to get each item by its index. To avoid these pesky challenges, and to reduce the dreaded ceremony, we have a cheat feature… the foreach loop.

I said in this series that .NET is a productivity platform, and the foreach is a great example of that. It’s easy, and you don’t have to worry about some of the details. But there are things going on under the hood to power that magic.

Let’s rewrite the two benchmarks using the foreach loops into what the compiler will see:

	[Benchmark]
	public long EnumerateList()
	{
		long Total = 0;

		for (int c = 0; c < NumberOfEnumerations; c++)
		{
			List<int>.Enumerator enumerator = _List.GetEnumerator();
			while (enumerator.MoveNext())
			{
				Total += enumerator.Current;
			}
		}

		return Total;
	}

	[Benchmark]
	public long EnumerateIEnumerable()
	{
		long Total = 0;

		for (int c = 0; c < NumberOfEnumerations; c++)
		{
			IEnumerator<int> enumerator = _IEnumerable.GetEnumerator();
			while (enumerator.MoveNext())
			{
				Total += enumerator.Current;
			}
		}

		return Total;
	}

See the issue? It’s easier to see now, but still very hidden.

List<T> exposes three (!) different ways to get an enumerator:

        public Enumerator GetEnumerator()
            => new Enumerator(this);

        IEnumerator<T> IEnumerable<T>.GetEnumerator()
            => new Enumerator(this);

        IEnumerator IEnumerable.GetEnumerator()
            => new Enumerator(this);

That Enumerator being returned is implemented as a struct inside of .NET:

public struct Enumerator : IEnumerator<T>, IEnumerator
{
	private readonly List<T> _list;
	private int _index;
	private readonly int _version;
	[AllowNull, MaybeNull] private T _current;

	internal Enumerator(List<T> list)
	{
		_list = list;
		_index = 0;
		_version = list._version;
		_current = default;
	}

See it now?

It’s a tricky one.

When we do foreach on a List<T> we use that struct List<T>.Enumerator. Because it is a struct, it lives on the stack, no allocation necessary. When we do a foreach on IList<T> or IEnumerable<T>, we end up with an IEnumerator<T>. IEnumerator<T> is a reference type. What that means is struct List<T>.Enumerator has to be boxed onto the heap so we can refer to it by reference. That’s why we are paying for an allocation.

[GetEnumerator is actually pretty magical. Check out Eric Lippert’s classic post about how pattern matching is used to find the method foreach will bind to.]

If we look at the fields of List<T>.Enumerator we see it has a pointer to the List<T> being enumerated, that’s going to be 4 or 8 bytes depending on 32 bit or 64 bit processor. It has int _index and int _version that’s 8 bytes. And it has T _current which will depend on the type of the items contained in the list. Ours is a list of int, so another 4 bytes. On a 64 bit processor, we’re looking at 20 bytes being allocated?

This allocation, my friends, is a tragedy. No one would ever implement an allocation when they could do a simple for loop! But we do, because foreach is easy.

You are probably paying this penalty all over the place without even knowing it. It’s common practice to pass around IEnumerable<T> or IList<T> instead of List<T>, for really no other reason than specifying the least-specific type opens up more avenues for future change. If we passed around the concrete types (List<T>, Dictionary<TKey, TValue>, etc.) everywhere our code could actually loop more efficiently.

The takeaway from this whole thing is this: If your hot-path is going to be doing a foreach, watch out, there be dragons.

Is there anything we can do about this?

I do have an extreme solution to this problem. Your first choice should be to do a for or foreach against the concrete type (List<T>). But if you can’t, because you don’t control the API, or whatever, we can use a DynamicMethod to call the underlying GetEnumerator that returns the struct.

private static AllocationFreeForEachDelegate BuildAllocationFreeForEachDelegate(Type enumerableType)
{
	Type itemCallbackType = typeof(StructEnumeratorForEachDelegate<TItem, TState>);

	MethodInfo? getEnumeratorMethod = ResolveGetEnumeratorMethodForType(enumerableType);
	if (getEnumeratorMethod == null)
	{
		// Fallback to allocation mode and use IEnumerable<TItem>.GetEnumerator.
		// Primarily for Array.Empty and Enumerable.Empty case, but also for user types.
		getEnumeratorMethod = s_GenericGetEnumeratorMethod;
	}

	Type enumeratorType = getEnumeratorMethod.ReturnType;
	bool isDisposable = typeof(IDisposable).IsAssignableFrom(enumeratorType);

	DynamicMethod dynamicMethod = new DynamicMethod(
		nameof(AllocationFreeForEach),
		null,
		new[] { typeof(TEnumerable), typeof(TState).MakeByRefType(), itemCallbackType },
		typeof(AllocationFreeForEachDelegate).Module,
		skipVisibility: true);

	ILGenerator generator = dynamicMethod.GetILGenerator();

	generator.DeclareLocal(enumeratorType);

	Label beginLoopLabel = generator.DefineLabel();
	Label processCurrentLabel = generator.DefineLabel();
	Label returnLabel = generator.DefineLabel();

	generator.Emit(OpCodes.Ldarg_0);
	generator.Emit(OpCodes.Callvirt, getEnumeratorMethod);
	generator.Emit(OpCodes.Stloc_0);

	// try
	if (isDisposable)
	{
		generator.BeginExceptionBlock();
	}

	generator.Emit(OpCodes.Br_S, beginLoopLabel);

	generator.MarkLabel(processCurrentLabel);

	generator.Emit(OpCodes.Ldarg_2);
	generator.Emit(OpCodes.Ldarg_1);
	generator.Emit(OpCodes.Ldloca_S, 0);
	generator.Emit(OpCodes.Constrained, enumeratorType);
	generator.Emit(OpCodes.Callvirt, s_GeneircCurrentGetMethod);

	generator.Emit(OpCodes.Callvirt, itemCallbackType.GetMethod("Invoke"));

	generator.Emit(OpCodes.Brtrue_S, beginLoopLabel);

	if (isDisposable)
		generator.Emit(OpCodes.Leave_S, returnLabel);
	else
		generator.Emit(OpCodes.Br_S, returnLabel);

	generator.MarkLabel(beginLoopLabel);

	generator.Emit(OpCodes.Ldloca_S, 0);
	generator.Emit(OpCodes.Constrained, enumeratorType);
	generator.Emit(OpCodes.Callvirt, s_MoveNextMethod);

	generator.Emit(OpCodes.Brtrue_S, processCurrentLabel);

	if (isDisposable)
	{
		generator.Emit(OpCodes.Leave_S, returnLabel);

		// finally
		generator.BeginFinallyBlock();
		{
			generator.Emit(OpCodes.Ldloca_S, 0);
			generator.Emit(OpCodes.Constrained, enumeratorType);
			generator.Emit(OpCodes.Callvirt, s_DisposeMethod);
		}

		generator.EndExceptionBlock();
	}

	generator.MarkLabel(returnLabel);

	generator.Emit(OpCodes.Ret);

	return (AllocationFreeForEachDelegate)dynamicMethod.CreateDelegate(typeof(AllocationFreeForEachDelegate));
}

That will allow you to do your loop like this:

[Benchmark]
public long EnumerateStructEnumerable()
{
	long Total = 0;

	for (int c = 0; c < NumberOfEnumerations; c++)
	{
		ListStructEnumerator<int, long>.AllocationFreeForEach(
			_IEnumerable,
			ref Total,
			s_EnumerateChildListOfStructsAsStructEnumerableForEachRef);
	}

	return Total;
}

private static readonly StructEnumeratorForEachDelegate<int, long> s_EnumerateChildListOfStructsAsStructEnumerableForEachRef = EnumerateChildListOfStructsAsStructEnumerableForEach;

private static bool EnumerateChildListOfStructsAsStructEnumerableForEach(ref long total, int value)
{
	total += value;

	return true;
}

I have the code up on GitHub for this if you are interested. I haven’t pushed a NuGet yet for it because I’m not sure about the usefulness, generally speaking. There are cases where it is useful, absolutely, but is it widely useful? And is the API easy enough? Not sure yet!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.