Part of the series: Logging and Debugging High Throughput Services
Also part of the series: Fun with OpenTelemetry .NET
Update: The code discussed in this post is now available on GitHub.
The logging we have been talking about is great for analysis, but sometimes you want something even easier. Wouldn’t it be great if there was a tool that could just show us what our applications are doing? If you haven’t already, check out OpenTelemetry.
What is it? A lot of people ask me this. Really, it’s just a specification. Just a document. The specification has two specific areas, distributed tracing and metrics. There might be another area added for logging in the future, but for now, that’s what we’ve got.
Distributed tracing is really what this blog series has been talking about. Tracing a single request throughout all the applications in our enterprise, and maybe our partners’ enterprises too. Metrics are more like performance counters. If you wanted to know, how many requests am I getting per second? That would be metrics.
So there’s this open specification and then there are libraries available in (practically) every language implementing that spec. You really don’t need to worry about the specification, you just grab a library and off you go. We have .NET, Java, Python, Go, Javascript, Erlang, Rust, PHP, C++, Ruby, Swift, and more. The list just goes on and on. If your favorite isn’t on there, hey, it’s an open spec, go and get started!
Any way you plug one of these OpenTelemetry libraries into your application, and it will start capturing and spitting out data. It will also transmit some state about the active trace to downstream systems as a request is processed. If those systems are also OpenTelemetry-aware, they will add to your trace. On and on. What data is captured, how much is captured, and where it goes is really up to you.
I’ve been doing a lot of work on the .NET library. The reason so much time passed between the last post and this one is the OpenTelemetry .NET library went through a big refactor. I didn’t want to post a bunch of code that would be outdated as soon as it was done!
It’s pretty interesting what went on. The .NET Team at Microsoft basically built the OpenTelemetry tracing API directly into System.Diagnostics. It will release with .NET 5 via the System.Diagnostics.DiagnosticSource package. We get this new thing called an ActivitySource and a bunch of new stuff on the existing Activity object. What this all means is library authors can use this new built-in stuff to instrument their libraries and then OpenTelemetry (or anyone, really) can consume the data and do interesting things with it.
The .NET library currently has instrumentation for capturing incoming Http requests. It has instrumentation for capturing outgoing Http requests, outgoing SqlClient requests, outgoing StackExchange.Redis requests, and outgoing gRPC requests. In a secondary repo there’s a library for Azure clients, for EntityFramework Core, and a MassTransit library is open in PR. There’s an example using RabbitMQ. There’s a blog series for NSericeBus.
There’s a lot of stuff out there to play with and more things are being added everyday. I’ll probably start up a blog series about it next.
But OK, OK. I’ve gotten way ahead of myself. Let’s look at it, shall we?
For this demo I built a web frontend (.NET Core Razor Pages) and an API service (.NET Core MVC). The web frontend allows people to enter a search query. The query is sent to the API service which will write the query to SQL, make an Http request to Google, and then return the response for display to the user. I’ll admit, this is an offensively contrived example. I’m sorry about that. There are so many blogs showing something like this, I really wanted to start with something more interesting. But for this series, it really needed to be this. I will do more interesting things, I promise!
What I want to show is the trace flowing through 2 fully instrumented applications (web frontend & API service) and then exiting to two dependencies (SQL & Google).
Here’s what a trace looks like:
Neat, right? Well, that is not OpenTelemetry. That is Jaeger UI, A free tool you can use to view OpenTelemetry trace data. All of the OpenTelemetry libraries allow you to plug in different exporters. For example, there’s also Zipkin. A whole bunch are supported and a lot of the contribution being done to OpenTelemetry is by vendors who provide UIs and ways to work with the harvested data.
What we can see in that image is a visualization (Gantt chart) of (essentially) the same data we have been pumping into our logs. If you click on one of the items, you get a bunch of details:
How does it work? It’s actually very simple. When the page is first requested in the web frontend, we start a System.Diagnostics.Activity. That Activity is assigned a TraceId and a SpanId based on the W3C Trace Context specification. That Activity is observable by OpenTelemetry, which adds some standard information (“tags”) to it. One of the things the OpenTelemetry specification does for us is define the standard tags that all the language-specific libraries should use. When calling the API service, OpenTelemetry adds two headers to the outgoing message: traceparent
& tracestate
. Those are part of that same Trace Context specification. When the API service receives the call, it also starts a System.Diagnostics.Activity, but instead of generating a new TraceId, it uses the one that came over in the header. The trace is restored! The same thing happens when it calls SQL, or Google.
In each application when that Activity is completed OpenTelemetry exports it to what has been configured. Above I’m using the Jaeger Exporter, which communicates the data over UDP to an agent. Jaeger’s job is to put it all together based on the TraceId (what we called CorrelationId in previous posts).
Now of course SQL and Google don’t report data back into our Jaeger, so those traces are just dependencies. But any service that is reporting, we can see everything it has reported, and its time is sliced out for us. It has been fun watching different groups at my company get OpenTelemetry up and running. Each time more of the picture comes into focus about what those systems are doing.
The value proposition was probably clear the moment you saw the trace visualization, but just to be clear, if you’ve ever struggled with your logs trying to understand where the time is going processing something, struggle no more!
But how much code does it require?
The best part is, there’s really not much code required at all. Everything I’ve shown is pretty much out of the box. If you are keen to write some code, there’s a lot more you can do with OpenTelemetry, but we’ll get more into that in future posts.
Step 1: Packages
API Service is using these:
<ItemGroup> <PackageReference Include="OpenTelemetry.Exporter.Jaeger" Version="0.4.0-beta.2" /> <PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="0.4.0-beta.2" /> <PackageReference Include="OpenTelemetry.Instrumentation.AspNetCore" Version="0.4.0-beta.2" /> <PackageReference Include="OpenTelemetry.Instrumentation.Http" Version="0.4.0-beta.2" /> <PackageReference Include="OpenTelemetry.Instrumentation.SqlClient" Version="0.4.0-beta.2" /> </ItemGroup>
There’s a lot of packages. OpenTelemetry .NET is kind of set up a la carte style, you have to add all the specific things you want.
Step 2: Startup
In your startup, drop this:
services.AddOpenTelemetry(builder => builder .SetResource(new Resource(new Dictionary<string, object> { [Resource.ServiceNameKey] = "WebService", ["service.datacenterId"] = _Configuration.GetValue<int?>("DatacenterId") ?? 0 })) .AddAspNetCoreInstrumentation() .AddHttpInstrumentation() .AddSqlClientDependencyInstrumentation() .UseJaegerExporter());
Note: Check out the source for the web frontend version (it’s almost identical).
That’s it. Everything else is automatic. Automagical, really. And you don’t even have to add “service.datacenterId,” I just did that to demo you can add your own stuff if you want to!
How does OpenTelemetry .NET trap all the events?
There’s really no one answer. SqlClient uses an EventSource to log events, the instrumentation plugs into that. HttpClient uses a DiagnosticSource. StackExchange.Redis has its own custom thing. Each thing you want to instrument has to be analyzed and solved. But the friendly OpenTelemetry .NET contributors are taking care of all this for you. The most difficult area has actually been the .NET Framework, because in some cases it wasn’t written to be observable. The outgoing HttpWebRequest instrumentation I worked on and it was one of the most difficult things I’ve ever done. Hours reading the reference source figuring out where to monkey patch hooks into the code. Here’s the file if you want to see how it’s done. I started with the official one, but it had some issues which required some significant modifications.
All the time spent building instrumentation is why the .NET 5 ActivitySource is important! If we can get library authors to instrument their code using that API, no extra work is needed to capture it into our traces.
Wrap Up
So should you log or use Open Telemetry? Both. Your logs are invaluable. When you need to understand what went wrong, the logs have the detail. But OpenTelemetry is also a great tool to understand what your application is doing and to quickly find bottlenecks in your processes. OpenTelemetry has a lot of features I skipped over. You can turn on a sampler and just take 5% of your data, or whatever number you feel is good for establishing trends.
I’ll leave you with one tip: Be sure to use your TraceId (System.Diagnostics.Activity.Current.TraceId) as your CorrelationId in your logs so you can move between ELK (or whatever tooling you choose) and your traces (Jaeger, Zipkin, etc.) seamlessly. Some products, like Application Insights, do that for you in a single UI.