At the height of the WCF (Windows Communication Foundation)/SOA (service-oriented architecture) era I took a job working on a payment processing gateway that would eventually grow to handle tens of millions of transactions a day, conservatively. It was deployed as a bunch of Windows Services in the self-hosted WCF style. To protect our customers (and satisfy our PCI DSS auditors) we put in place very tight controls separating those whose job it was to operate the system from those whose job it was to develop it. No developer was ever allowed to access anything in production. This was great, and necessary, but it also meant that we were blind when it came to observing what the system was doing. As the load grew we had to learn how to do a bunch of things totally hands-off: support our application, monitor its performance, deploy it, configure it, scale it, understand how it was being used by customers, and even how to sell our customers' data back to them. Some things we were really great at and some things were always a struggle.
This series is about the logging and debugging lessons I learned working in that environment for 10+ years. Most of this is probably standard practice these days, but maybe it will help some of you along.
Why logging?
No piece of software is perfect and there are always bumps along the way. After you’ve had to answer questions in the dark a few times about what your code is doing, and why something isn’t working, you quickly realize how important it is to log/trace the key parts of your logic. Doing it well is kind of an art form. Too much and you are buried. Too little and you can’t figure anything out. As all the devs learned this lesson our services started to log a lot of data, and the data became more and more critical to our enterprise. The first question we would ask the support team, or QA, when they reported an issue was: where are the logs? At first, getting logs was hard, and for a long time working with them was even harder. But this is one area in which we eventually excelled.
Part 1: Service debugging UI.
Part 2: Structured logging & message correlation.
Part 3: Working with your log data.
Part 4: OpenTelemetry integration.