Why there needs to be a 4th pillar of Observability
Logs are the core of the human-machine interface for software developers and operators. Historically, they are very much like cave paintings: our first attempt to express and understand how our software was working.
For several decades, logs were an island of calm in a rapidly changing technological ecosystem. Logs remained the same even as software services became web-based and grew in scale. We added context to make them easier to search through, moved them to a structured format, and over the past decade or two, started to aggregate and index them for ease of use.
And yet, at some point, that wasn’t enough. Thus, the three pillars of Observability were born: Logs, Metrics, and Traces.
Why do we need Metrics?
One of the most common questions we ask ourselves while monitoring a web server is, “How many requests for that URL did we get over the last minute?” To answer this question using logs, we must collect logs from all servers, parse individual lines, filter the relevant URLs, and count the results.
Whether we build a dedicated pipeline for this metric or calculate it by querying a fully indexed logs database, it’s a long and arduous process for both man and machine and unlikely to give us results in real-time.
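To make the cost concrete, here is a minimal sketch of the log-based approach. The log layout ("timestamp method url status"), the sample lines, and the function name are assumptions made up for illustration, not the format of any particular server.

```python
from collections import Counter

# Hypothetical access-log lines already collected from several servers.
# The "timestamp method url status" layout is an assumption for this sketch.
LOG_LINES = [
    "2024-01-01T12:00:05Z GET /checkout 200",
    "2024-01-01T12:00:17Z GET /home 200",
    "2024-01-01T12:00:42Z GET /checkout 500",
    "2024-01-01T12:01:03Z GET /checkout 200",
]


def count_requests_per_minute(lines, url):
    """Parse every line, keep the ones for `url`, and count them per minute."""
    counts = Counter()
    for line in lines:
        timestamp, _method, path, _status = line.split()
        if path == url:
            # Truncate the timestamp to "YYYY-MM-DDTHH:MM" to bucket by minute.
            counts[timestamp[:16]] += 1
    return dict(counts)


per_minute = count_requests_per_minute(LOG_LINES, "/checkout")
```

Even in this toy form, every line has to be shipped, parsed, and filtered before a single number comes out.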
Think of metrics as a way to efficiently aggregate multiple log lines of the same kind at the source application. By counting each event (or using another form of aggregation, such as summing), you can efficiently get a real-time value of the behavior of your application as a whole.
A much more efficient way to get high-quality data is to create a counter inside the application and export it to the Observability stack, which will aggregate it and produce the relevant reports.
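As a rough sketch of what such a counter could look like, the class below keeps per-URL totals in memory and hands them over on request. `RequestCounter`, `handle_request`, and the export step are hypothetical names for illustration; a real Observability stack would supply its own client library and export mechanism.

```python
import threading
from collections import Counter


class RequestCounter:
    """Hypothetical in-process counter; not the API of any specific library."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def increment(self, url):
        # One cheap in-memory update per request instead of one log line.
        with self._lock:
            self._counts[url] += 1

    def export(self):
        """Return the totals gathered so far and reset them for the next interval."""
        with self._lock:
            totals = dict(self._counts)
            self._counts.clear()
        return totals


requests_total = RequestCounter()


def handle_request(url):
    requests_total.increment(url)
    return "ok"


for url in ["/checkout", "/home", "/checkout"]:
    handle_request(url)

# In a real setup this would be shipped to the Observability stack periodically.
exported = requests_total.export()
```

The application only pays for an increment per request, and the aggregated totals are all that has to leave the process.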
So where does Distributed Tracing come in?
Modern web applications are running on a much grander scale than ever before. We shifted our engineering paradigms and have adopted new architectural patterns, such as microservices and reactive programming.
Unfortunately, this has fundamentally broken the unwritten promise of logs: that we can tell the story by connecting the dots one log line at a time. One can no longer assume that two consecutive log lines are part of the same request, or even use process and thread IDs to build the timeline.
Distributed Tracing is a way to generate the timeline of individual requests and other processing tasks. This way, we can easily keep track of each step within the flow, even as it crosses service and functional boundaries.
What’s still missing?
By adding Metrics and Distributed Tracing, the three pillars of Observability significantly improved the operational paradigm of modern cloud-native applications.
Metrics allow us to bind log lines vertically and see how the system behaves over many requests. Tracing allows us to bind log lines horizontally and know how the system behaves through the lifespan of a single request. Both tools are super valuable for understanding the system as a whole and excite SREs and architects across the globe.
And yet, for most software organizations, the software developer is the most common engineering role: those poor souls who spend most of their time writing and debugging code.
We shift responsibility left and want engineers to own their code across the whole software development lifecycle, all the way to production. They don’t care about the number of requests or how requests cross service boundaries. What they want to know is how the code behaves.
What does it take to understand the code?
The incredible power of modern code is that the whole is far greater than the sum of its parts. Each variable is an abstraction, combining code and data to provide superb power with only a few characters of text. The layers stack on top of each other.
The code in question might be your code, or it might be first-, second-, and third-party packages and services, many of which are open-source. The data comes from various configurations, databases, caches, user settings, user inputs, feature flags, and more. Add to that the current state of the application, which often brings its own set of caveats, especially for long-running processes.
Squeezing that invaluable context into a single log line is no picnic. When stringifying primitive values into a log line, you lose some of the finer points, such as type information. When stringifying complex objects, the challenge is even greater.
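A tiny illustration of the type-information problem, using a made-up `log_line` helper: two calls with differently typed arguments produce exactly the same text.

```python
from decimal import Decimal


def log_line(user_id, amount):
    """Hypothetical helper that formats values into a log line."""
    return f"charging user={user_id} amount={amount}"


# An integer ID and a Decimal amount...
line_from_typed_values = log_line(42, Decimal("10.50"))
# ...versus two plain strings, perhaps passed through unparsed from user input.
line_from_strings = log_line("42", "10.50")

# Both lines read "charging user=42 amount=10.50".
lines_are_identical = line_from_typed_values == line_from_strings
```

Whoever reads the log later cannot tell whether the code was handling a number or a string.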
Will you take a lean approach and miss out on invaluable information? Or will you take a deeper capture and impact the application’s performance? Chances are, you won’t bother in the first place, and pray that whoever built the library provided a decent stringification flow that won’t do too badly on either front.
Even worse, the current line is only a tiny part of the application state. What about the stack trace, the request context, or other valuable information?
What’s better than logs? Snapshots.
Snapshots are the fourth pillar of Observability, the one that meets that need. By capturing most of the relevant application state, you get a clear, detailed, high-fidelity image of what’s happening. To paraphrase: a Snapshot is worth a thousand log lines.
Snapshots provide everything you need to know. Variables are captured with full fidelity, maintaining type information and exact representation. Objects are captured by individual attributes, and collections are appropriately enumerated. The stack trace and other global variables are readily available.
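To give a feel for what that means in practice, here is a deliberately simplified sketch that captures the caller's local variables with their types, expands objects and collections, and records the call stack. It is not how any particular product works; `describe`, `take_snapshot`, and the `Cart` example are hypothetical, and the sketch relies on CPython's frame introspection.

```python
import inspect
import traceback


def describe(value, depth=2):
    """Record a value's type and representation, expanding it up to `depth` levels."""
    record = {"type": type(value).__name__, "repr": repr(value)}
    if depth > 0:
        if isinstance(value, dict):
            record["items"] = {repr(k): describe(v, depth - 1) for k, v in value.items()}
        elif isinstance(value, (list, tuple, set)):
            # Collections are enumerated element by element.
            record["items"] = [describe(v, depth - 1) for v in value]
        elif hasattr(value, "__dict__"):
            # Objects are captured attribute by attribute.
            record["attributes"] = {k: describe(v, depth - 1) for k, v in vars(value).items()}
    return record


def take_snapshot():
    """Capture the calling function's locals and the stack that led to it."""
    frame = inspect.currentframe().f_back  # the caller's frame (CPython-specific)
    return {
        "function": frame.f_code.co_name,
        "line": frame.f_lineno,
        "locals": {name: describe(value) for name, value in frame.f_locals.items()},
        "stack": [entry.name for entry in traceback.extract_stack(frame)],
    }


class Cart:
    def __init__(self, items):
        self.items = items


def checkout(user_id):
    cart = Cart(["book", "pen"])
    return take_snapshot()


snapshot = checkout(42)
```

Here `snapshot["locals"]["user_id"]` keeps the fact that the value is an `int`, and the cart's items are listed one by one, which is exactly the detail a single log line tends to lose.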
As is often the case with software engineering, Snapshots are not a new concept. Operating systems such as Linux and Windows have had snapshot tools (core dumps) for years, used to analyze kernel and application crashes. Error monitoring tools such as Sentry or Bugsnag utilize (limited) snapshotting capabilities focused on errors. For more recent examples, developer Observability platforms such as Rookout are heavily focused on Snapshots.
How do we use Snapshots?
To meet the needs of modern development, we need to put snapshots within easy reach of every developer. We need to give them the ability to decide ahead of time which obscure edge cases to snapshot for ease of reproduction and fixing. We must allow them to snapshot unexpected events in real-time to understand and remediate them. We should also build monitoring tools that intelligently identify and snapshot interesting events for easy analysis. Lastly, we must build automation engines that correlate data from other sources and automatically collect snapshots.
Snapshots are the key to unlocking peak efficiency and effectiveness for engineering organizations in these turbulent times. Even more important is the potential impact on engineering culture. By empowering engineers to witness how their code runs in production, we promote a true shift-left culture and create day-to-day ownership of their code across the software development lifecycle.
After all, developers deserve a pillar too.