Written in collaboration with Mickael Alliel
Modern software production stops for no one, and everyone is needed to keep it rolling. Every dev is on-call. Great speed and friction produce a lot of heat, and when everything is on fire all the time, even the best devs and engineers struggle to keep the train speeding onwards without getting burned.
What makes maintaining modern production so hard? And what is the difference between being good and being bad at dev-on-call? Let’s dive in and see.
In days gone by, software projects were far simpler things than what we know today. As we moved from single-process desktop apps to large-scale, distributed, cloud-based solutions, simplicity was run over by the move-fast-and-break-things truck of complexity. Supporting and maintaining software evolved from a simple task carried out by small teams with a basic skill set into a company-wide effort requiring our best engineers.
Nowadays, software projects comprise multiple services and microservices distributed in the cloud (as well as on-prem, at the edge, and on physical devices). Each service is created by a different dev, a different team, and maybe even a different department or a third party. However, all of these parts must harmoniously play together as a beautiful orchestra, and as we’ve mentioned before: stopping or pausing is not an option.
It doesn’t matter if a company is aiming for 99.9 or 99.99 percent uptime, or whether it settles for a mere 90 percent. There is no real way of avoiding a 24/7 dev availability pattern. This brought about the ridiculously fast growth of solutions like PagerDuty, Opsgenie, VictorOps, and more.
So now we page, alert, and wake our devs around the clock. But even if they can shake the daze of sleep and sand from their eyes at 3 am, can we expect them to succeed? It turns out that the late hours are the least of your devs’ problems.
The following is a thorough yet incomplete list of the challenges devs often face while being on-call. To understand the hardships devs go through when they’re on-call, you’ll have to put yourself in a developer’s shoes for the next few paragraphs. It’s going to be quite a journey. Are you ready? Here we go.
Context-switches: Being dev-on-call takes a toll on your mental health. You face context-switches between your regular tasks and the production issues that keep popping up every so often. Each one requires you to stop everything you’re doing and take care of the issue at hand. Good luck going back to… What was it I was doing again?
Handoffs: With software constantly becoming more complex, it’s rare for one developer to have all the knowledge, skill, and expertise to fully resolve an issue. Incidents often need to involve multiple team members, with the case escalated or handed off to a colleague. Also, let’s be honest for a sec here: when you’re fixing something in the middle of the night, you just want to go back to sleep. The last thing you want to do is write down what happened and how you fixed it. By the time the next person is on-call, you’ll probably have forgotten to tell them some crucial piece of info that would’ve saved the day when the same issue occurred again.
Stress and pressure: This one is kind of a no-brainer. You’re working late, long hours while being the sole person responsible for solving critical issues that may pop up at unexpected times. The pressure IS ON (we’ll save you an extra Queen reference here)! Joining this party are also blame, shame, public embarrassment, and their millennial compatriot - FOMO (fear of missing out).
After a 10-hour workday, you go back home, only to be called on to take care of something yet again. You may have to cancel social events you wanted to attend. If you’re in an adventurous mood and finally go out to dinner with a friend, you’ll inevitably need to take out your laptop because an alert has just popped up. Tired, you fix the issue rapidly, but then, when something else breaks down because of it, you’re the one getting blamed.
Dependency on others and proprietary practices: You might be navigating unfamiliar territory, without enough familiarity with the code where the issue arises. Not knowing how to query logs efficiently, or what to look for in APMs and metrics systems, can make dev-on-call duty unbearable.
Missing internal documentation on resolving the issue: There will always be a moment when you stumble upon something that was done six months ago, went undocumented, and needs to be fixed. This takes me back to the time I was renewing an SSL certificate to use in a Google Cloud Storage public endpoint. I found a bash script buried deep in folders nobody had ever bothered to check. The script used a command-line tool that had been deprecated, renamed, and had its default configuration changed since the last time we used it. How was I supposed to know that Google accepts ‘ec256’-generated private keys, but our command-line tool generated ‘ec384’ keys by default? When Google fails, it just says it failed, nothing else. Sometimes Google just doesn’t have all the answers.
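To make that failure mode concrete, here’s a minimal sketch, assuming the `openssl` CLI is available; the file names and helper functions are illustrative, not the actual script. The point is to request the curve the endpoint accepts explicitly and verify it before uploading, instead of trusting a tool’s defaults:

```python
import subprocess

def gen_ec_key(curve: str, path: str) -> None:
    """Generate an EC private key on an explicitly named curve via openssl."""
    subprocess.run(
        ["openssl", "ecparam", "-name", curve, "-genkey", "-noout", "-out", path],
        check=True,
    )

def describe_key(path: str) -> str:
    """Dump the key as text so the curve (its ASN1 OID) can be inspected."""
    result = subprocess.run(
        ["openssl", "ec", "-in", path, "-noout", "-text"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

# The deprecated tool silently defaulted to P-384 ('ec384'):
gen_ec_key("secp384r1", "key-p384.pem")
# What the endpoint actually accepted was P-256 ('ec256'):
gen_ec_key("prime256v1", "key-p256.pem")

# Verify the curve up front, rather than letting the upload fail opaquely:
assert "prime256v1" in describe_key("key-p256.pem")
```

The lesson is less about these specific curves and more about pinning and checking defaults explicitly, so the next dev-on-call doesn’t have to rediscover them at 3 AM.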
Limited access: You’re on-call. You’re the one person who is supposed to take care of issues when nobody else can, and yet you don’t have access to the database. You cannot update users, nor can you run the one script that could save the company, since it requires a password you don’t have. It’s 2:30 AM and, of course, no one with the ability to help is answering their phone.
Limited visibility/observability: New components and systems are added to software projects on a daily basis, and code depth constantly increases. Even with all the available logging, APM, and tracing solutions, you often find that the answer to the problem you’re trying to solve is beyond your reach. The issues that have logs, traces, exceptions, etc. in the first place are the ones we already know about. What about all the rest? It’s rare that humans (devs included) do a good job of predicting the future.
Once, when I was on-call, I had an easy task to implement: sending an SMS with a confirmation code. I thought to myself, “this is a perfect serverless use-case,” and went on to write a Lambda function, only to forget that the easiest tasks always come back to haunt me. Lambdas are a pain to debug, since there was no easy way to observe them without updating them (more on that below). And so, I had to go through hell to understand what was going on in my serverless function.
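For illustration, here’s a minimal sketch of such a handler. The event shape is a hypothetical assumption and the SMS-provider call is stubbed out; what it shows is that, by default, whatever you remembered to log before deploying is the only visibility you get afterwards:

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    """Hypothetical sketch of the SMS-confirmation Lambda.

    The event shape and the SMS-provider call are illustrative assumptions.
    """
    # Without log lines like this one, there is nothing to inspect after the fact.
    logger.info("incoming event: %s", json.dumps(event))

    phone = event.get("phone")
    if not phone:
        logger.error("missing 'phone' in event")
        return {"statusCode": 400, "body": "missing phone"}

    # In reality: generate a code and send it via an SMS provider here.
    logger.info("sending confirmation code to %s", phone)
    return {"statusCode": 200, "body": json.dumps({"sent": True})}
```

Every branch that can fail logs its input, because once the function is deployed, those log lines are the only window into it without shipping a new version.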
In distributed cloud computing, finding out where an issue came from and which server to debug is not always a trivial task. Microservice architecture, multi-regional clusters, load balancing, thousands of requests per second. Do all these buzzwords sound familiar? Well, imagine how I felt when a customer was sending a badly formatted request. I can tell you this for sure: it was not making things better. Finding out who it’s coming from is easy. The hard part is figuring out where it’s going and intercepting it early enough to get something useful out of it before the server crashes, because the logs just aren’t enough. That’s when you connect a remote debugger to a random server, hoping for the jackpot. Oh, and it’s also when you wish you had added more logs last week.
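One mitigation I wish I’d had in place: validate and tag requests at the service edge, so a malformed payload produces a traceable log line instead of a crash several services downstream. A minimal sketch, where the function name, payload format, and responses are all hypothetical:

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("gateway")

def handle_request(raw_body: str) -> dict:
    """Validate a request at the edge and tag it with a correlation id.

    Hypothetical sketch: the payload format and responses are illustrative.
    """
    request_id = str(uuid.uuid4())  # follows the request through every log line
    try:
        body = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        # Reject (and record) the bad request here, instead of letting it
        # crash a random server deeper in the cluster.
        log.warning("request %s rejected: malformed JSON: %s", request_id, exc)
        return {"status": 400, "request_id": request_id}
    return {"status": 200, "request_id": request_id, "body": body}
```

Grepping for the correlation id across services then answers “where is it going?” without attaching a remote debugger to a random server.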
You can step out of the dev shoes now. That was quite a ride, wasn’t it? Is it any wonder, then, that being on-call is so frustrating for devs? With so much working against them, are we setting our developers up for failure (sooner or later) in dev-on-call? How can we set up the playing field for success?
Being on-call infringes on the developer’s personal time, which isn’t fun for any of us. However, a lot of the negativity around dev-on-call actually comes from the organization and its culture. If the organization doesn’t value the developer’s time and investment, or fails to provide proper incentives and compensation, frustration and resentment will be quick to follow.
You can counter that with a healthy on-call rotation. Make sure you have a supportive team of engineers who have a deep understanding of the system and its architecture. Moreover, make sure they have the best tools available to help them solve issues faster.
Every time an on-call issue is resolved, it must be documented, because it WILL happen again. When it does, you’ll be happy your devs took the time to document it. Teams must understand that when they’re unwilling to document an issue, they’re simply shooting themselves in the foot.
Promote teamwork and good communication within your R&D department. That way, when your devs are stuck and unsure of what to do, they won’t gamble. Knowing their team is fully behind them, your devs will call someone who knows. Sure, it might bother that person, but it will probably save the entire dev team and the company a whole lot of trouble in the long run.
We’ve all heard the good old “the dev on-call before me didn’t tell me about issue X/Y” excuse. Well, that’s now a problem for the current dev-on-call to solve. Motivate your devs to ask questions! The previous dev-on-call may have been too distracted or too tired to document an important issue. It is every dev’s responsibility to keep surprises to a minimum by asking the ones who came before them as many questions as possible.
Finally, encourage developers to learn from the experience of others. Seek and learn from other companies’ dev-on-call War Stories. They might come in handy when your devs run into a similar issue when they’re on-call.
By now everyone knows the basic tools of the SRE/dev-on-call trade: using round-robin scheduling to wake the devs with paging solutions (PagerDuty, Opsgenie, etc.); syncing them on tickets with ticketing systems like Jira and Zendesk; all initially triggered by APM solutions (AppDynamics, Datadog, New Relic, Prometheus, etc.) or exception management tools (like Sentry.io or Rollbar). But what’s the next step? How can technology help us face the remaining challenges of dev-on-call work?
A repeating theme we noticed in the challenges of dev-on-call is access to data. Access to any type of data can make a difference, be it organizational data, operational data, behavioral data, or any other kind. Developers at large, and those who are on-call in particular, require the ability to access data around and within the software and to share it clearly within a team.
Existing platforms, such as the exception management platform Sentry.io, are expanding to add more integrations and team management capabilities, aiming to create better communication around errors and incidents. New solutions like Blameless.com offer experiences tailored to the SRE/dev-on-call team flow, bringing a more systematic approach to both incident and post-mortem data sharing while laying the groundwork for incident automation and AI.
On-the-fly data collection solutions like Rookout provide a platform for retrieving data points, variables, log lines, and metrics from LIVE software, on-demand, with non-breaking breakpoints. This enables devs (on-call and not), DevOps engineers, support SREs, and others to instantly access data in production code and share it with the rest of the team to drill down into the issue.
Rookout connects to other tools like logging, APM, Slack, and more, allowing users to aggregate all necessary data for sharing in the organization's data-sink of choice. With the democratization of data, Rookout empowers multiple personas to take part in dev-on-call, thus making handoffs much easier. And the best part? It’s available for free here.
As the ecosystem matures, the aggregation and sharing solutions are beginning to interconnect. This is clearly seen in the integration between Rookout and Sentry.io in which devs can move directly from an alert to accessing and sharing more data with the team.
The cloud is steamrolling up to your door; can your devs stand the heat? With the complexity of the dev-on-call challenge crystallized, and with the key cultural and technological methods for approaching it in hand, we believe you can fend off the flames.
Got dev-on-call war stories to share? Don’t be shy: shoot us an email at [email protected], and you can have your story immortalized as part of the following posts in this series.