Make Mistakes, Recover Fast
Making mistakes is a part of life. It’s how we learn and grow. However, many people – and teams – struggle with the aftermath of a mistake, and recovering quickly can be challenging. As managers, we would love to spare our teams the mistake altogether and take them straight to the point that follows it: the point where they fully understand what to do and how to act, and have learned the best way to handle the situation.
We can do everything in our power to get them there: put the proper workflows in place, give them the best tools, and offer all the support and guidance in the world. And yet mistakes still happen, and some of them we make ourselves.
We understand that mistakes happen. Sometimes they happen because, well, we’re human. Sometimes it’s because we’re focusing our efforts elsewhere – moving fast, creating complex tech, etc. Whatever the reason, we need to let our teams – and ourselves – learn from these mistakes and focus on making sure that as little as possible is impacted by the mistake.
This may seem quite obvious, but if we actually think about it, much of what we do is powered by the fear of making a mistake. We all know that a mistake in software development carries much bigger repercussions than forgetting to turn on the dishwasher. We’re looking at downtime, impacted customer service, negative customer experience, wasted resources, and more. None of these are simple, and none should happen.
So let’s take a look at the top 5 ways we can ensure fast recovery from mistakes.
#1 – Upgrade your tools and environments to support a production-first mindset
The mindset of putting production first allows you to identify issues early. It involves constantly monitoring the production environment for issues and proactively identifying potential problems. By identifying issues early, organizations can take action to resolve them before they become major incidents.
Adopting this mindset also allows you to respond quickly and prioritize reliability. With the right tools and processes in place – such as response plans, playbooks, and production-grade tools – teams can respond to incidents as they arise and resolve them quickly. They can also prioritize reliability over other considerations, such as feature development. This helps their systems remain resilient and recover quickly from incidents.
#2 – Your SRE team is great, but they’re not enough
When the going gets tough, you also need developers who understand the production environments you’re working with and can fully own them – all the way from dev through staging and up through production.
While this one may sound the scariest (we’ve heard quite a number of developers over the years tell us, “what do you mean debug in production? I don’t even have access! I’m not allowed to touch it”) – it’s the best move you could make. It’s paramount to effective software development that developers understand how their code is deployed and runs in their production environment. This includes understanding the infrastructure, dependencies, and other factors that can affect the performance and reliability of the application.
Connecting developers to production gives them the ability to quickly identify and troubleshoot issues that arise in the production environment. This can lead to faster resolution times and reduce the impact of incidents on customers.
#3 – Measuring success the right way
Often it’s not the mistake itself that matters, but how you recovered from it. Or better yet, was the recovery a failure or a success? Was the incident response effective? How many deployments result in failure? How is your performance? How reliable is your service? Were your customers impacted? Answering these questions requires measurement.
To begin with, measure key metrics such as MTTR, change failure rate, service level objectives, and customer satisfaction. Tracking them over time helps drive continuous improvement of your system.
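As a rough illustration, two of these metrics can be computed from basic incident and deployment records. The sketch below is only one possible reading: it treats MTTR as mean time to recovery, measured from detection to resolution, and change failure rate as the share of deployments that caused a failure. The `Incident` and `Deployment` record shapes and the sample data are hypothetical and not tied to any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record: when the incident was detected and when it was resolved.
    detected_at: datetime
    resolved_at: datetime


@dataclass
class Deployment:
    # Hypothetical record: when the deployment went out and whether it caused a failure.
    deployed_at: datetime
    caused_failure: bool


def mean_time_to_recovery(incidents):
    """Average time from detection to resolution across incidents."""
    if not incidents:
        return timedelta(0)
    total = sum((i.resolved_at - i.detected_at for i in incidents), timedelta(0))
    return total / len(incidents)


def change_failure_rate(deployments):
    """Fraction of deployments that caused a failure (0.0 to 1.0)."""
    if not deployments:
        return 0.0
    failures = sum(1 for d in deployments if d.caused_failure)
    return failures / len(deployments)


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2023, 1, 5, 10, 0), datetime(2023, 1, 5, 10, 30)),
    Incident(datetime(2023, 1, 9, 14, 0), datetime(2023, 1, 9, 15, 30)),
]
deployments = [
    Deployment(datetime(2023, 1, 4, 9, 0), False),
    Deployment(datetime(2023, 1, 5, 9, 0), True),
    Deployment(datetime(2023, 1, 8, 9, 0), False),
    Deployment(datetime(2023, 1, 9, 9, 0), False),
]

mttr = mean_time_to_recovery(incidents)
cfr = change_failure_rate(deployments)
print(f"MTTR: {mttr}, change failure rate: {cfr:.0%}")
```

With this sample data, the two incidents took 30 and 90 minutes to resolve, so the MTTR is one hour, and one failed deployment out of four gives a change failure rate of 25%. Teams define these metrics differently (for example, starting the clock when the incident began rather than when it was detected), so the definitions should be agreed on before comparing numbers over time.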
#4 – Account for human error
Let’s be honest. Despite our best efforts, humans are not infallible. This is exactly why you need to account for human error. You can do so in several ways.
Build a blameless culture that encourages team members to admit to mistakes and take ownership of them without fear of punishment or retribution. This promotes transparency and accountability and allows teams to learn from mistakes and improve their processes.
Provide training and support for your team members: train them on best practices, give them tools and resources, and promote work-life balance to reduce burnout. Additionally, automating repetitive tasks or processes can help reduce the likelihood of human error. Look for opportunities to automate tasks that are error-prone or time-consuming.
#5 – Second and third times aren’t the charm
Don’t repeat your mistakes. The most important way to recover quickly from mistakes is to learn from them. The two best ways to ensure that you don’t repeat them are focusing on continuous improvement and conducting post-incident reviews.
A culture of continuous improvement ensures that teams are constantly seeking to improve the performance and reliability of their systems. This includes conducting post-incident reviews to identify the root cause of issues and implementing process improvements to prevent similar incidents from happening in the future.
So stop losing sleep rehashing how you could have done things differently, caught the issue earlier, etc. Move forward and embrace those mistakes – and make sure your team does too. Just don’t forget to have a plan in place for how you deal with them.
And when it comes to bugs that pop up in production, we believe the easiest way to recover from them is using a tool that will instantly allow you to get debug data. And you know who to come to for that 😉