Profiling Schrödinger’s Code
In modern software development and operations, everything can be monitored. This isn’t a matter of technology: if you want to monitor something, you can. However, monitoring comes with a price. Sometimes that price is low, and at other times it is unbearable. For example, an APM tool that monitors your server’s health with CPU and memory metrics is pretty cheap and non-intrusive. However, if you want to profile your application’s performance, and more specifically pinpoint the piece of code that is hurting it, you will have to pay with your own application’s performance.
Isn’t that a bit paradoxical? To analyze and improve your application’s performance, you need to hurt your application’s performance. But it isn’t only a paradox. It’s how many modern profiling tools work: as an “all or nothing” solution. That’s like using a 20-pound sledgehammer to screw in a lightbulb. And while that might make for a great post on r/funny, it’s really not the path you want to be taking. With the right tools, you can monitor with surgical precision instead.
Everybody who has studied a bit of physics, or who has seen enough cartoons in the past decade, knows about Schrödinger’s cat. A simplified (and admittedly loose) version of the thought experiment goes like this: a cat sits inside a closed box and is considered both dead and alive at the same time. How can that be? Inside the box there is a flask of poison, and unless you open the box and take a look, you won’t know whether the cat is dead or alive. But in this retelling, opening the box tilts the flask of poison, which kills the cat. So you can never know whether the cat is dead or alive, and trying to look at the cat’s state will kill it anyway. The paradox, for our purposes, is this: the method that you use to monitor your subject affects your subject’s state.
Application profiling tools remind me of this paradox. If you want to understand which pieces of your code degrade your application’s performance, you must degrade your application’s performance. Many profiling tools not only instrument every piece of your application’s code, but also record every line hit and every function entry and exit. The TL;DR? They monitor everything. And that makes everything slow.
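To see where that cost comes from, here is a minimal Python sketch. It is not any particular vendor’s profiler. It only uses the standard library’s sys.setprofile, which registers a hook that the interpreter calls on every function call and return. The fib workload and the count_events helper are hypothetical names made up for this illustration.

```python
import sys


def fib(n):
    """Deliberately call-heavy workload (hypothetical example function)."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def count_events(func, *args):
    """Run func under a profile hook and count how often the hook fires.

    The hook is invoked by the interpreter on every Python-level call and
    return, which is the kind of bookkeeping a "monitor everything"
    profiler has to do.
    """
    events = {"call": 0, "return": 0}

    def hook(frame, event, arg):
        # Only count Python-level events; C-level events are ignored here.
        if event in events:
            events[event] += 1

    sys.setprofile(hook)
    try:
        result = func(*args)
    finally:
        sys.setprofile(None)  # always remove the hook again
    return result, events


if __name__ == "__main__":
    result, events = count_events(fib, 10)
    print(result, events)
```

Computing fib(10) this way involves 177 function calls, and the hook runs for each of them. Tracing individual lines (for example with sys.settrace) fires even more often, which is why the overhead grows with how much you choose to watch.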
Profiling should be less painful
Let’s all be honest. Investigating performance issues is a nightmare and everybody hates doing it. The first pain is that you rarely notice performance issues while developing. Of course you’ll automate stress tests for your application, but that has its limits. You will find performance issues and fix them in dev, or maybe even in staging, but that’s the easy stuff. Performance issues hit you hard when you least expect them… and that’s in production. It’s not that you don’t expect them per se, but they usually show up when you’ve already moved on to the next task and the code has become a mere memory. Once you hit those issues in production (or, to be more precise, once they hit you or stab you in the back), you’ll have to solve them, and it will usually be nearly impossible to reproduce them in dev. The performance will probably depend on multiple services in production, whether they are your own microservices or third-party SaaS. Sometimes it will be hard to even spin up those services in dev.
The next thing you’ll try to do is understand how to use a profiler in production. After that, you’ll try to understand how to explain to your manager (or customers) why you need to use a profiler in production and degrade the performance. You’ll tell your customers: “Please have some patience, I need to check whether my cat is dead, so I’ll just open up the box”. But they won’t like the aftermath, as they know that opening the box will kill the cat.
When “All or nothing” is not the only choice
When we meet with our customers, whether at an annual feedback session or at the end of a POC, we ask them what else they wish they could use Rookout for. The magic of Rookout is that it shows its users that they don’t need to collect everything; they can collect only the pieces of information that they need. You don’t need to collect millions of logs for fear of missing some data. You can just collect the data when you need it. When our customers told us that they wanted Rookout to help them solve performance issues, we picked up the gauntlet and went back to the drawing board. We decided that our customers should be able to profile only the pieces of code that they want to profile. We wanted to address their pain and provide them with a tool that helps them understand whether their cat is dead or alive, without killing it.
Our goal was to give users the ability to surgically profile their code, anywhere, anytime, and without any performance degradation. We already had Rookout’s basic building block of instrumenting our customers’ code in real time, and that’s pretty much all we needed. When users want to profile their code, all they need to do is activate the profiling mode in their Rookout dashboard.
The next step would be to locate the piece of code that they wish to profile, click on the gutter, and place two profiling points. From then on we will start measuring the time taken between them.
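The idea of two profiling points can be sketched in a few lines of Python. This is only an illustration of the concept, not Rookout’s implementation: Rookout places the points from the dashboard at runtime, while this sketch hard-codes them. SpanTimer, handle_request, and the span name are hypothetical names invented for the example.

```python
import time
from collections import defaultdict


class SpanTimer:
    """Measures elapsed time between a start point and a stop point."""

    def __init__(self):
        self._open = {}                    # span name -> start timestamp
        self.samples = defaultdict(list)   # span name -> durations in seconds

    def start(self, name):
        self._open[name] = time.perf_counter()

    def stop(self, name):
        started = self._open.pop(name, None)
        if started is None:
            return None  # stop point hit without a matching start; ignore it
        elapsed = time.perf_counter() - started
        self.samples[name].append(elapsed)
        return elapsed


timer = SpanTimer()


def handle_request(items):
    """Hypothetical request handler with one measured span inside it."""
    validated = [i for i in items if i >= 0]   # not measured
    timer.start("sum_of_squares")              # first profiling point
    total = sum(i * i for i in validated)
    timer.stop("sum_of_squares")               # second profiling point
    return total
```

Only the code between the two points pays for two clock reads and a list append. Everything else in the process runs untouched.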
That’s it! So simple, so clean. You don’t need to instrument everything, you don’t need to kill your application, and the best part? You don’t need to kill your cat. You can go a step further and adapt your profiling points as you go, measuring times while moving down the stack. Once you understand that the performance issue is in a certain area, you can narrow the measurement to smaller and smaller spans until you pinpoint your issue. You can also set conditions under which the measurement will take place. Maybe there is only one server or one type of request that needs to be profiled? All of this happens in your production environment, and it won’t degrade your entire environment’s performance.
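A conditional measurement can be sketched the same way. The Python below is a generic illustration under our own assumptions, not Rookout’s API: profile_if, handle, and the shape of the request dictionary are all hypothetical.

```python
import time
from functools import wraps


def profile_if(condition, sink):
    """Time the wrapped function only when condition(*args, **kwargs) is true.

    Durations are appended to sink as (function_name, seconds) tuples.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if not condition(*args, **kwargs):
                return func(*args, **kwargs)   # fast path: no timing at all
            started = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                sink.append((func.__name__, time.perf_counter() - started))
        return wrapper
    return decorator


samples = []


# Hypothetical rule: only "checkout" requests are worth measuring.
@profile_if(lambda request: request.get("type") == "checkout", samples)
def handle(request):
    return sum(range(request.get("size", 0)))
```

Requests that don’t match the condition pay only for the check itself (here, one dictionary lookup), so the measurement stays limited to the traffic you are actually investigating.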
Agile flame graphs
Being agile isn’t only about development; it’s also about how you solve problems and monitor your application. Agile development means building your software one step at a time, releasing each step, and learning from it before the next step is developed. We should practice the same approach when solving problems. We don’t need to profile everything, as it will slow everything down and there won’t be any learning cycle. The right way is to profile in an agile manner, either from the bottom up or from the top down. If you start from the bottom, you can rule out small methods that aren’t the root cause of your performance issues and then climb up toward the root cause. If you start from the top, you can find the ballpark of the root cause and then surgically narrow in on it by going down the call stack. Either way, you do it step by step, learning from each iteration and profiling only the parts that are relevant to your investigation.
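One top-down iteration might look like the Python sketch below. It is a simplified illustration, not how Rookout builds its flame graphs: SpanRecorder, the span names, and the checkout function are hypothetical. You measure a parent span and its direct children, ask which child is slowest, and then move your measurement points one level down into that child.

```python
import time
from contextlib import contextmanager


class SpanRecorder:
    """Records nested spans as "parent/child" paths with total durations."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock    # injectable so the recorder can be tested
        self._stack = []       # names of the spans that are currently open
        self.durations = {}    # "a/b/c" path -> total seconds

    @contextmanager
    def span(self, name):
        self._stack.append(name)
        path = "/".join(self._stack)
        started = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - started
            self.durations[path] = self.durations.get(path, 0.0) + elapsed
            self._stack.pop()

    def slowest_child(self, parent):
        """Return the path of the slowest direct child of parent, or None."""
        prefix = parent + "/"
        children = {
            path: secs
            for path, secs in self.durations.items()
            if path.startswith(prefix) and "/" not in path[len(prefix):]
        }
        return max(children, key=children.get) if children else None


def checkout(recorder, cart):
    """Hypothetical handler measured one level deep."""
    with recorder.span("checkout"):
        with recorder.span("load_cart"):
            items = list(cart)
        with recorder.span("price_items"):
            total = sum(price * qty for price, qty in items)
    return total
```

After a few requests, recorder.slowest_child("checkout") tells you which child to open up next. You then add spans inside that child only and repeat, so each iteration measures a handful of spans instead of the whole call tree.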
Don’t open the box, use a webcam
Telling the paradox of Schrödinger’s cat to a Gen Zer (a zoomer) will sometimes get you a chuckle and a simple answer: “Take a webcam and place it inside the box before you close it.” Well, Erwin Schrödinger devised the experiment in 1935, when webcams were quite hard to find. The next time you think about profiling your application in production, think about using Rookout: be kind, don’t kill cats.