It all started when I was testing Rookout on Docker (with Alpine and Python).
Rookout is a new approach to collecting and pipelining data from live code. It lets developers request any piece of data with just a few clicks and view it on their own machine, regardless of the framework, cloud, or environment they use.
So, there I was, testing our application, when I realized that some commands were working just fine while others were simply being ignored.
The situation seemed pretty odd. Why would the container respond to some commands and ignore others? I started by SSHing into the host and running a few basic Docker commands, such as “docker ps”, to see which containers were running.
I noticed that the agent had restarted 139 times (!) while being up for only one second, which led me to believe it was crashing over and over again.
Running “docker events” confirmed my suspicion.
I immediately noticed that the ‘demo_agent_1’ container was dying whenever certain commands were sent, and that Docker was silently restarting it. Now the question was: why was the container dying?
I ran “docker logs” but it wasn’t informative enough to shed light on my problem, so I decided to get inside the container to manually run the process.
This time, I got some additional information: A segmentation fault was crashing the process and causing the Docker container to exit.
There are two ways to debug such a crash under Linux: attach a debugger such as GDB to the live process, or load the core dump the crash leaves behind. Both provide similar results.
Note: If you can’t find a core dump file, follow this link to make sure one is generated and to figure out where it’s located.
Whether we use live debugging or load the core dump file, we can see exactly where the segfault happens: Python/getargs.c:1476
SIGSEGV (segmentation fault) is a Unix signal indicating that our application performed an invalid memory access. Invalid memory access is usually triggered by one of two conditions: a bug in the application, such as dereferencing an invalid (for example, NULL) pointer, or exceeding the memory allocated for the thread’s stack, better known as a stack overflow.
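If you have never seen a segfault up close, here is a deliberately broken, throwaway snippet (purely illustrative, unrelated to our code) that triggers the first kind of invalid access by reading from a NULL pointer through ctypes:

```python
import ctypes

# WARNING: this crashes the interpreter with SIGSEGV on most platforms.
# ctypes.string_at(0) tries to read a C string from address 0 (NULL).
ctypes.string_at(0)
```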
We know the exact location of the invalid memory access from GDB. Let’s take a look at the code (GitHub):
You can clearly see that line 1476 is the function’s opening curly brace, which is where the function’s stack frame is set up. That can only mean one thing: we have a stack overflow :(
But wait a second! If our call stack is so shallow, how did it overflow already? As you may know, Alpine Linux is based on musl libc to minimize resource usage (especially memory). If you dig around the web long enough, you’ll eventually realize that musl was originally designed for embedded systems, and one of its quirks is a very small default thread stack size of 80KB. Python is CPU and memory hungry and was designed to use much larger stacks, even for very simple scripts.
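Out of curiosity, you can check what your environment reports from Python itself. This quick diagnostic sketch won’t reveal musl’s built-in pthread default (that value is compiled into the libc); it only shows the process-level limit and whatever Python will request when it creates new threads:

```python
import resource
import threading

# The process-level stack limit (applies to the main thread), in bytes.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("RLIMIT_STACK soft/hard:", soft, hard)

# The stack size Python will request for *new* threads.
# 0 means "use the platform default" -- on Alpine, musl's tiny default.
print("threading.stack_size():", threading.stack_size())
```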
Fortunately, Python’s threading.stack_size can be used to increase the stack size the OS allocates for newly created threads.
This is how I fixed it:
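Since only threads created after the call pick up the new size, the trick is to set it first and then hand the real work off to a fresh thread. Here is a minimal sketch of the workaround, with main() standing in for the actual application entry point:

```python
import threading


def main():
    # Stand-in for the real application entry point.
    print("running with a larger stack")


if __name__ == "__main__":
    # Request a 2 MiB stack for threads created from this point on;
    # musl's default (~80KB) is too small for CPython.
    threading.stack_size(2 * 1024 * 1024)

    # The main thread keeps its original stack, so the actual work has to
    # run in a new thread that is created with the larger stack.
    main_thread = threading.Thread(target=main)
    main_thread.start()
    main_thread.join()
```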
When using this workaround, pay attention to the following caveats:
- threading.stack_size only affects threads created after it is called; it does nothing for the main thread or for threads that already exist, which is why the application logic has to run in a new thread.
- An invalid size raises a ValueError (32KB is the minimum on most platforms), and platforms that don’t support changing the stack size raise a RuntimeError.
- Every thread created afterwards reserves the larger stack, so you are trading a bit of memory for stability.
In the months between encountering this bug, working around it, and publishing this blog post, the bug was fixed in the latest official Python Docker images and in the Python APKs for Alpine 3.7 and Python 2. If you haven't upgraded yet, here's yet another reason to do so. If for some reason you cannot upgrade, feel free to use the above code snippet as a workaround.