A “From the Trenches” Guide - Integrating Datadog with Kubernetes and Python

A few weeks ago we installed Datadog in our staging and production environments.
All in all, it was a smooth ride, with a few small hiccups that we resolved along the way. If you’re about to install Datadog and your environment is similar to ours (with Kubernetes, Python and these other goodies) you should find this post handy.

As a brief introduction, Rookout’s SaaS solution offers Dev/Ops teams some sleek and handy tools for rapid production debugging, including the ability to collect ad-hoc custom metrics and send them to Datadog.

When customer adoption started soaring and we began receiving millions of messages per day from our clients, we figured it was high time to take our SaaS performance and availability monitoring to the next level by adding Datadog to our own setup.

API, Agent, APM

Datadog’s monitoring solution is renowned for its ease of use and friendly pricing. That makes it a perfect match for our needs as an early-stage startup. They offer 3 levels of monitoring capabilities:

  • Monitoring third-party SaaS through an API
  • Monitoring OS and third-party applications using an agent
  • Monitoring application performance (APM)

All three levels are relevant to our business. Each requires a different degree of effort and tweaking to integrate with our existing orchestration tools.

The Rookout Environment

Rookout’s web-facing production environment is built on the following components:

  • Google Cloud Platform (GCP) hosts the whole environment.
  • Google’s DNS and Load Balancing services expose our SaaS to the world.
  • Kubernetes handles most of our orchestration needs. GCP has built-in Kubernetes support (GKE), which works amazingly well. We deploy applications to Kubernetes using Helm (see below).
  • Our application is written in Python on top of Tornado and Flask, allowing us to maintain a rapid pace of development and experimentation.
  • Redis provides us with a reliable, performant datastore out of the box. It runs on dedicated compute instances in a high-availability (Sentinel) configuration.

What the Hell is Helm?

Helm provides useful functionality on top of Kubernetes:

  • Defining applications in a reusable way (called charts)
  • Sharing applications across the Kubernetes community
  • Installing applications on your cluster in a reusable way

At Rookout, our application is defined as a Helm chart and deployed multiple times to the same cluster (production, staging, etc.). We also use Helm to deploy infrastructure services such as Fluentd.

Tips for Smooth Integration:

1. Integrate Datadog with GCP and GKE

Datadog integration with GCP is pretty straightforward and is accomplished by adding a service account with the necessary permissions to your GCP account. Easy-to-follow instructions can be found here. In order to monitor additional elements of GCP (in our case GKE) simply install integrations from the Datadog integration page.

2. Install Datadog Agent on Kubernetes

A ready-to-use Helm chart is available here for the Datadog agent. If Helm is installed you can install the Datadog agent on your current cluster simply by running the following:

helm install --name datadog-agent-v1 \
   --set datadog.apiKey=<DataDog API Key> \
   --set datadog.apmEnabled=true \
   --set daemonset.useHostPort=true \
   stable/datadog

A quick explanation of the command:

  • datadog.apiKey is the API key provided to you by Datadog and can be found here.
  • datadog.apmEnabled configures the Datadog agent to run with APM support.
  • daemonset.useHostPort exposes the Datadog agent to the network using the host’s port.

Note! This super-convenient installation does not create a Datadog agent service on our Kubernetes cluster. Instead, it relies on exposing the host’s port.
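
Because the agent is reachable only through the node's host port, a quick connectivity check from inside an application pod can save some head-scratching. The sketch below is our own illustration, not part of the Datadog tooling: agent_port_open is a hypothetical helper, and 8126 is assumed to be the trace agent's TCP port, so adjust it if your chart values differ.

```python
import socket


def agent_port_open(host, port=8126, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling agent_port_open with the node's IP address from a pod should return True once the agent's DaemonSet is up. A False result usually points at the host port not being exposed or at a network policy in the way.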

3. Install the Datadog APM for Python

This one takes a few steps, so be patient.

Start by adding the PyPI packages for the Datadog APM (ddtrace) and the Datadog SDK (datadog) to your requirements.txt file. While the Datadog SDK is not strictly needed, we’ll put it to good use.
Next, load the Datadog APM and connect it to the Datadog agent. Connecting the APM to the agent’s exposed port can be a bit tricky in our use case, since we do not know the agent’s IP address or hostname in advance.
Fortunately, Datadog solves this problem nicely in their more mature Datadog SDK with a simple, container-oriented configuration. While we can’t use the same configuration for the Datadog APM, we can reuse the same code:

# Imports: the Datadog SDK's route helper and the APM tracer
from datadog.dogstatsd import route
from ddtrace import tracer, patch_all

# Get the Datadog agent's IP address (the container's default route)
hostname = route.get_default_route()

# Connect the APM to the agent
tracer.configure(hostname=hostname)

# Activate the APM
patch_all()
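
To make the default-route trick less magical, here is a minimal, standard-library sketch of the idea. It assumes the SDK helper finds the gateway by reading the Linux routing table in /proc/net/route; parse_default_gateway and the sample table are our own illustration, not the SDK's code. Whether the gateway is actually the node that runs the agent depends on your cluster's networking, so treat this as a way to see what address you would get.

```python
import socket
import struct


def parse_default_gateway(route_table):
    """Return the default gateway IP from /proc/net/route-style text, or None."""
    for line in route_table.splitlines():
        fields = line.split()
        # Skip the header row, malformed lines, and non-default routes.
        if len(fields) < 3 or fields[1] != "00000000":
            continue
        # The gateway column is a little-endian, hex-encoded IPv4 address.
        return socket.inet_ntoa(struct.pack("<L", int(fields[2], 16)))
    return None


# Hypothetical routing table, shaped like the contents of /proc/net/route.
sample = (
    "Iface\tDestination\tGateway\tFlags\n"
    "eth0\t00000000\t010011AC\t0003\n"
    "eth0\t000011AC\t00000000\t0001\n"
)
print(parse_default_gateway(sample))  # prints 172.17.0.1
```

On a real container you would pass in the text read from /proc/net/route instead of the sample string.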

4. Configure Environment Name

The Datadog APM handles environment variables inconsistently. Some take effect only when the APM is run from the command line, and quite often they aren’t properly documented.
DATADOG_ENV is one such environment variable, so if we want it to take effect, we must apply it manually (copied from here):

import os
from ddtrace import tracer
if 'DATADOG_ENV' in os.environ:  # tag every trace with the environment name
    tracer.set_tags({"env": os.environ["DATADOG_ENV"]})
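
If you prefer to keep this logic testable without a running tracer, one option is to separate building the tags from applying them. This is a small sketch of that idea; env_tags is a hypothetical helper name of ours, and treating a blank value as unset is our assumption rather than Datadog's behavior.

```python
import os


def env_tags(environ, var_name="DATADOG_ENV"):
    """Build the tag dict for the tracer; empty when the variable is unset or blank."""
    value = environ.get(var_name, "").strip()
    return {"env": value} if value else {}


tags = env_tags(os.environ)
# With ddtrace loaded, you would then apply them with:
#     if tags: tracer.set_tags(tags)
```

Passing a plain dict instead of os.environ makes the helper easy to exercise in unit tests for each deployment (production, staging, and so on).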

5. Add Web Framework Support

To add web framework support, update the patch_all command to the following:

patch_all(tornado=True, flask=True)

6. Fix Call to Request Handler on Finish

Flying colors? Not quite yet. After setting this configuration (which works perfectly!) we encountered an underlying Tornado bug.

The tornado.web.FallbackHandler is the recommended way to use WSGI containers in Tornado applications. However, it did not properly call RequestHandler.on_finish, which the Datadog APM uses for tracing. As a quick workaround, we subclassed FallbackHandler:

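import tornado.web

class MyFallbackHandler(tornado.web.FallbackHandler):

    def prepare(self):
        # Run the fallback as usual, then call on_finish explicitly so the APM can close its trace
        super(MyFallbackHandler, self).prepare()
        self.on_finish()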
And used it to call the WSGIContainer:

import tornado.web
from tornado.wsgi import WSGIContainer

application = tornado.web.Application([
    (r'.*', MyFallbackHandler, {'fallback': WSGIContainer(wsgi)})  # wsgi: your WSGI app object
])

Wrapping it Up

As a DevOps expert, you’ve probably had the sometimes dubious pleasure of installing products. You know that it can get tricky at times, so tricky that you might be tempted to abandon the installation and just do without the product.

It’s important to remember that the tips, tricks, and workarounds that you develop to overcome these challenges are valuable resources. Be generous about sharing them, and check around carefully for smart tips and tricks like the ones we shared here.

At Rookout, we’re delighted to be working with amazing resources and solutions and will keep sharing the tips we develop to make integrations as smooth and easy as they can possibly be. We look forward to hearing great tips from our partners as well!

Wishing you a smooth integration :)
