Monitoring local Kubernetes services with Tilt

A system that you want people to use should be easy to use correctly. For infrastructure, that often means consistency in addition to well-designed interfaces. If a production environment is consistent with development, you can remove an entire class of errors that result from inconsistencies and catch more errors in development before they can cause harm in prod. Unfortunately, unless there is an explicit attempt to keep production and development consistent, they will start off different and drift further apart.

This blog post is about my journey to overcome the default state of the world and make monitoring in Kubernetes consistent between production and development environments. Once I had achieved consistency, I was immediately faced with another challenge: everything needs to be easy to use if you want adoption from your peers. “Easy to use right” doesn’t just mean “hard to use wrong”. With the help of Tilt, a Kubernetes development tool, I managed to make it easy and drive adoption.

The Call to Adventure

This story starts before I realized that consistent monitoring environments were something I wanted. Many months ago, an engineering team in charge of publishing webpages at Yext started reporting failures. My team owns the server the errors were coming from, and once they contacted us it became clear very quickly that there was a mystery at hand. These errors were real, but they were invisible in our dashboard.

My team had already noticed similar errors being reported. We investigated and determined that the errors were caused by a backend system running out of memory and restarting. We triaged with the assumption that they were infrequent and weren’t blocking any work. Both assumptions turned out to be incorrect. Our dashboard, which had been designed to provide snapshots into the health of the system, was completely ignoring the server most impacted, a server making many requests to the restarting backend. The end result was that we were unable to publish updates to the webpages it was responsible for.

We have a graph that tracks the errors seen and color-codes them by the server they originate from. Here is the graph we saw.

[Image: WithoutErrors]

Here is the same graph with the most affected server included.

[Image: WithErrors]

With the scope of the errors revealed, it was clear we needed to take immediate action. We increased the memory of the backend system, and got to work improving the memory performance of the backend server.

So what happened? How did a dashboard designed to provide a holistic view have such a large blindspot?

The first and most obvious answer is a classic software blunder. The servers being monitored were selected from a hardcoded list. To save time, I had made an assumption: the servers using our publishing service won’t change over time, and if they do, this list will be updated. Unbeknownst to me and my team, new servers had been added, leaving a blindspot.

The immediate fix to our blindspot was as easy as it is obvious. Don’t hardcode your servers! We replaced the hardcoded list in Grafana with a query, and our dashboard lit up.
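To make the difference concrete, here is a minimal Python sketch of the same idea outside Grafana. It assumes a Prometheus server reachable over its standard HTTP API (the /api/v1/label/<name>/values endpoint). The `job` label, the server names, and the helper function names are all hypothetical and are not taken from our actual setup.

```python
import json
import urllib.request

# Hypothetical hardcoded list, standing in for the one the dashboard used.
HARDCODED_SERVERS = ["pages-publisher", "pages-preview"]


def parse_label_values(payload: str) -> list[str]:
    """Extract label values from a Prometheus label-values API response."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {body.get('error', 'unknown error')}")
    return sorted(body["data"])


def fetch_servers(base_url: str, label: str = "job") -> list[str]:
    """Ask Prometheus which values of `label` exist (the label name is an assumption)."""
    url = f"{base_url}/api/v1/label/{label}/values"
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_label_values(response.read().decode("utf-8"))


def unmonitored(hardcoded: list[str], discovered: list[str]) -> list[str]:
    """Servers that report metrics but are missing from the hardcoded list."""
    return sorted(set(discovered) - set(hardcoded))
```

Comparing the two lists with unmonitored(HARDCODED_SERVERS, fetch_servers(base_url)) would surface exactly the kind of server our graph was missing. Deriving the list from a query removes the need for the comparison altogether.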

Everything is working! It’s tempting to think that our work is done. After all, now you can add as many servers as you want, and they will appear in a central dashboard. However, another question remains. How did we get into this situation in the first place? Are there more pitfalls ahead, just out of sight?

Which brings us to the underlying problem: it can be anywhere from difficult to impossible to test your monitoring infrastructure locally. I hardcoded server names because I didn’t have an incentive to have my graphs work for non-production environments. If there had been a best practice of developing graphs locally, I would have seen the need for a query much sooner.

The Crossing of the First Threshold

A little while later, I found myself working on greenfield development for a Kubernetes project. Specifically, I was tasked with setting up initial monitoring and designing a system that would enhance visibility into the health of services.

This was my chance! I could create an environment where developers can use monitoring in local development, in the same way they would in production! My journey towards consistent infrastructure had begun.

The Road of Trials

At first glance, this seems like a dream task for a Kubernetes cluster. The version of each tool can be pinned through its Docker image. Configurations can be specified with Kustomize and loaded in through ConfigMaps. Kubernetes clusters can be run locally without much effort.

[Image: ExampleQuotesLoki]

Before too long I had a prototype up and running. My Kustomize directories included multiple related YAML files for convenience.

Here is what the kustomization file for Prometheus looks like.

resources:
 - kube-state-metrics.yaml
 - prometheus.yaml
 - prometheus-operator.yaml
 - prometheus-web.yaml
 - alerts.yaml

All that other developers needed to do to run monitoring locally was to follow these few simple steps:

  1. Verify that you are in the right cluster
  2. Launch Prometheus
    • kubectl apply -k /path/to/kustom/prometheus
  3. Launch Loki
    • kubectl apply -k /path/to/kustom/loki
  4. Launch Promtail
    • kubectl apply -k /path/to/kustom/promtail
  5. Launch Grafana
    • kubectl apply -k /path/to/kustom/grafana
  6. Port forward Grafana
    • kubectl port-forward -n monitoring svc/grafana-service 3000
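The steps above can also be written down as data. The following is only an illustrative Python sketch, not part of the workflow I shipped; build_commands and run_all are hypothetical names, and step 1 (checking the cluster) is deliberately not covered.

```python
import subprocess

# Root of the Kustomize configs; mirrors the placeholder path used in the steps.
KUSTOMIZE_ROOT = "/path/to/kustom"
COMPONENTS = ["prometheus", "loki", "promtail", "grafana"]


def build_commands(root: str = KUSTOMIZE_ROOT) -> list[list[str]]:
    """Return the kubectl invocations for steps 2 through 6, in order."""
    commands = [["kubectl", "apply", "-k", f"{root}/{name}"] for name in COMPONENTS]
    # The port-forward stays in the foreground until interrupted, so it goes last.
    commands.append(
        ["kubectl", "port-forward", "-n", "monitoring", "svc/grafana-service", "3000"]
    )
    return commands


def run_all(dry_run: bool = True) -> None:
    """Print each command, and execute it unless dry_run is set."""
    for command in build_commands():
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
```

Even wrapped like this, the cluster check is still left to whoever runs it.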

My achievement was not met with the excited fanfare I was expecting. It didn’t get a response at all. No one seemed to dislike it. It was worse than that: no one had tried it at all.

The Vision Quest

I was missing something important. My vision of Kubernetes developers monitoring services locally was looking like a far off fantasy if I couldn’t get anyone to even try it. I would tell someone that it would work great for their use case, and all I would get back was a noncommittal shrug.

I had achieved my goal, and reached the end not with a bang but with a whimper. The whole project was hollow without use, destined to be lost and forgotten. It was in this moment that I learnt an important lesson about making utilities for others: if it’s not easy to use, it might as well not exist. Never use six steps when a single step will do.

The Boon

So I looked for ways to simplify the deployment of Kubernetes services. As luck would have it, my manager had been looking into the same thing, and pointed me to Tilt.

Tilt makes deployments to Kubernetes easier to manage by handling configurations that span more than a single directory. This is enormously helpful: related components can be deployed together while their configurations stay separate.

Here we see Tilt’s UI, which gives us a single place to view the status of our many Kubernetes objects.

[Image: TiltBrowser]

The Magic Flight

Armed with Tilt, we can collapse the previous six steps, repeated here for convenience:

  1. Verify that you are in the right cluster
  2. Launch Prometheus
    • kubectl apply -k /path/to/kustom/prometheus
  3. Launch Loki
    • kubectl apply -k /path/to/kustom/loki
  4. Launch Promtail
    • kubectl apply -k /path/to/kustom/promtail
  5. Launch Grafana
    • kubectl apply -k /path/to/kustom/grafana
  6. Port forward Grafana
    • kubectl port-forward -n monitoring svc/grafana-service 3000

Step 1 was previously manual work that could be accidentally omitted. Tilt includes utilities such as allow_k8s_contexts, which restricts the Kubernetes contexts Tilt is allowed to deploy to and so keeps configurations away from remote clusters.

allow_k8s_contexts(['docker-desktop', 'docker-for-desktop', 'minikube'])
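For comparison, here is roughly what the manual version of that check looks like as a Python sketch. This is not how Tilt implements allow_k8s_contexts; it only illustrates the guard that step 1 asked people to perform by hand. The function names are hypothetical, and the only external call is the real kubectl config current-context command.

```python
import subprocess

# The same local contexts that are passed to allow_k8s_contexts above.
LOCAL_CONTEXTS = frozenset({"docker-desktop", "docker-for-desktop", "minikube"})


def is_local_context(context: str, allowed: frozenset[str] = LOCAL_CONTEXTS) -> bool:
    """Return True when the given kubectl context is on the allow list."""
    return context.strip() in allowed


def current_context() -> str:
    """Read the active context from kubectl."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def ensure_local_cluster() -> None:
    """Abort before anything is applied if the active context is not local."""
    context = current_context()
    if not is_local_context(context):
        raise SystemExit(f"Refusing to deploy: '{context}' is not a local cluster")
```

A guard like this only helps if everyone remembers to call it. Declaring the allowed contexts in the Tiltfile makes the check part of the deployment itself.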

Steps 2-5 are the bread and butter of Tilt, handled by the k8s_yaml and kustomize functions.

k8s_yaml(kustomize('./path/to/kustom/prometheus'))
k8s_yaml(kustomize('./path/to/kustom/loki'))
k8s_yaml(kustomize('./path/to/kustom/promtail'))
k8s_yaml(kustomize('./path/to/kustom/grafana'))

Step 6 can be done automatically in Tilt with a k8s_kind call to tell Tilt about the CRD, and a k8s_resource call to specify the port forward.

k8s_kind("Grafana")
k8s_resource('grafana', port_forwards='3000:3000', extra_pod_selectors={'app': 'grafana'})

With a Tiltfile for monitoring, six steps have been combined into one.

  1. cd $REPOSITORY_ROOT && tilt up monitoring

Balance Between Two Worlds

Now when I tell people to try out local monitoring, they follow through and tilt up. Improving usability has made a huge difference. Only time will tell if enabling developers to run monitoring infrastructure locally will result in better dashboards. Until then, I will continue crusading to make local monitoring as easy as possible.