At Yext we store most of our metrics in Graphite. We’ve built internal libraries that wrap StatsD to give our developers an easy way to instrument their code and record metrics. Over time we’ve also built a variety of daemon processes that scrape metrics from key pieces of our infrastructure, such as HAProxy and RabbitMQ. With all of these application- and infrastructure-specific metrics in Graphite, we’ve been able to create tons of insightful dashboards with Grafana. While staring at pretty graphs delights us, nothing pleases us more than being able to resolve production issues before they affect our users. We realized that to achieve this we needed an alerting system that would:
- Leverage existing monitoring infrastructure
- Provide meaningful and actionable alerts
- Have a low-friction, simple UI for users to configure and manage alerts
Announcing Revere, a tool that aims to do this for medium-sized microservice architectures. Revere was developed at Yext in Go and is now open source, so that it can continue to grow to meet the community’s needs.
Why another alerting service?
When we began developing Revere we looked at the alerting landscape and found a wide variety of tools. There were older tools such as Nagios, which proved cumbersome to write checks for and not the simplest to use. At the other end of the spectrum we found newer tools like Prometheus, which had begun to garner a strong following as an all-around monitoring system complete with its own expressive language and data store. While that feature set is certainly impressive, our goal was to find a lightweight solution to plug into our current infrastructure.
Our second requirement was effortless support for our evolving microservice architecture. Our alerting system needed to support templating, in order to provide standard alerts for the many standard metrics we record for each service. We wanted it to automatically set up new standard alerts whenever our engineers create a new service. And yet even within system-wide templated monitors, we wanted alarms to be manageable at the service level, so that silencing known problems in one service wouldn’t mask alarms for new problems in other services.
Revere aims to solve these microservice alerting problems and play well with other existing pieces of our stack. The first release comes with support for Graphite threshold alerting.
High-level overview
Revere consists of two main processes. The first is a web interface that allows users to create and configure alerting rules for their metrics. From it you can also view ongoing alerts and their history, categorize related alerts with labels, and create silences if necessary. The second is a daemon process that monitors the state of the metrics and delivers alerts. It also responds to configuration changes made through the web process. These configurations are stored in a MySQL database.
As a developer wanting to use Revere, I would begin by creating a monitor within the web interface. A monitor contains all of the information required to watch a set of metrics and generate alerts. This set of information includes metadata to describe the metrics being monitored and tips on how to respond to the generated alerts. The README provides a full breakdown of the different fields.
A monitor also contains a probe, the component responsible for querying a datastore for metrics and classifying the streams of data into one of five states:
- Normal - Things operating as expected
- Warning - Something odd is happening
- Error - Something is broken
- Critical - There is a serious problem
- Unknown - An error was encountered when obtaining the metrics
Currently, Revere ships with the Graphite threshold probe. This probe queries Graphite for time series data and classifies it into one of the states using thresholds provided by the user. You can also view a graphical preview of the metrics while configuring.
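To make the idea of threshold classification concrete, here is a minimal sketch in Go. It is not Revere's actual code. It assumes the probe has already fetched the JSON output of Graphite's render API (a list of series, each with `[value, timestamp]` datapoints) and that higher values are worse. The `Thresholds` type, the `Classify` function, the metric names, and the choice to report a series with no usable data as Unknown are all assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// State is one of the five states described above.
type State int

const (
	Normal State = iota
	Warning
	Error
	Critical
	Unknown
)

func (s State) String() string {
	return [...]string{"Normal", "Warning", "Error", "Critical", "Unknown"}[s]
}

// Thresholds is a hypothetical set of user-provided lower bounds.
// A value at or above a bound puts the series in that state.
type Thresholds struct {
	Warning, Error, Critical float64
}

// series mirrors one entry of Graphite's JSON render output:
// {"target": "...", "datapoints": [[value, timestamp], ...]}.
// Values may be null, so they are decoded as pointers.
type series struct {
	Target     string        `json:"target"`
	Datapoints [][2]*float64 `json:"datapoints"`
}

// Classify maps each series in a render response to a state.
func Classify(body []byte, t Thresholds) (map[string]State, error) {
	var all []series
	if err := json.Unmarshal(body, &all); err != nil {
		return nil, err
	}
	states := make(map[string]State, len(all))
	for _, s := range all {
		states[s.Target] = classifySeries(s, t)
	}
	return states, nil
}

// classifySeries looks at the most recent non-null value.
// Assumption: a series with no usable value is treated as Unknown.
func classifySeries(s series, t Thresholds) State {
	for i := len(s.Datapoints) - 1; i >= 0; i-- {
		v := s.Datapoints[i][0]
		if v == nil {
			continue
		}
		switch {
		case *v >= t.Critical:
			return Critical
		case *v >= t.Error:
			return Error
		case *v >= t.Warning:
			return Warning
		default:
			return Normal
		}
	}
	return Unknown
}

func main() {
	// Hand-written sample response; the metric names are made up.
	body := []byte(`[
	  {"target": "example.api.queue", "datapoints": [[3, 1500000000], [12, 1500000060], [null, 1500000120]]},
	  {"target": "example.web.queue", "datapoints": [[null, 1500000000]]}
	]`)
	states, err := Classify(body, Thresholds{Warning: 5, Error: 10, Critical: 20})
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	for target, st := range states {
		fmt.Printf("%s: %s\n", target, st)
	}
}
```

In this sketch a failed parse is returned as an error, which a caller could reasonably map to the Unknown state as well.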
Lastly, a monitor contains triggers, the components responsible for determining when to send an alert. Each trigger has an associated target, the destination for all the notifications it issues. Revere currently supports email as a target type, with more in the works.
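One way to picture the relationship between triggers and targets is the rough Go sketch below. It is an illustration only, not Revere's implementation: the `Trigger`, `Target`, and `EmailTarget` types, the rule that a trigger fires when the state changes to something at or above its level, and the email address are all hypothetical. The stand-in email target records messages instead of sending them, so the example runs without a mail server.

```go
package main

import "fmt"

// State repeats the five probe states so the sketch is self-contained.
type State int

const (
	Normal State = iota
	Warning
	Error
	Critical
	Unknown
)

// Target is a hypothetical notification destination.
type Target interface {
	Notify(subject string) error
}

// EmailTarget stands in for an email destination. It only records
// what it would have sent.
type EmailTarget struct {
	Address string
	Sent    []string
}

func (e *EmailTarget) Notify(subject string) error {
	e.Sent = append(e.Sent, fmt.Sprintf("to=%s subject=%q", e.Address, subject))
	return nil
}

// Trigger pairs a minimum state with the target to notify.
type Trigger struct {
	Level  State
	Target Target
}

// Fire notifies the target when the state has changed and the new state
// is at or above the trigger's level. Assumption: Unknown sorts above
// Critical, so it always meets the level.
func (t Trigger) Fire(monitor string, prev, cur State) (bool, error) {
	if cur == prev || cur < t.Level {
		return false, nil
	}
	return true, t.Target.Notify(fmt.Sprintf("%s changed state: %d -> %d", monitor, prev, cur))
}

func main() {
	email := &EmailTarget{Address: "oncall@example.com"}
	trig := Trigger{Level: Error, Target: email}

	trig.Fire("HAProxy Backend Down", Normal, Warning)   // below the level, no alert
	trig.Fire("HAProxy Backend Down", Warning, Critical) // at or above the level, alert

	for _, msg := range email.Sent {
		fmt.Println(msg)
	}
}
```

Keeping the target behind an interface is what would let other destination types be added later without touching the trigger logic.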
With monitors in place, alerts will be visible on the active issues page. From here, a user can explore the history of specific data streams, create labels to categorize monitors, or silence issues when required.
An example of a monitor we find valuable is our HAProxy Backend Down monitor. This monitor alerts us to any services that have not responded to health checks issued by our load balancer. Teams can set up alerts for the specific services they care about, with triggers placed either directly on this monitor or on team-owned labels associated with the monitor.
We’ve found Revere useful and hope it can be of use to the community. If you’re still alert after reading this post, head over to the README, where you can find some quick start steps and a more detailed look at the different components.