When creating alerts, there are many factors to consider. What metric will be alerted on? At what point do you trigger an alert? How long should the metric be outside of the alerting range before de-escalating? Until recently, though, I’d not put much thought into where the alert is sent. The importance of choosing the destination for your alerts carefully was brought into sharp relief for me last week.
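To make those questions concrete, here is a minimal Python sketch of an alert that answers each of them with one setting: a threshold, a delay before firing, a delay before de-escalating, and a destination. The `QueueAlert` class, its field names, and the `#team-alerts` channel are all hypothetical and chosen for illustration; this is not how Revere models alerts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueAlert:
    """Hypothetical alert definition; every name here is illustrative."""
    threshold: int        # queue depth above which a reading counts as "bad"
    trigger_after: float  # seconds readings must stay bad before firing
    clear_after: float    # seconds readings must stay good before de-escalating
    destination: str      # where the notification is sent
    firing: bool = False
    _since: Optional[float] = None  # when the pending state change started

    def observe(self, depth: int, now: float) -> Optional[str]:
        """Record one reading; return a message only when the state changes."""
        breached = depth > self.threshold
        if breached == self.firing:
            # The reading agrees with the current state, so any pending
            # state change is cancelled.
            self._since = None
            return None
        if self._since is None:
            self._since = now
        hold = self.clear_after if self.firing else self.trigger_after
        if now - self._since >= hold:
            self.firing = breached
            self._since = None
            state = "FIRING" if self.firing else "RESOLVED"
            return f"[{self.destination}] {state}: depth={depth}"
        return None


alert = QueueAlert(threshold=100, trigger_after=60, clear_after=120,
                   destination="#team-alerts")
for t, depth in [(0, 150), (30, 150), (60, 150), (70, 50), (80, 150)]:
    message = alert.observe(depth, now=t)
    if message:
        print(message)
```

With these numbers, only the reading at `t=60` produces a message. The brief dip at `t=70` does not resolve the alert, because the queue has not stayed below the threshold for the full `clear_after` period.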
The Pages product consists of a number of different microservices, some of which are modified frequently, and some of which are simple enough to have been left alone since before I joined the team over two years ago.
Many of these services perform batch processing jobs, receiving work over RabbitMQ. To make sure these services are running smoothly, we have alerts configured in Revere to notify us when the job queues are being processed too slowly and backing up. These notifications were previously all sent via email to a team mailing list. However, following the introduction of a bug that slowed down processing of the main group of jobs, the team started receiving repeated alerts from multiple queues as they went in and out of a “bad” state. Noisy alerts become very easy to ignore, so once the underlying problem was fixed, we decided to try moving these alerts from email to Slack.
While this was mainly done just to alleviate some of the load on our inboxes (I personally hate to see the count of my unread messages), what happened next surprised me a little. A day or two after making the switch, the alert started firing for a service that we hadn’t touched in very nearly three years.
Within a few minutes, several members of the team got together and started working through the code of the offending service, trying to identify what was going wrong. Without any direction, they effectively formed a tiger team to fix the problem behind the alert. This kind of rapid response had never happened with email alerts. In fact, it turned out that this particular alert had fired a couple of times during the previous few weeks, and had been effectively ignored.
There seem to be several possible reasons why this alert had previously been ignored. A major one is that it is very easy to set up filters in Gmail. Most of the team had these alerts going into a label that could easily be ignored until the count of messages became particularly high.
There may also be a psychological effect in having an alert appear in a shared channel. Even though we all knew everyone else got the emails, there’s something more communal about a chat channel. Knowing everyone else can see exactly the same thing makes it a little easier to dive in and feel confident that your ideas or questions about the problem will be seen.
Another benefit is the increased visibility that Slack provides us. Anyone can join our team channel with just a few clicks, and see a full history of all the alerts we’ve been receiving. Knowing that with a couple of clicks the CTO can see how serious you are about handling alerts can keep you on your toes a bit.
The actual reason may be all of these, or none of them. But Slack seems to be working as an alert mechanism for us right now. When creating new alerts, I’ll certainly be putting a lot more thought into where they should be sent to be most effective, and I might even experiment a little with alternatives.