Software incidents can be extremely stressful, and that stress slows response and recovery. This post explores ways to reduce it, so your team can respond calmly when the unexpected happens.
We’ve all been there. Your site’s gone down, every page returns a 500 and everyone starts yelling. The customers yell at the account managers, the account managers yell at the developers, and the developers yell at each other, trying to pass blame around.
Everyone is freaking out, but nobody’s really communicating. People are either afraid to touch anything, or change things at random without letting anyone know. Someone asks how long until resolution, and nobody knows for sure.
Hours later, things suddenly start working again. Nobody really knows which change fixed it, or what caused the problem in the first place.
Calm Under Pressure
Some people have a reputation for being calm under pressure. One such person is Commander Chris Hadfield, a retired astronaut who flew on the Space Shuttle, commanded the ISS and became famous for performing a Bowie song in orbit.
In 2001, Hadfield was embarking on a spacewalk when the defogging solution on his visor somehow made its way into his eye. His eye started to water, and the tears collected around his eye, rather than running down as they would on Earth. Before long, he could no longer see out of that eye. The tears then spread to his other eye, rendering him effectively blind.
Imagine that: you’re hundreds of miles above the Earth, travelling thousands of miles an hour, outside the safety of your spacecraft. I’m pretty sure I’d freak out. But Commander Hadfield kept his cool. Not because he’s a naturally calm and collected person, but because of his training, practicing what to do in a crisis hundreds of times before even getting close to a spacecraft. And because of the people he worked with, who he’d practiced alongside.
Hadfield called down to Mission Control and explained his situation, and they set about finding a solution. Finally, they recommended that he vent air from his suit. And he did it without question, trusting them with his life. This caused the tears to evaporate, and his vision quickly cleared enough for him to complete the spacewalk.
Even in a completely unexpected situation, Commander Hadfield and his colleagues stayed calm by having a plan, working as a team and building up a wealth of experience through practice.
A Well Handled Incident
Now let’s get down to Earth and look at an example of a well-handled software incident. This story is fictional, set in an e-commerce company, but it does touch on the processes we use at Yext.
Best practices are rarely “one size fits all”, and what works for us may not be the best fit for your organization. So I’ll be focusing on the “why” rather than the “what” of the process.
Anyone can trigger an incident
Alice, a Product Manager, is experimenting with the checkout flow, when she sees a 500 Server Error on the “view order” page. She refreshes a couple of times, but sees it consistently.
Noticing that no alerts have been triggered, she posts a message in Slack that automatically opens an incident and notifies everyone in the #incidents channel.
No matter how many automated alerts you have in place, there's always a chance something will slip through. Anyone should feel empowered to raise an incident when they spot a problem.
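The mechanics of "anyone can trigger an incident" can be as simple as a chat command that opens an incident record and announces it. Here's a minimal, hypothetical sketch (the `Incident` record, `open_incident` function and announcement format are invented for illustration, not Yext's actual tooling):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from itertools import count

# Hypothetical incident counter -- a real system would persist IDs.
_ids = count(1)

@dataclass
class Incident:
    id: int
    reporter: str
    summary: str
    opened_at: datetime
    status: str = "open"

def open_incident(reporter: str, summary: str) -> tuple[Incident, str]:
    """Create an incident record and the announcement to post in #incidents."""
    incident = Incident(
        id=next(_ids),
        reporter=reporter,
        summary=summary,
        opened_at=datetime.now(timezone.utc),
    )
    announcement = f":rotating_light: INC-{incident.id} opened by {reporter}: {summary}"
    return incident, announcement

# Anyone can call this -- no on-call rota or permissions check required.
incident, msg = open_incident("alice", "500s on the view-order page")
print(msg)
```

The important design choice is the absence of gatekeeping: there is no approval step between spotting a problem and opening an incident.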
Keep everyone informed
Carol, the Tech Lead on the ordering team, sees the alert and takes on the role of Incident Commander. She pulls in Bob - an engineer from her team - to assist and together they identify a change that is likely to have caused the problem.
Everyone communicates through the common #incidents channel, and keeps everyone else up to date on what they’ve discovered, what has been tried and what has been ruled out. This way, everyone is on the same page, there are no duplicated efforts, and all stakeholders can see that the problem is being worked on.
The team's first priority is to get the system back into a good state as quickly as possible. In this case, Carol and Bob conclude that this can be achieved by rolling back the bad change. Carol triggers a rollback, freezes deployments and continues to monitor the situation to make sure that the errors cease.
This gives Bob time to work on a forward fix for the original problem without excessive pressure to move quickly, which reduces the likelihood of further mistakes.
The story doesn’t end when Bob pushes a fix. It’s important to learn from these experiences, reinforcing what went well and improving what didn’t.
The day after the incident, Carol writes a Postmortem document that describes the timeline of the incident and identifies the successful and less successful aspects of the response. It also identifies several clear action items to be followed up over the next few sprints.
A draft is distributed to the team for review, and following a round of comments, approved. The incident is then added to the agenda for a monthly incident review meeting. During these meetings, all Incident Commanders gather to discuss any incidents that still have outstanding action items. This ensures there is follow-through and the action items actually get, well, actioned.
Practice, Practice, Practice
The response to this incident was not enabled through documentation, automation or pure luck. It was the product of experience. Not just from real incidents, but - as in our story of Commander Hadfield’s spacewalk - through frequent practice in a safe environment.
This practice takes many forms, but is all intended to build up “muscle memory” so that when a real incident occurs, an engineer knows exactly where to start and how to proceed. There is also an emphasis on increasing the pool of experienced incident responders, so that the absence of a key engineer doesn’t make an incident worse.
Failover and rollback procedures are frequently executed in drills, both to ensure that they continue to work, and to give engineers at all levels experience in running them.
When a real incident does occur, best efforts are made to have engineers “shadow” one another to observe how the response is conducted, so that over time they can take on more and more of it.
Finally, diagnosis is rehearsed through simulated incidents and “game days”. These provide practice both in critical thinking and in running through the best practices the company recommends for response. This is the most difficult of the three, as crafting a realistic incident can be time consuming and complex. The key is to ensure you have a chance to run through your complete process, even if the problem in question isn’t especially realistic or likely.
The incident response practices we follow at Yext are heavily based on Google’s Site Reliability Engineering book (chapters 12-15). We also recommend “What happens when the pager goes off?” from Increment.
Keeping cool during a crisis is largely down to how well prepared you are. Having a plan and rehearsing that plan frequently can turn a major outage into a minor blip.