Managing a crisis in software development teams

December 17, 2020

It's 10:30 on a warm morning in the ACME offices. Teams have finished their respective standup meetings and all developers have started working on their daily tasks. Some of them have even made a few commits and deployed some new versions of their services.

Suddenly, the A team receives several messages on Slack: one of their customers is freaking out because many users are complaining about a login issue. Nobody can log into the application; it's peak time, and 70% of sales depend on days like this one.

First rule: show the right meta-skills

Gustavo Razzetti defines a meta-skill as “a master skill that magnifies and activates other skills. A meta-skill is a high order skill that allows you to engage with functional expertise more effectively. It’s a catalyst for learning and building new skills faster.”

For me, a meta-skill is simply an attitude that should underpin everything you do. To manage this crisis, the A team will need to act promptly, but also calmly, confidently and with constant communication. Without these meta-skills, the situation could get much worse.

Let's see what else they can do to deal with this crisis effectively.

Second rule: bring the right people in

After getting together and acknowledging this as a Priority 1 issue, the whole team stops working to focus on the problem. James, the Team Lead, calmly briefs the team on what is going on, explaining:

  • the symptoms that are known
  • the impact on the users and the business
  • the potential risk of an even worse situation

Knowing that too many people could create too much noise, and that too much noise could lead the team down the wrong path (or, even worse, disperse the focus of the investigation), James asks only the developers who have been working on the affected area to join the crisis team and start investigating. Everyone else should continue with their usual work.

Third rule: communicate x10

James knows that a crisis can create a lot of anxiety for everyone. He also knows that being extremely communicative in these cases helps to reduce the levels of stress, since stakeholders:

  • are aware of the progress
  • can assist with anything that is needed
  • can communicate the status to their respective stakeholders and teams, reducing the overall anxiety in the organisation and diminishing the pressure on the development team

So he creates a dedicated Slack channel called war room, inviting the involved developers plus the relevant stakeholders.

Fourth rule: put out the fire

If your house were on fire, what would be the first thing you'd do? James knows the first thing his team needs to do is apply a quick (and even dirty) fix. This is important to contain the impact on the business, even if the main problem and its root cause are still there.

After a bit of research, one member of the team spots a recurring error in the logs: one of the services is unable to cope with the load and the database CPU is at 99%. They decide to do three things:

  • Scale up the database temporarily to buy some time. The errors go away, but scaling a database indefinitely is not an acceptable solution, and it hides the real issue.
  • Switch on a feature flag they had previously developed that shows a banner informing users about the current issue and the team's work on a fix (see the sketch after this list).
  • Inform stakeholders of all their findings and their plan.
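The story doesn't describe the team's flag tooling, so here is only a minimal sketch of what such a switch could look like: a server-side flag, with a hypothetical name and read from an environment variable purely to keep the example self-contained, that tells clients whether to show an incident banner without requiring a new deployment. A real setup would read it from a flag service or config store that can be flipped at runtime.

    import json
    import os


    def get_incident_banner() -> dict:
        """Return banner configuration for the client, driven by a runtime flag."""
        # Hypothetical flag name; stored in an environment variable only so the
        # sketch runs on its own.
        enabled = os.getenv("INCIDENT_BANNER_ENABLED", "false").lower() == "true"
        message = "We are aware of login issues and are working on a fix." if enabled else ""
        return {"show_banner": enabled, "message": message}


    if __name__ == "__main__":
        print(json.dumps(get_incident_banner(), indent=2))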

The main pressure seems contained, but James and his team know they need to keep pushing to get to the root cause of the problem. The issue (or even new ones!) may come back sooner or later.

Fifth rule: find the root cause

In disciplines like software engineering, it is very important to discern whether a solution is a final solution, a quick patch or a hack. Quite often, symptoms may be caused by one or more hidden issues. James knows that not understanding the problem correctly will lead to the wrong set of solutions, and the issue may persist.

He and his team try to replicate the same behaviour in a staging environment:

  • They ensure the same code version runs on those servers
  • They set up computing capacity similar to production
  • They generate the same data load as the one experienced during the outage (a minimal load-test sketch follows this list)
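The article doesn't say which tool they use, so this is only a rough, standard-library sketch of how production-like traffic could be replayed against staging; the URL, request count and concurrency are hypothetical, and a real setup would more likely use a dedicated load-testing tool such as k6, Locust or JMeter.

    from concurrent.futures import ThreadPoolExecutor
    from urllib import error, request

    STAGING_LOGIN_URL = "https://staging.example.com/api/login"  # hypothetical
    TOTAL_REQUESTS = 5_000
    CONCURRENCY = 100


    def hit_login(_: int) -> int:
        """Fire one login attempt and return the HTTP status (0 on network failure)."""
        try:
            with request.urlopen(STAGING_LOGIN_URL, data=b"{}", timeout=10) as resp:
                return resp.status
        except error.HTTPError as exc:
            return exc.code
        except error.URLError:
            return 0


    if __name__ == "__main__":
        with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
            statuses = list(pool.map(hit_login, range(TOTAL_REQUESTS)))
        failures = sum(1 for status in statuses if status == 0 or status >= 500)
        print(f"{TOTAL_REQUESTS} requests sent, {failures} failed")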

They are finally able to reproduce the same outcome: the service crashing and the database CPU going through the roof. Bingo.

They are now in a good position to start debugging. Eventually, they discover that some tables in the database are missing indexes. They apply the fix in staging and verify that the issue goes away. Now they can deploy it to production, inform both users and stakeholders, and move on to the final and probably most important phase.
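Before moving on, the effect of a missing index is worth a quick illustration. The real schema isn't part of the story, so this minimal sketch uses SQLite and a hypothetical users table: the query planner falls back to a full table scan until the index exists.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
    conn.executemany(
        "INSERT INTO users (email, name) VALUES (?, ?)",
        [(f"user{i}@example.com", f"User {i}") for i in range(10_000)],
    )

    query = "SELECT id FROM users WHERE email = ?"

    # Without an index on email, every lookup scans the whole table.
    print(conn.execute(f"EXPLAIN QUERY PLAN {query}", ("user42@example.com",)).fetchall())

    # Adding the missing index turns the scan into an index search.
    conn.execute("CREATE INDEX idx_users_email ON users (email)")
    print(conn.execute(f"EXPLAIN QUERY PLAN {query}", ("user42@example.com",)).fetchall())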

Sixth rule: learn about what happened

A team is not strong because it avoids failure, but rather because it can learn from it, fix it and prevent it in the future. Without failure, there is no growth.

After an incident like this one, it is essential that James and his team catch their breath, celebrate the resolution and spend some time analysing what happened and why, both to strengthen the software system and to challenge the development processes, tools and methodologies. A production incident rarely happens for a single reason.

James and his team hold a post-mortem meeting, where they can reflect on what happened in a safe, blame-free environment. They realise that:

  • They weren't doing load testing
  • There was no pull request created when those changes went live
  • It took too much time to increase the database capacity
  • There were some missing unit tests
  • There were some missing logs that would have been really useful to diagnose the issue sooner
  • There were some missing monitors and alerts that would have warned about the issue building up before users noticed
  • User complaints took too long to reach the development team

This is a really useful and productive outcome from the post-mortem meeting. It allows the team to identify underlying issues of different kinds. Some are related to the technology itself, but others are related to failing processes, inconsistent methodologies and a lack of communication. Everyone should feel accountable for fixing these, not only the development team but also the stakeholders.
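One of those findings, the missing monitors and alerts, is easy to illustrate. The following is only a minimal sketch of the kind of check that could have flagged the database CPU climbing well before users noticed: the thresholds are hypothetical and the metric fetch is simulated, since the real value would come from a metrics backend such as CloudWatch or Prometheus.

    import random
    import time


    def fetch_db_cpu_percent() -> float:
        """Simulated metric; a real monitor would query the metrics backend."""
        return random.uniform(40, 100)


    def alert(message: str, severity: str) -> None:
        """Placeholder for sending to Slack, PagerDuty or similar."""
        print(f"[{severity}] {message}")


    def check_db_cpu(warn_at: float = 70.0, page_at: float = 90.0) -> None:
        cpu = fetch_db_cpu_percent()
        if cpu >= page_at:
            alert(f"Database CPU at {cpu:.0f}%: paging on-call", severity="critical")
        elif cpu >= warn_at:
            alert(f"Database CPU at {cpu:.0f}%: investigate before it becomes an outage", severity="warning")


    if __name__ == "__main__":
        for _ in range(5):  # a real monitor would run continuously, e.g. every minute
            check_db_cpu()
            time.sleep(1)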

After the incident is resolved and all the lessons are learnt, the team returns to regular work with a backlog that now includes the new tasks generated by the learning process. The system will get stronger thanks to this.