Tuesday, February 26, 2008

Effective post-mortems

One of my tasks as a CTO is running post-mortem meetings after we have an incident or outage. This is an extremely important step toward making progress in system stability and performance. Teams that don't do post-mortems miss the opportunity to get ahead of system issues.

Post-mortem meetings take some finessing; they can easily turn into a blame game or a round of he-said, she-said. So let me lay out how I run post-mortem meetings, and how that makes them most effective.

First, important rules for post-mortems:
1) Timely (the day after the incident is best)
2) All relevant members present (no meeting if someone is missing)
3) Impartial moderator
4) Empty whiteboard to describe incident/issue
5) There is no blame

It is critically important to have "No blame", as you will make no progress otherwise. You need to acknowledge that everyone is working hard, that systems can fail and it's nobody's fault, and that openly discussing the issues together is the best road to ultimate resolution and system growth.

Now, even if your team knows "no blame", they will probably still be on edge at the start of the meeting. It's natural; failing systems create pressure. The team may also be avoiding dealing with the issue, hoping it will go away, and you may have to bring them back into it. What I find in post-mortems is that teams try too quickly to get to a solution. Don't let them. Instead, have them focus on a timeline of events.

I've found that starting the meeting with a chronology of events is extremely effective. I ask the team "what happened" and "when", make them be exact about the when (e.g. 5:14pm), and transcribe it all to the whiteboard. We include communications and hand-offs in the timeline, and any other information we collected at the time or gleaned later from system logs. Something about the focus on an exact timeline gets everyone to focus as a team. Maybe it's because it turns us into detectives examining someone else's problem, not ours (if anyone has a better theory as to why, let me know).
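If you want to keep the whiteboard timeline after the meeting, it can be captured in a few lines of code. What follows is only a minimal sketch in Python, not a tool I use: the names `TimelineEvent`, `build_timeline`, and `format_timeline` are made up for illustration, and the sample entries (including the date) are invented. The idea is simply to record each event with an exact time and a source (a person, a hand-off, a system log), then sort everything chronologically regardless of the order in which people remembered it.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    """One row on the whiteboard (hypothetical structure)."""
    when: datetime   # exact time, e.g. 5:14pm
    source: str      # who or what reported it: a person, a hand-off, a log
    what: str        # what happened, stated without blame


def build_timeline(events):
    """Return the events in chronological order."""
    return sorted(events, key=lambda e: e.when)


def format_timeline(events):
    """Render the sorted timeline as one line of text per event."""
    return [
        f"{e.when:%H:%M}  [{e.source}]  {e.what}"
        for e in build_timeline(events)
    ]


# Invented sample entries, listed in the order the team recalled them.
events = [
    TimelineEvent(datetime(2008, 2, 25, 17, 31), "on-call", "Handed off to database team"),
    TimelineEvent(datetime(2008, 2, 25, 17, 14), "monitoring", "First alert fired"),
    TimelineEvent(datetime(2008, 2, 25, 17, 20), "system log", "Error rate spike recorded"),
]

for line in format_timeline(events):
    print(line)
```

Keeping the source next to each entry makes it easy to see where communications and hand-offs fall relative to what the logs recorded.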

After an hour, we usually have a list of immediate, medium-term, and long-term actions to take to remedy the issue. From that, the candidate cause/solution usually stands out, and we make sure we have alternate solutions should our candidate be wrong. These are captured on the whiteboard, and we make sure we have both mitigations and solutions (making a problem go away is sometimes as good as fixing it).
