One of my tasks as a CTO is running post-mortem meetings after we have an incident or outage. This is an extremely important step toward making progress in system stability and performance. Teams that don't do post-mortems miss the opportunity to get ahead of system issues.
Post-mortem meetings take some finessing; they can easily turn into a blame game or a round of he-said, she-said. So let me lay out how I run post-mortem meetings, and what makes them most effective.
First, important rules for post-mortems:
1) Held promptly after the issue (next day is best)
2) All relevant members present (no meeting if someone is missing)
3) Impartial moderator
4) Empty whiteboard to describe incident/issue
5) There is no blame
It is critically important to have "No blame", as you will make no progress otherwise. You need to acknowledge that everyone is working hard, that systems can fail and it's nobody's fault, and that openly discussing the issues together is the best road to ultimate resolution and system growth.
Now, even if your team knows "no blame", they will probably still be on edge at the start of the meeting. It's natural; failing systems create pressure. The team may also be avoiding dealing with the issue, hoping it will go away, and you may have to bring them back into it. What I find in post-mortems is that teams try too quickly to get to a solution. Don't let them; instead, have them focus on a timeline of events.
I've found that starting the meeting with a chronology of events is extremely effective. I ask the team "what happened" and "when", make them be exact about the when (e.g. 5:14pm), and transcribe it all to the whiteboard. We include communications and hand-offs in the timeline, and any other information we collected at the time or gleaned later from system logs. Something about the focus on an exact timeline gets everyone to focus as a team. Maybe it's because it turns us into detectives examining someone else's problem, not ours (if anyone has a better theory as to why, let me know).
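If you would rather keep the timeline in a file than on a whiteboard, here is a minimal sketch of the same idea in Python. Everything in it is my own illustration, not a tool we use: the names TimelineEntry, build_timeline and format_timeline are hypothetical, and the sample times and events are made up. It simply merges what people recalled with what the logs showed and puts it all in exact chronological order.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    when: datetime  # exact time, e.g. 5:14pm, not "late afternoon"
    source: str     # who or what reported it (a person, a system log)
    what: str       # what happened, stated as a fact, with no blame


def build_timeline(*sources):
    """Merge entries from several sources into one chronological list."""
    merged = [entry for entries in sources for entry in entries]
    return sorted(merged, key=lambda entry: entry.when)


def format_timeline(entries):
    """Render one whiteboard-style line per entry."""
    return [f"{e.when:%H:%M}  [{e.source}]  {e.what}" for e in entries]


# Hypothetical sample data: what the team recalled in the meeting...
recalled = [
    TimelineEntry(datetime(2008, 2, 25, 17, 14), "on-call engineer",
                  "Paged about failed logins"),
    TimelineEntry(datetime(2008, 2, 25, 17, 31), "on-call engineer",
                  "Handed off to database team"),
]
# ...and what was gleaned later from system logs.
from_logs = [
    TimelineEntry(datetime(2008, 2, 25, 17, 2), "system log",
                  "First connection errors appear"),
    TimelineEntry(datetime(2008, 2, 25, 17, 40), "system log",
                  "Error rate returns to normal"),
]

timeline = build_timeline(recalled, from_logs)
for line in format_timeline(timeline):
    print(line)
```

Note that hand-offs and communications are entries like any other, which is what lets gaps between "who knew what, when" show up once everything is sorted.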
After an hour, we usually have a list of immediate, medium-term, and long-term actions to take to remedy the issue. From that, the candidate cause/solution usually stands out, and we make sure we have alternate solutions should our candidate be wrong. These are captured on the whiteboard, and we make sure we have both mitigations and solutions (making a problem go away is sometimes as good as fixing it).
Tuesday, February 26, 2008
Effective post-mortems
Friday, February 15, 2008
Email wrong-number
Sometimes technology can have significant and unintended cultural impacts. Email in its infancy was a minefield, in that we never realized it was an inappropriate (and one-sided) tool for expressing emotions. It took us a while, but we finally taught ourselves to reread and pause before hitting the send button. In Japanese, there is a word for "unsay", to take something back, but alas not in English, nor is there a reliable unsend in email.
I've been using a Blackberry for phone and email for a while, and I've noticed an interesting phenomenon. I will call it the "Friendly wrong number". I meant to call Wendy A, but instead called Wendy B, since I dialed from my email address book and they were adjacent. "Hi, Wendy?" "Yes, who's this?" "It's Jon." "Jon? Oh, we haven't spoken for a few years. You still working at...?" Damn, wrong Wendy. But here's the kicker: I know Wendy, so I can't just say sorry, wrong number. This could get dicey if your boss is "Jamie B", and your best buddy is "Jamie C".
A variation is the call-back from someone you just spoke to, but by mistake. Some new phone feature must be causing that, I am guessing. Another variation is in email, the dreaded auto-fill feature in Outlook.
This makes me wonder what unintended future impacts technology will have. Many of my LinkedIn contacts have photos, and it won't be long before they also show on phone and email messages on my Blackberry. Will the prevalence of GPS maps make us lose our sense of direction, so that we need to carry compasses with us wherever we go? (Compass, what's a compass?) I guess we'll find out.
Wednesday, February 13, 2008
Worthless processes
When reviewing a write-up or new process/project proposal, a filter that I will apply is "Does this make a difference?".
Let's say one of your team was asked to write up a plan for "managing execution risk", and they wrote a document describing a process for doing this. After you read the document, you decide that while it all makes logical sense, it really will not make a difference. For example, tracking how many lines of code are written each day makes no difference. Who cares? It's business outcomes that matter.
As a CTO, it is your job to recognize tasks that don't make a difference, and to tell your team that they can stop doing them. If something is worthless, say so. Or go one step further, and ask your team to tell you which tasks they believe are not useful or optimal. Think of it as spring cleaning. Some tasks are not negotiable and are in fact important for compliance reasons (e.g. Sarbanes-Oxley). But if it's internally driven, reexamine it. Our teams are so busy, they'll appreciate the time savings.
So, what processes/tasks is your team doing that you can throw away?