This past week I took my first stab at facilitating a blameless postmortem on an outage my team experienced recently. The postmortem was on an outage of a new tool (which will remain nameless) we had brought in to help our developers. I won't share the details of the outage, but I can share some of the obstacles my team and I had to overcome in this blameless postmortem.
I am actually pretty underqualified to be a facilitator. I have only just begun to learn about resilience engineering, safety, and human factors, and all of my learnings on blamelessness are from books, this blog by John Allspaw, and John Willis' edX course on DevOps. So I went into this pretty uncertain about how it would end up.
I was not actually the one to initiate the meeting; it was organized by a teammate who was an engineer on the team. I was also barely involved with the outage, as I had other work at the time (and, as a business analyst, I was not involved with the technical implementation), so I was almost entirely unaware of the events that transpired. We had a pretty solid understanding of the root cause, backed by solid evidence. What we needed was to derive solutions to prevent future outages.
Going into the meeting, my manager and my coworkers were on the call (we are not co-located and could not meet in person, so this had to be done over audio). I was then asked to facilitate the meeting, which I was kinda hoping would happen, as I really wanted to try out what I had been studying. Before the meeting, I had sent the team some materials on blamelessness to check out: John Allspaw's thesis on outages, his blog post on blameless postmortems, and some materials on the Theory of Constraints Thinking Process, which also supports a blameless culture.
To begin with, I explained my understanding of blameless and why it is important.
This is how I explained it:
“Blameless is an important way to approach this outage, not just because it is kind and noble, but because it helps us be more scientific in our approach. If we quickly assign blame to a person or team, we may miss the real reasons this outage happened and the environmental factors that led to it. So I want to set a ground rule: we cannot say the root cause is a person or another team.”
To this my manager, who has also begun to study blameless, had some push back and an important question:
“Okay hold on, I don’t want to do that because accountability is important. What if it really was negligence or someone not doing a good job? They need to be held responsible for their actions.”
This made me think: he's right, people do bad things and that can't go unpunished. But I also remembered what Goldratt says in The Choice: people are good. So, starting from that belief, how do we approach this problem?
I gave this analogy as a response:
“Okay, so we know that there are people who do bad things, but what we are looking for is the true root cause, not retribution. So say we have a murderer. For us, we want to know what led him to commit murder. We want to shift the blame from him being a bad person to deeper questions like, ‘What was the situation at the time?’, ‘Who was there and what were they doing?’, ‘What were the events leading up to this?’, ‘What was his childhood like?’, and so on. If we stop at punishing the criminal, then we will have done nothing to change the environment that contributed to his crime, and we might continue to create murderers. This is not to say that the murderer doesn’t need to be punished or that he doesn’t need to be removed from society; it’s that these actions aren’t the goal.”
“I see what you’re getting at,” replied my manager with some reluctance in his voice. “It’s like that Netflix show Mindhunter, where they try to figure out why criminals commit their crimes.”
“Yeah! I mean, I have never seen the show, but that’s kinda the idea,” I replied, hoping I could continue without having to explain or justify more, as time was ticking away.
“Okay, I’m willing to give it a shot,” my manager agreed.
Now the hard part: I literally had almost no idea where to start, so I turned to Allspaw’s blog post and found this list…
Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:
- what actions they took at what time,
- what effects they observed,
- expectations they had,
- assumptions they had made,
- and their understanding of timeline of events as they occurred.
…and that they can give this detailed account without fear of punishment or retribution.
I see that observing effects is near the top of the list and this reminded me of the Theory of Constraints thinking process where they start with listing the observed negative effects as a place to begin understanding problems, so I decided to start there.
My aim was to build a clear picture of what we saw at the beginning; hopefully, this would help us understand clearly what we were trying to solve and, as I now realize while writing this blog post, help us understand what our failure signals were.
There were some parameters for what counts as a legitimate negative effect and what does not. What does not count is a negative effect phrased in terms of a perceived solution that was not implemented. An example might be, “Not having monitoring in place” – this would not count as a legitimate negative effect, since it does not describe the conditions at the time and distracts from the situation in that moment. A better way to state the same observation would be: “Undetected failures in the network or applications.” The purpose of this is to avoid chasing solutions before truly understanding the problem, which is the whole spirit behind blamelessness.
I also decided to borrow from the Lean Coffee approach, so I timeboxed us at 5 minutes to talk about observed negative effects, knowing I would add time if needed. We then began listing observations. This was new for all of us, so we only came up with about 3 observed negative effects. Instead of pushing the team to think of more, I decided we would move on and be satisfied with the negative effects that first came to mind.
I then asked them, “What expectations did we have?” This seemed a little silly at first, since “not having an outage” seemed pretty obvious. But I countered that it is important that we are clear about what we and our customers expect. Listing our expectations helps us understand our biases at the time and helps explain why things were missed. Again, we focused only on the situation at that time, letting go of what the expectations “should” have been.
I then asked what assumptions they had. This helps us understand our expectations and our own perceptions. This was hard even for me to define, so we only listed a couple. Assumptions can be vague, are often hard to distinguish from expectations, and can be perceived as reality. In hindsight, I should have waited to ask this question until after talking through the events that transpired. After seeing the hard facts laid out, false assumptions are easier to spot, in my opinion.
After we discussed what assumptions we had, we moved to building a timeline. The outage was a couple of days before the meeting, so some details were fuzzy. Fortunately, my team had publicly posted a detailed writeup of the problem and the final solution, which gave us a starting place. We worked through how the outage was discovered, what we learned, what we missed, and how we managed the outage.
I thought this would be the easy part, but I found it was hard for my team to be honest about decisions they had made. For example, when one team member mentioned a decision, I actually started to dredge up feelings of judgment that I needed to face in myself. I felt like, “Isn’t it obvious that you should put those things aside while we are having a service outage?” But that was blame, and I thought that if I was having these thoughts, others might be too. So I decided to say, “Just to be clear, it is immensely important that we don’t blame each other or ourselves for any perceived mistakes; we are trying to learn right now.”
As we discussed the timeline, we discovered more assumptions we had made, assumptions other involved teams had made, other negative effects, and expectations we had held. It crystallized a lot of things that we didn’t realize when we started.
We worked the timeline out in about 20 minutes. We didn’t get into a great deal of detail, but we hit on what the team felt were the key points. We then moved on to discussing solutions.
There were already some solutions in mind, but, interestingly, some solutions I had heard at the beginning of the meeting were nowhere to be found at the end. This led us to talk about things we could do to work better with our tools and our vendors, and how we could better prepare for production implementations.
This also led to discussions of other issues we had discovered in our environment. It appeared that this portion of the discussion was what the engineers really wanted to talk about, and that some of the postmortem was something they patiently waited through. It felt like it may have been a bit of overkill to them. This is, of course, only my assumption.
In the end, we did not find a root cause associated with a failure in our behavior or the behavior of others. We did, however, find solutions that changed the way we approach the implementation process, along with new quality checks that we thought would help. It was a little rough, but I did learn a lot.
Here are my takeaways from my first time being a facilitator:
- Explicitly asking what the negative effects/assumptions/expectations were was not as effective for finding answers as I had thought. It put people on the spot, and with the outage having been a couple of days prior, the memories were no longer fresh. Racking their brains takes time and effort.
- Working on the timeline was the most interesting and interactive part of the experience.
- People want to get right to solutions; getting people to let their ideas go is not easy. I was fortunate to have a willing team, though, and I am grateful that they let me run with my ideas.
- Timeboxing is good, but difficult on a phone call. In the future I will likely use a video chat and/or screen share.
- Blame and judgment are really natural emotional responses. Even with all my study, reading, and learning, it was far harder than I expected to separate my natural judgments from what I was observing and hearing.
- Once all the facts are out, it’s easier to find solutions. Because my team was willing to put finding solutions on hold for a bit while we focused on observations, deciding next steps took very little effort.
Lessons for next time:
- Next time I will start with a timeline and ask for the negative effects, expectations, and assumptions for each event my team comes across. Having observed that my team had an easier time recalling what they experienced while discussing the timeline, I think the timeline is a better place to start.
- I am keeping the explicit ground rules that negative effects cannot be phrased as potential solutions and that the root cause cannot be a person or team. This really helped drive the discussion away from people and toward what the conditions were at the time.
- I need to read more about blamelessness and resilience engineering. I have recently purchased The Field Guide to Understanding Human Error by Sidney Dekker to help me with this.
- Lastly, writing something like this after the postmortem helped me learn a lot from the experience, and I will make it part of the postmortem process going forward. A postmortem of a postmortem, if you will (LOL).