This is based on the "Blameless Postmortem" IT document, but has been modified for use in fire and emergency services.
- The Incident Owner schedules a meeting. See "Who should we invite at the meeting" below for suggestions. Invited people must prioritize this meeting over all else (notable exceptions: other incidents in production and interviews). Schedule the meeting as early as possible so the incident is still fresh in memory.
- Share the meeting info publicly for anyone interested in attending. The purpose of post-incident analyses is to spread learning, after all!
| Blameless post-incident analysis | Notes |
|---|---|
| What mistakes were made? | |
| What made one of us think they were doing the right thing as they were committing a mistake? | |
| The Five Whys tree | |
| What has this incident taught us? | |
| What must we change, as the highest priority, to prevent this kind of incident from happening again? | |
- The Incident Owner shares their screen and shows the Incident Page. At the top of the page, create a new section for the post-incident analysis. For convenience, you may copy/paste the content of this page to use as a template.
- Elect a note taker for the meeting who will fill in the blanks in the template above. If no one volunteers, the Incident Owner has to be the note taker.
- Discuss what factors contributed to the incident:
  - Avoid settling on a single root cause
  - Ask "Why?" as often as you can to understand each factor's causes and effects
  - Listen to your Error Experts
- Write a section for what this incident taught us
- Write a section for corrective actions:
  - Try to eliminate the identified risk factors that led to this incident
  - No "nice-to-haves"
  - Each action is top priority for its assignee when the post-incident analysis meeting is over (otherwise, it's a nice-to-have)
  - Delete actions not assigned to someone when the meeting is over. This is important.
- Make a public announcement about the new blameless post-incident analysis document when it is complete.
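
The rule that unassigned corrective actions are deleted at the end of the meeting can be sketched in a few lines of Python. This is only an illustration, assuming actions are kept as simple records; the `CorrectiveAction` class, the `keep_assigned` helper, and the sample actions are hypothetical and not part of any tool this page prescribes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CorrectiveAction:
    """Hypothetical record for one corrective action from the meeting."""
    description: str
    assignee: Optional[str] = None  # None or blank means nobody owns it


def keep_assigned(actions: List[CorrectiveAction]) -> List[CorrectiveAction]:
    """Return only the actions that have an assignee; the rest are dropped."""
    return [a for a in actions if a.assignee and a.assignee.strip()]


# Made-up sample data, for illustration only.
actions = [
    CorrectiveAction("Review radio check procedure", assignee="Lt. Example"),
    CorrectiveAction("Repaint the apparatus bay"),  # no owner: a nice-to-have
]
remaining = keep_assigned(actions)
print([a.description for a in remaining])
```

Filtering on the assignee, rather than on a separate priority field, mirrors the rule above: an action nobody owns is treated as a nice-to-have.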
Preferably, we want to invite someone to the meeting if they:
- were involved in decisions that contributed to the problem
- identified the problem
- responded to the problem
- diagnosed the problem
- were affected by the problem
- manifested interest in attending the post-incident analysis meeting
For invited people, this meeting should take priority over any other meeting or event, except another incident or interviews. A manager should never give someone a hard time for making a post-incident analysis their top priority.

Feel free to copy and paste the paragraph above into your post-incident analysis invitations.
A blameless post-incident analysis should be held after:
- Any incident that resulted in the loss of life.
- Any incident which involved a failure of the Incident Command System.
- Any incident which involved a significant loss of property.
- Any incident that negatively affected fire company objectives (good service, community engagement, etc.)
The goal is to better educate staff on how to prevent a specific issue from happening again. As we get better at seeing and solving problems, we must decrease the threshold of what constitutes a problem to keep learning. Doing this amplifies weak failure signals (which is a good thing).
In the fire service, we lack the innate ability to differentiate between being lucky and being good. When we have a high-profile failure, we owe it to the citizens we protect to learn from it. We get better and safer as firefighters by automating organizational learning. Organizational learning requires people who made errors to be enthusiastic about helping others avoid the same errors in the future. Blameless post-incident analyses cannot be automated. They must become a habit. It is important that we learn to do them right, so we can accumulate a library of knowledge on how failures happen at our company and the decisions we made to solve these inherent problems. By removing blame, you remove fear. By removing fear, you enable honesty. By enabling honesty, you enable prevention. When you succeed, we succeed.
As the Incident Owner, your role is to:
- Ask to record the meeting; it will be a gold mine of information later.
- Create a new page in Blameless Post-incident analyses with a title like "YYYY-MM-DD-Incident Title."
- Take notes throughout the incident in the page you created (see the "Incident note taking example"). Focus on:
  - Reconstructing, as it unfolds, the timeline of what led to the incident
  - Observations (document in the incident document whenever possible)
  - Actions performed, and by whom
- Be the last person to exit the incident meeting.
- Schedule the blameless post-incident analysis that will take place after the incident is resolved.
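
As a small illustration of the page-title convention mentioned above ("YYYY-MM-DD-Incident Title"), the following Python sketch builds such a title from a date and a short description. The function name `incident_page_title` and the sample title are assumptions made for this example only.

```python
from datetime import date


def incident_page_title(incident_date: date, title: str) -> str:
    """Build a page title in the form YYYY-MM-DD-Incident Title."""
    # date.isoformat() yields the zero-padded YYYY-MM-DD form.
    return f"{incident_date.isoformat()}-{title.strip()}"


# Sample values chosen for illustration.
print(incident_page_title(date(2023, 10, 31), "Failed rolling update"))
```

Leading the title with an ISO date keeps the pages in chronological order when they are sorted by name.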
As an Error Expert, your role is to:
- Don't feel bad about the role you played in the incident. Incidents are unplanned investments.
- Share as much information as you can about any mistake you made that might have led to the incident. Try to raise the question: "Why did I think I was doing the right thing as I was making my mistake?"
- Educate people about your mistake so that no one else has to make it again.
| Timestamp | Notes |
|---|---|
| 2023-10-31 8:30 AM | Slack message sent from @AlanWake, who noticed anomalies in production. Link to initial Slack message from Michael to inform of the situation. Link to NewRelic logs showing problem in prod. |
| 2023-10-31 9:00 AM | @felleg performed a rolling update of the servers in the canadacentral location. The update failed (screenshot here). |
| 2023-10-31 9:10 AM | Created meeting room to address incident. People present: @felleg (Incident Owner), @barb (commands executor), @charliechaplin, @goofie, @captainamerica |
| 2023-10-31 9:15 AM | @Justin noticed that the build pipeline of the latest version of C1 (version 3.1.146) succeeded with the warning `Host undefined: will fail in production` (link to log) |
| 2023-10-31 10:02 AM | @Justin fixed the "Host undefined" error (link to PR). This was due to a new feature introduced in 3.1.145 that had not been tested yet. To address in post-incident analysis. @goofie approved the PR and merged to master |
| 2023-10-31 10:17 AM | New build passed without warning, all green (link to pipeline logs) |
| 2023-10-31 10:22 AM | @goofie made a new rolling update attempt to deploy to prod, success (screenshot) |
| 2023-10-31 10:30 AM | NewRelic logs indicate incident is solved (link) |
| 2023-10-31 10:35 AM | @felleg closed the incident meeting room. Scheduled post-incident analysis for 2023-11-01 at 10:00 AM |
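
A note taker who collects entries out of order may find it handy to generate the timeline table rather than type it by hand. The sketch below is one possible way to do that in Python, assuming entries are kept as `(datetime, note)` pairs; the helper names `timeline_row` and `timeline_table` are hypothetical, and the hour is zero-padded (for example `08:30 AM`), which differs slightly from the hand-written table above.

```python
from datetime import datetime
from typing import Iterable, Tuple


def timeline_row(when: datetime, note: str) -> str:
    """Format one timeline entry as a Markdown table row."""
    stamp = when.strftime("%Y-%m-%d %I:%M %p")
    # Escape pipes so a note cannot break the table layout.
    safe_note = note.replace("|", "\\|")
    return f"| {stamp} | {safe_note} |"


def timeline_table(entries: Iterable[Tuple[datetime, str]]) -> str:
    """Render (datetime, note) pairs as a table sorted by time."""
    lines = ["| Timestamp | Notes |", "|---|---|"]
    for when, note in sorted(entries, key=lambda entry: entry[0]):
        lines.append(timeline_row(when, note))
    return "\n".join(lines)


# Entries paraphrased from the example above, deliberately out of order.
entries = [
    (datetime(2023, 10, 31, 9, 0), "Rolling update failed"),
    (datetime(2023, 10, 31, 8, 30), "Anomalies noticed in production"),
]
table = timeline_table(entries)
print(table)
```

Sorting by timestamp before rendering means notes can be added as they come in and still read chronologically in the final page.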
This page was inspired by The DevOps Handbook p. 308-310 and 418-419 (2nd ed).
This is based on "Blameless Postmortems".