Last week I went to the DevOpsDays Seattle conference. It was a great community event and very well-run, and I got to reconnect with a bunch of old friends and make some new ones. The best part of the conference for me was John Allspaw’s talk on “Taking Human Performance Seriously in Software.” As a teacher of human skills to technology teams, I found this topic especially compelling, and I was not disappointed.
John, previously CTO at Etsy and now with the consulting firm Adaptive Capacity Labs, is part of a consortium looking at human factors in incident response. The results of this work have been published as the Stella report. It’s a detailed look, built on several case studies, at the process and human dynamics that unfold when a critical incident strikes an organization. Among many other great insights from the report, these stood out to me:
- Distributed computing systems are too complex for anyone to hold an accurate view of what they actually are.
- Each team member brings a unique (and different!) mental model of the system they are working with.
- When a system anomaly appears and generates a critical incident, troubleshooting and resolution involve negotiating these mental models until the team arrives at a shared understanding of the anomaly sufficient to resolve the problem.
- Nevertheless, the system remains inscrutable, and more anomalies will appear without fail.
- One of the key organizational capabilities for dealing with this situation is a robust post-mortem process that enables maximum learning: not to make anomalies disappear (that’s not going to happen), but to increase the organization’s capacity for facing the unknown together as effectively as possible.
The consortium that put the Stella report together is called “The SNAFUcatchers Workshop on Coping With Complexity.” “Coping” is a loaded word: it sounds soft and fuzzy, like something that can’t be addressed just by thinking hard. But the inescapable conclusion of the report is that thinking hard is no longer enough. When systems are so complex that they are essentially unknowable, we need to adopt a different mode of relating to them.
And this is where mindfulness and related human skills like emotional intelligence come in. I think there are two key capabilities of mindfulness that are particularly relevant:
- If your mental map of the system is wrong by definition, then during an incident, attachment to your own assumptions is your worst enemy. Mindfulness practice is quite explicitly an antidote to such attachment.
- Neither leaders nor engineers are generally comfortable with the concept of unknowability. If a complex distributed computing system is unknowable by its nature, then individual and organizational anxiety naturally follow. Anxiety generates distraction and degrades attention. Mindfulness practice is specifically intended to improve attention and to provide a stable context for working with emotions like anxiety with greater clarity and effectiveness.
Some other areas of the report I look forward to exploring in more depth:
- I’m especially interested in the key role that post-mortems play. In a breakout session at DevOpsDays Seattle there was a great discussion about post-mortem facilitation and the tricky business of holding the space with authenticity and safety so that hard conversations and good decisions can happen. Human skills like mindfulness and EQ make this seemingly magical process much more achievable with consistency.
- The other fascinating arena is the quality of cognition that happens during incident response: you’re up late at night, the pressure is on, executives want answers, and you’re coping with complexity. In these circumstances, it’s hard to think clearly, and it’s very easy to make bad choices with potentially catastrophic effects. How do individuals and teams find the resources to bring their best thinking to these circumstances? You can probably guess what my answer would be…