Holding Retrospectives in a Cloud-Native World
When I worked at AWS, we held a retrospective meeting every Wednesday to review system incidents from the previous week. Managers and senior leaders from across AWS met to discuss what had gone wrong and what we could do differently to ensure the issue didn’t repeat.
The premise was good—as a leadership team, we reviewed individual problems in depth so everyone could benefit from the learnings.
Unfortunately, sometimes the meetings didn’t go so well. They could easily turn into blame sessions. With management and other service team owners quizzing the person currently in the spotlight, the discussion could feel like this:
“Tell us what you did wrong last week and why you won’t do that again.”
The questioning could get intense. Being in the hot seat was an intimidating and sometimes demeaning experience.
As a team leader for one of the early AWS services, I was part of this group and had to present problems with my service on more than one occasion. I always dreaded it.
You see, the intent of these retrospective meetings was good, but in the organization’s early days, the way they were run had some issues. While I no longer work at AWS, I’ve heard from friends that the meetings still occur, but they are run more constructively and produce much better results. That’s great to hear.
Years of experience demonstrate the value of retrospectives. But, unchecked, retrospectives can quickly turn into blame sessions.
Be Mindful of Siloed Learning
For all their faults, the “Wednesday meetings” had great value. They created an opportunity for learning between teams. One problem with cloud-native applications is that they can inadvertently create a siloed organization. Each development team owns every aspect of the service it is responsible for (design, construction, testing, deployment, operation, troubleshooting, everything), so teams can end up isolated from one another. When one team has a problem and finds an innovative solution to that problem, how does it share that solution with all the other unconnected groups within the same organization?
This was easier in organizations supporting monolithic applications. In such organizations, everybody was involved when a problem occurred, so naturally everyone was involved in the retrospective. As a result, the team as a whole was closer.
But in an organization supporting a cloud-native application, when a problem occurs, it’s often limited to one or a handful of services. This means only one or a small handful of teams are even aware of the problem. This is great from an application scaling perspective, but it means you can lose visibility into shared operational aspects across teams.
This is why AWS had these Wednesday meetings. The goal was for every team to hear what problems other teams were having, in the hope that everyone could learn from each other’s experiences.
Avoid the Blame Game
Sharing retrospective outcomes across teams is a great way to avoid siloed learning. Still, it can quickly devolve into a forum for “putting on trial” individual teams to showcase where they failed, like a trial with a jury of your peers.
This culture encourages individual leaders not to share results or to sugarcoat them to hide the bad aspects. This helps nobody.
How can you encourage sharing retrospective knowledge across unrelated teams without creating a blame culture? Here are a few ideas:
- Do a deep dive retrospective within the team that owns the service that had the problem. Here, in the small, closely connected team, the good, the bad and the ugly can all be shared openly. The details of the discussion don’t need to leave the team.
- Prepare a “learnings report” for the broader company. The report should cover what happened, which tools worked and which did not work to resolve the problem, and updated best practices for dealing with issues like this in the future.
- Keep the learnings report focused on what happened, what resolved it and how it can be avoided. It should not focus on why it was allowed to happen or who did what to cause it.
The report must have the proper focus. It should sound more like, “Look what we did that helped the overall application/system.” This is a productive message. It should not sound like, “Sorry, this is how we failed and we promise we won’t do it again.” It should focus on results and improvements, not on listing failures.
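If your team keeps its reports in a repository or wiki, one way to hold that focus is to give the learnings report a fixed shape. The following is only a rough sketch of what that might look like in Python, not a prescribed format: the `LearningsReport` class, its field names and the sample incident are all made up for illustration. The one deliberate design choice is what is missing: there is no field for who caused the problem or why it was allowed to happen.

```python
from dataclasses import dataclass, field


@dataclass
class LearningsReport:
    """Hypothetical shape for a company-wide learnings report.

    There are intentionally no fields for who caused the incident or
    why it was allowed to happen. Those details stay inside the owning
    team's own deep-dive retrospective.
    """

    service: str
    what_happened: str
    what_resolved_it: str
    tools_that_worked: list[str] = field(default_factory=list)
    tools_that_did_not_work: list[str] = field(default_factory=list)
    updated_best_practices: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the report as plain text that can be shared across teams."""

        def bullets(items: list[str]) -> str:
            # Show an explicit placeholder so empty sections are visible.
            return "\n".join(f"  - {item}" for item in items) or "  - (none)"

        return "\n".join(
            [
                f"Learnings report: {self.service}",
                f"What happened: {self.what_happened}",
                f"What resolved it: {self.what_resolved_it}",
                "Tools that worked:",
                bullets(self.tools_that_worked),
                "Tools that did not work:",
                bullets(self.tools_that_did_not_work),
                "Updated best practices:",
                bullets(self.updated_best_practices),
            ]
        )


if __name__ == "__main__":
    # Invented example data, purely to show the output format.
    report = LearningsReport(
        service="example-service",
        what_happened="Requests timed out for part of the afternoon.",
        what_resolved_it="Rolling back the latest configuration change.",
        tools_that_worked=["Latency dashboard"],
        tools_that_did_not_work=["Log search (too slow under load)"],
        updated_best_practices=["Roll out configuration changes gradually."],
    )
    print(report.render())
```

A structure like this will not create a blameless culture on its own, but it does make the blame-oriented questions harder to slip into the shared document.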
Blameless retrospectives are an essential part of any organization’s continuous improvement process. They can help foster a culture of openness and transparency where employees feel comfortable discussing problems and solutions.
But the more isolated teams of a cloud-native application can turn away from this culture and drift toward an us-versus-them mentality. Leadership at the highest levels of the company needs to discourage that mentality and drive it out of the culture. Only by keeping retrospectives blameless can you truly achieve the continuous improvement that a cloud-native platform strives for.