A pointless, destructive way to vent anger when you’ve got a real problem

Oh crap. The BAD THING happened. There are red flashing lights, alarm bells are going off, one of the junior devs is hiding under his desk, the devops guy is breathing into a paper bag, and the project manager is catatonic. You’ve got a problem.

Or, instead of an all-hands-on-deck emergency, the BAD THING is a day-in, day-out background radiation of annoyance and frustration. It hasn’t turned into an immediate, critical problem, but it threatens to turn the office into a Lord of the Flies scenario. Any day now, the QA team is going to fashion office supplies into crude weapons and start hunting the marketing team for food & sport. One of the senior devs will declare themselves a god and start a cult. You’ve got a problem.

Who is to blame?

Assigning blame is procrastination

Think of it this way: your house is on fire. What matters more: finding the guilty party, or getting out of the damn house and calling the fire department? Obviously, rooting out the guilty party! Call everyone into the living room and interrogate them. That way, when you die of smoke inhalation, you’ll shuffle off this mortal coil secure in the knowledge that it was somebody else’s fault. Literally the only benefit of finding somebody to blame is the satisfaction of feeling “superior”.

Alternatively, I would suggest addressing the problem first. Maybe the problem is a person, but in my experience, that’s very rarely the case. If something has turned into a problem, it’s almost always an entire team at fault.

Real life examples

GitLab database outage, Jan 2017

On January 31, 2017, GitLab experienced a severe database failure. The postmortem describes what happened pretty well, so I won’t rehash the details here. What I will do is call out their response. It’s one of the best responses to a critical failure I’ve ever had the pleasure of seeing.

The individual who accidentally set off the cascade of problems that led to a severe failure wasn’t blamed. They weren’t fired. To my knowledge, they didn’t even face any internal repercussions, other than whatever shame and embarrassment they felt.

Amazon ELB outage, Dec 2012

On December 24, 2012, Amazon’s Elastic Load Balancing service failed. Again, the postmortem is sufficiently detailed, so I won’t go into details.

Same situation - a single person accidentally set off a cascade of events that led to a major service outage from a major provider. Again, to my knowledge, there was no internal discipline.

Anonymous reddit post, Jun 2017

This fantastic post on Reddit is a wonderful example of how a company should not behave. On their first day, a brand-new junior developer was following the written instructions for setting up their development environment, and accidentally destroyed the company’s production environment, thanks to a litany of problems that already existed. Appropriately enough, the GitLab developer from the above story outlined several of the preexisting critical problems in the Reddit responses.

Ultimately, the CTO blamed the junior developer, who lost their job. They found another one quickly enough, which isn’t surprising. I would have hired that person. They learned a very important lesson very early in their career, and I doubt they will ever repeat that mistake. But they shouldn’t have needed to find another job. The CTO was right to be upset, but they focused their wrath on the wrong person. It takes an entire team to screw up that badly. It would be like getting upset at the person who stepped on a landmine, instead of realizing that the fault lies in there being a damn minefield in the first place.

Me, circa 2015

I was working at a small adtech company, and I deployed a new version of our core platform. The deploy took down our entire production cluster, and we were down for 8 hours while the CTO manually rebuilt the cluster from scratch. I was blamed, but much like the junior dev from the anonymous Reddit post, I had merely set off a landmine that the entire team had spent years carefully pussyfooting around. I didn’t lose my job, but I was chastised for not being “careful enough”.

The entire situation was pretty bonkers. A network-level dependency had been removed from the code, and the deploy script shut down the dependency but didn’t remove it from the server configuration. This caused a freeze on boot while the server waited for the network dependency to become available. The servers were treated like pets, not cattle, so we never rebuilt them from scratch. There was no production-like environment to test deploys against, so there was no way to ferret out problems before they hit production. These were all problems I had raised before and after the outage, but fixing them was never a high enough priority to be worth addressing. There was sound and fury about the importance of uptime, signifying nothing.
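That class of failure is also cheap to catch before a machine ever reboots. As a minimal sketch of the idea (not the actual deploy script — the hostnames, ports, and function names here are all hypothetical), a preflight step could verify that every dependency still listed in the server configuration is reachable, and fail the deploy loudly instead of letting a server freeze on boot:

```python
import socket

# Hypothetical list of network dependencies still present in the server
# configuration. In the incident described above, one entry like this had
# been removed from the code but left behind in the config.
CONFIGURED_DEPENDENCIES = [
    ("cache.internal", 6379),
    ("metrics.internal", 8125),
]

def reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def preflight(deps):
    """Raise before deploying if any configured dependency is unreachable,
    rather than discovering it as a hang during boot."""
    dead = [(host, port) for host, port in deps if not reachable(host, port)]
    if dead:
        raise RuntimeError(f"unreachable dependencies: {dead}")
```

It wouldn’t have fixed the stale configuration, but it would have turned an eight-hour outage into a failed deploy with an error message pointing at the exact leftover dependency.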

Fix the problem, or you’re the problem

When the shit hits the fan, one of two thoughts can immediately come to mind:

  1. What idiot knocked the pile of shit onto the fan?
  2. Why was a pile of shit so close to the fan that this accident could so easily happen?

If you immediately default to #1, that’s a problem that needs to be addressed.

About me

Steven Allen is a software developer with over ten years of experience.

He's seen many companies fail. Don't be one of them.