Feeling safe

Last year at work, I personally caused a large outage at my company.

Disclaimer: this post represents my own views and it is not endorsed by my employer.

Without going too much into the technical details, I made a change in one of our systems. I go to check the results and see they’re wrong. I start rolling back the change; the progress spinner appears and sits there forever, because of course it does. I switch back to my Slack tab. My teammate has linked an automated alert - a big system is down.

"Is this us?" she asks. Shit. Yes, it's us.

She's being generous, by the way. There is no "us" - only me and me alone. I click through to the alert to find a flood of messages from on-call engineers across the company. They're all affected, and the "+1" reactions pile on in real time. In that moment, I’m immensely grateful that we're working from home, so I don't have to hear the shrill chorus of every pager in the office going off simultaneously.

It's kind of a blur from here, honestly. I ping @here to notify hundreds of people of my failure in real time. There's already an incident response Zoom call going, and I join to see dozens of mostly unfamiliar faces. They stare back at me. They're waiting for me to tell them what's wrong and how we're going to deal with it. I manage to explain the idiotic thing I had done and the likely impact.

The whole time, my thoughts are a constant refrain: Stupid, stupid, stupid. The worst parts are the lulls in activity, with nothing to do but sit there with my thoughts. Sit and wait for the rollback to finish. Sit and wait for more knowledgeable people to come help fix my mistake.

More than anything, I sit and wait for someone to come and yell at me for being an idiot, for doing this stupid thing, for ruining the day for our customers and our coworkers both. I've been yelled at before, so I know I'll probably survive, but the tense anticipation of waiting is almost worse. But I’m spared for now. We focus on fixing the immediate problem first. It’s resolved by the end of the day.

Now that the crisis is over, I can really stress out and wait for the other shoe to drop. I cringe at the sound that rings out every time someone messages me, expecting a harsh castigation and the accompanying shame. The pleasant voice of the "Hummus" notification isn’t so fun right now.

But as the minutes tick by, I am repeatedly surprised. I receive an outpouring of support in my direct messages, from people I talk to daily to those I'd never even met before, consoling me and sharing their own experiences. They were overwhelmingly supportive (and I don't really blame the one person who was more probing - understandable, given the situation).

In the postmortem meetings over the next few days, we spent most of the time discussing the situation I had been in, and the systems and processes that had allowed me to do the wrong thing. As far as I remember, we spent exactly zero meeting minutes discussing that this outage was caused by human error (i.e. a personal failing of mine); there was no proposal for me to be less of a Stupid Person and to just Do The Right Thing next time.

We came away with concrete plans to improve our systems at every level, and then over the next few weeks, we implemented them: improving the warning UI for dangerous actions, revising our process for making changes in general and re-training, and updating our runbooks to improve our ability to respond quickly and effectively.

That’s all that happened, really. Sorry, this is not one of those stories where the CTO yells at me to leave and never come back. No one threw me under a bus in a customer-facing briefing. It turns out in real life (or at least my life), it’s totally possible to totally screw something up, have everybody notice, and feel a sense of impending doom growing, and growing, and growing and growing, that feeling of anticip--

And then, the world doesn’t end. Anticlimactic, I know.

It's one thing to hear during training that your company practices blameless postmortems - it's quite another to experience it directly. I had both caused and helped fix other incidents before this one, but the scale of the outage and corresponding response really hit home for me.

In the days after the incident, I spent a solid chunk of time in a state of self-flagellation, scolding myself for my errors. Looking back, I'm glad that no one else thought it was necessary to pile on. Even with all the support and reassurance I received from my teammates, it took some time for me to regain my confidence and day-to-day productivity. I distinctly remember, more than once, suddenly blinking to realize I had been sitting there for minutes, just staring at my laptop.

Soon enough, I regained my focus and wrote code productively again. A bit later, I recovered the ability to loudly express my opinions in group meetings (too often, some might say). And in terms of growth, I'm more risk-averse these days - wouldn't want to cause another big outage, after all.

More importantly, going forward, I can feel safe. Knowing that I'll have the support of my colleagues, I have the confidence to confess my mistakes without fear of personal judgment or punishment. The postmortem process will help us improve our systems and processes so that everyone benefits - not just those of us who have undergone trial-by-fire.

And during future incidents, when someone else is sweating in the hot seat, I can reach out with the same empathy that others showed me, and tell them: I know you probably feel terrible right now. It's okay. No one blames you personally. We'll fix this, and then we'll fix the path that misled you here too.

I know, because I've been there too.


Thanks to Joanne for editing this post. Thanks to my teammates, colleagues, and management for their support, especially those who reached out personally.

reflection, work

Bobbie Chen