
Being on-call sucks

Disclaimer: this post represents my own views and it is not endorsed by my employer.

At tech companies, it is fairly common for developers to be "on-call". As Increment magazine (published by Stripe, a leading payment-processing software company) explains:

Similarly to the practice of doctors being on-call at a hospital, a set of engineers is placed on an on-call rotation... they are paged any time something breaks (usually via an automated push notification on their smartphone, a text, or a call), and they are responsible for quickly responding to the page, fixing what broke, and making sure that the same problem never happens again. On-call engineers are the "first responders" of software engineering.

Who Owns On-Call? - Increment Magazine, April 2017

In theory, this practice incentivizes developers to write better code and quickly address operational issues. If you get woken up in the middle of the night because of poor error-handling code, you're going to figure out how to fix it and put guardrails in place so that it doesn't happen again.


Being on-call is rarely a surprise, at least for software engineers. It's usually right there in the job description, or you hear about it during the interview process. When I was on-call, I didn't think it was a big deal.

Responding to incidents was stressful when I was getting started, but it quickly became a pattern-matching exercise. Recent deploy? Roll it back. A certain error? Follow the runbook to fix it. Something really arcane is broken? Call in the one actual expert for help (and pray that they pick up the phone). One of our vendors is down? Figure out a workaround if possible, and otherwise wait. After a few shifts, the stress level decreases (though it never fully disappears) and it can be exciting, even enjoyable, to flip the switches and play firefighter.
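To make the pattern-matching concrete, here is a rough sketch of that mental checklist as a tiny decision function. The `Incident` fields and the returned strings are hypothetical and invented for illustration; this is not any real incident tool's API, just the order of questions I would ask myself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    """Hypothetical summary of what the on-call engineer knows so far."""
    recent_deploy: bool = False
    known_error: Optional[str] = None  # name of a runbook entry, if recognized
    vendor_outage: bool = False
    workaround_available: bool = False


def triage(incident: Incident) -> str:
    """Return the first matching response, checked in order of likelihood."""
    if incident.recent_deploy:
        return "roll back the deploy"
    if incident.known_error is not None:
        return f"follow the runbook for {incident.known_error}"
    if incident.vendor_outage:
        if incident.workaround_available:
            return "apply the workaround"
        return "wait for the vendor"
    # Nothing matched: this is the arcane case.
    return "call the expert"
```

For example, `triage(Incident(vendor_outage=True))` falls through to "wait for the vendor", which matches the "otherwise wait" case above.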

It's easy to lower your standards the longer you've been on-call. On your first on-call shift, you question why you're being paged for a mysterious warning, one that resolves without you having to actually do anything. That initial window of heightened awareness is valuable and fleeting. As it happens again and again, you slack off; you recognize it's that alarm that isn't really alarming, and resolve it without even rolling out of bed. I've done it before, and I've seen it happen with new engineers on my teams. This normalization of deviance makes it easy to ignore an actual problem because it feels like yet another meaningless non-alarm alarm.

When it comes to deciding what to work on next, you have to choose between a small and concrete piece of code that brings new functionality to the product, or... a monitoring change that fixes some flaky alert that might page somebody about once every six weeks. What's the value of getting rid of that infrequent annoyance? To fix the monitor is to remove a grain of sand from the mound: no single grain is particularly significant, but over time you can make a noticeable dent. In practice, we budgeted some time each sprint to constantly chip away at these kinds of operational improvements.
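One way to pick which grain of sand to remove first is to look at which alerts page people without ever needing action. The sketch below assumes you can export your paging history as a list of records; the `Page` type, its fields, and the thresholds are all hypothetical, not something from our actual tooling.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical record of one page from an exported paging history."""
    alert_name: str
    action_taken: bool  # False if it resolved without human intervention


def noisy_alerts(pages, min_pages=2, max_action_rate=0.2):
    """List alerts that fire repeatedly but rarely need anyone to act.

    Returns (alert_name, page_count, action_rate) tuples, most frequent first.
    """
    total = Counter(p.alert_name for p in pages)
    acted = Counter(p.alert_name for p in pages if p.action_taken)
    candidates = []
    for name, count in total.items():
        action_rate = acted[name] / count
        if count >= min_pages and action_rate <= max_action_rate:
            candidates.append((name, count, action_rate))
    # Sort by page count descending, then by name for a stable order.
    return sorted(candidates, key=lambda c: (-c[1], c[0]))
```

An alert that shows up in this list is a candidate for tuning or deletion during that per-sprint operational budget; the thresholds are arbitrary starting points and would need adjusting for a real team.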


Last year as part of my move to product management, I was removed from the pager rotation for my old team. To my surprise, I felt an incredible sense of relief and freedom. Perhaps I had actually just suppressed my own negative feelings about the unavoidable chore.

My phone makes a distinct little "ba-dum" noise when I turn the ringer on, which happens right before I get paged (since I usually keep my phone on vibrate). I would flinch whenever I heard it, expecting the PagerDuty alarm tone to follow (for the tone, I recommend "AH!"). It took a few months to shake that Pavlovian association, and now I can finally turn on my ringer without thinking about work.

It was nice to have peace of mind again - the ability to go grocery shopping without my laptop, or make plans to go hiking without checking my on-call shifts. A friend of mine was on-call 24/7 for the entire week, every three weeks, due to some staffing changes on her team; it was a real drag on her ability to do anything outside the house. And at a different friend's birthday party earlier this year, I watched someone sheepishly find a quiet corner to crack open their laptop. I felt for them; that was me too, until very recently.


If you truly want software that runs 24/7, someone has to be available to take a look when it stops working, even at 4 in the morning. And so we automate turning it off and back on (thanks, Lazarus); tune the alerts; improve the runbooks; pay down tech debt; expand the rotation, follow the sun, whatever. Each of these helps a little, but the experience doesn't feel fundamentally different.

I originally wrote here, "We have accepted the on-call we think we deserve", but that felt a little defeatist. It's true that there is tacit acceptance of today's on-call experience (people do tolerate it, after all). Although working in tech is cushy already, I still hope the on-call experience can be improved. Does anyone have One Weird Trick™ to fix it? If so, let us know.

Thanks to Michael L., Debbie, and Lois for feedback on this post.

Bobbie Chen | work, on-call