In the traditional job description for an enterprise software developer, the day ends when they check in their code and head for the door. If the application they are working on malfunctions in production, they might be consulted during work hours, but they’re not, typically, woken up in the middle of the night. That job – being on-call to respond to production issues on the spot – falls to the site reliability engineer (SRE).
But today we need to rethink who carries “the pager,” that is, who’s woken up in the middle of the night when there’s an issue with deployed code. (The “pager” today may be a smartphone app, or in some cases an actual physical pager. Regardless, the impact on your sleep cycle is the same.)
In 2015, when I was an SRE and we were launching a new online video service, I was on pager duty a lot. There were several middle-of-the-night fire drills that involved issues with the applications, and the authors of the applications were not on the call. In such cases, we did what we could to make the application functional again, and waited until morning to get the issue addressed more permanently.
Was there, and is there, a better way? Who, really, should carry the pager? Is it a burden SREs should shoulder on their own? Or should developers be alerted when code they authored breaks? I believe it is a shared responsibility: both SREs and application developers should get pager duty. Here are three reasons why.
Operations and developers each have their own areas of discipline, and each is ultimately responsible for the code they manage, which hopefully was built with quality from the beginning. Of course, that does not mean the code is free of defects. In many organizations, when an alert is triggered and the operations team responds, a quick fix can be as easy as restarting processes. In some cases, there is a much larger issue that needs the attention of the application developers. In such cases, operations plays the critical role of providing information gathered from metrics and logs to help an application developer troubleshoot the issue.
So if an incident that needs an application developer’s attention occurs after hours, the operations team remediates the issue by restarting processes or putting other stopgaps in place that last until business hours, when application developers are available. But if application developers received alerts alongside SREs, they would be brought into the fold during a service disruption and could develop first-hand experience of the issue in real time, gaining insight into how their code performs in production. When developers, and maybe even architects, participate, it could lead to better decisions being made upstream in the architecture, design, and coding stages.
Every creator deserves to get the insight of watching their creation at its toughest moments.
With shared pager duty, the right people can work on the issues they own. In other words, aside from restarting a process or application, there isn’t a lot an operations person can do with the application code itself should it fail. In addition, it is more difficult for SREs and operations to learn lessons about how to better construct the application, and that knowledge better serves application developers anyway. Knowing that alerts could wake you up in the middle of the night would create a stronger sense of ownership, along with an immediate sense of urgency behind incidents. The threat of a pager call might even improve software reliability.
SRE and operations teams must still be on the hook for maintaining the infrastructure, and they do not escape middle-of-the-night wake-up calls during an incident or outage. Only the scope of responsibility gets new limits. Alerting could also spill over to operations, but that’s an escalation or transfer of ownership made in real time.
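To make the split in responsibility concrete, here is a minimal sketch of how a shared rotation might route pages. It is an illustration under stated assumptions, not a description of any particular paging product: the `Scope` categories, the responder names, and the `ROTATION` table are all hypothetical, and a real setup would express the same idea in whatever alerting tool the team already uses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Set


class Scope(Enum):
    APPLICATION = "application"        # failures in code the developers own
    INFRASTRUCTURE = "infrastructure"  # failures in what SRE/operations own


@dataclass(frozen=True)
class Alert:
    service: str
    scope: Scope
    summary: str


# Hypothetical rotation: the owner is paged first; application alerts can
# spill over to operations as an escalation.
ROTATION = {
    Scope.APPLICATION: ["developer-on-call", "sre-on-call"],
    Scope.INFRASTRUCTURE: ["sre-on-call"],
}


def page_order(alert: Alert) -> List[str]:
    """Return responders in the order they would be paged for this alert."""
    return list(ROTATION[alert.scope])


def next_responder(alert: Alert, unacknowledged_by: Set[str]) -> Optional[str]:
    """Pick the first responder who has not already failed to acknowledge."""
    for responder in page_order(alert):
        if responder not in unacknowledged_by:
            return responder
    return None  # nobody left to escalate to
```

With this table, an application alert pages the developer first and reaches the SRE only if the developer does not acknowledge it, while an infrastructure alert stays with the SRE.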
As the management saying goes, “Never waste a crisis.” The insights provided by critical incidents are valuable, and they stay with the development team because of the direct experience of seeing applications in production. Feedback is immediate, and the handoff to other developers is faster than waiting for a ticket or issue reported by the SRE team. When the issue sits in a backlog behind competing items, some time can pass before anyone is allocated to address it. By then context is lost, and with it valuable knowledge that the development team could otherwise have added to their collective base of experience.
Of course, there are no hard and fast rules for who should participate in on-call rotations. I outlined the benefits to an organization should developers and operations choose to share the on-call duties. But what do you think? Comment below and tell us about it from your point of view.