matthiasr/ops.md

## ops.md

      
Q: We are redoing our centralized on-call. The ideal would be to have each 4-5 person pod on call for its own services, but there's concern about the productivity hit.
A: For now, let's set the human element aside and only consider work hours.
Something I think about a lot in these situations is feedback loops and responsibility. Take the point of view of whoever decides what's a priority sprint over sprint: if the systems are shoddy and it takes one of four engineers to keep them alive, they take the hit to their feature delivery. If the keeping-alive is done by some central ops team, or by a diffuse on-call rotation that only hits your team every 3 months, congratulations: you've externalized the consequences of your decisions while getting to claim credit for all the stuff you've delivered. The SWE productivity hit your managers are lamenting is the price of internalizing these costs. The benefit is autonomy: you get to make prioritization decisions locally, based on what makes sense for this area. Structurally, you can choose between centralized on-call with mechanisms to push back on quality issues (change management, mandatory review, slow releases…) or decentralized on-call where the effects of local decisions are felt locally.
Many issues during the day stem from changes being made, and it's actually helpful for team velocity to keep on call close – it means you can be looser about change control when you know that doing it the quick way doesn't affect the user experience unacceptably. My favorite example here is a team that wanted to be on call for their own database because that meant they could YOLO ALTER TABLE all day. They weren't using the replicas, so they didn't have to worry about the replication delays that the central database on-call would not tolerate.
Now, one-in-four on call is acceptable during work hours, when you're just working on different things, but it's not sustainable outside of them. You don't have to have the same setup for both though – you can set up the schedules and escalations so that during the day, the pod gets paged first and the central (or probably better: zone-level) on call is secondary, and outside of work hours this inverts. This has the added benefit that the pod stays sharp about the operational reality of their system and has the muscle to resolve issues quickly if they get called in for a tricky one. They're also incentivized to keep their runbooks for recurring issues and generic mitigations in shape, because that means fewer out-of-hours escalations.
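
To make the inversion concrete, here is a minimal sketch of the routing rule. Everything in it is an assumption for illustration: the 09:00-17:00 Monday-to-Friday window, the `"pod"` and `"central"` labels, and the function names are hypothetical. In practice you would express this in your paging tool's schedule and escalation configuration rather than in code.

```python
from datetime import datetime, time

# Hypothetical work-hours window; adjust to your team's actual hours.
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def is_work_hours(now: datetime) -> bool:
    """True Monday to Friday from WORK_START (inclusive) to WORK_END (exclusive)."""
    return now.weekday() < 5 and WORK_START <= now.time() < WORK_END


def escalation_order(now: datetime) -> list[str]:
    """Return who gets paged first and who is the fallback at local time `now`."""
    if is_work_hours(now):
        # During the day the pod is primary and the central on call backs it up.
        return ["pod", "central"]
    # Outside of work hours the order inverts.
    return ["central", "pod"]
```

Calling `escalation_order(datetime(2024, 1, 2, 10, 0))` (a Tuesday morning) puts the pod first, while the same call for 22:00 or for a Saturday puts the central on call first.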