Seven Operational Practices That Reduce Downtime During Large-Scale Cloud Incidents

High-scale systems fail in unexpected ways you would never have designed for. Over the last 14 years I have worked across the layers of physical and virtual networking, first as an individual contributor writing code for data plane services and later leading global teams running highly distributed services spanning millions of hosts. I have seen a wide range of incidents: multi-service impact, single-service impact, cascading failures, single-customer issues, services failing during or after incident recovery, and services that simply could not auto-recover. The list goes on. I have also studied the root causes of major outages across the industry’s cloud leaders, and the same failure patterns show up again and again. These events are inevitable, but in my experience the practices below will greatly improve your ability to handle them. Here are the seven best practices I recommend to keep teams effective during large-scale incidents and reduce the impact time.

1. Mitigate First, Root Cause Later

During an outage, the natural tendency of engineering teams is to hunt for the underlying cause. Instead, you should always drive the discussion toward how to mitigate the impact. I have seen teams spend critical time debugging code while customer impact was still ongoing. Most of the time, you do not need to know the root cause in order to mitigate the issue or execute the recovery steps. If an incident correlates with an in-flight deployment and a spike in 5xx errors, roll back the deployment immediately instead of debugging the code to identify the bug. If a single host is failing, remove it from the fleet right away. The deep-dive analysis can happen once the impact is mitigated. If you have enough hands on deck, divide and conquer: task one group with immediate mitigation and another with the root cause investigation.
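As a rough sketch of what "mitigate first" looks like in automation, the snippet below rolls back a deployment when it correlates with a 5xx spike. The `metrics_client` and `deploy_api` objects, along with their methods, are hypothetical stand-ins for whatever monitoring and deployment systems your team actually uses; only the pattern is the point.

```python
# Minimal sketch of a "mitigate first" guard: if the 5xx rate spikes while a
# deployment is in flight, roll back immediately and leave root-causing for later.
# metrics_client and deploy_api are hypothetical stand-ins for your own
# monitoring and deployment tooling.

ERROR_RATE_THRESHOLD = 0.05   # 5% of requests returning 5xx
CHECK_WINDOW_SECONDS = 300    # look at the last 5 minutes


def should_roll_back(metrics_client, deployment) -> bool:
    """Return True when an in-flight deployment correlates with a 5xx spike."""
    error_rate = metrics_client.get_error_rate(
        service=deployment.service,
        window_seconds=CHECK_WINDOW_SECONDS,
    )
    return deployment.in_progress and error_rate > ERROR_RATE_THRESHOLD


def mitigate(deploy_api, metrics_client, deployment) -> None:
    if should_roll_back(metrics_client, deployment):
        # Mitigation comes first; the offending change gets debugged after recovery.
        deploy_api.roll_back(deployment.id, reason="5xx spike during rollout")
```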

2. Don’t Be a Hero: Ask for Help When You Need It

Earlier in my career, I mistakenly thought reaching out for help would be seen as a sign of weakness. I was always tempted to solve every incident myself in order to prove my technical and operational ability. Over time I realized this is often the wrong approach: it creates a single point of failure and delays mitigation. A useful rule of thumb is to escalate the moment you are blocked. A peer who is a domain expert, or a senior tech lead, brings more experience to the table and can correlate the incident with previous outages, something that is hard to do yourself when you are under pressure or stuck.

3. Test Your Tools Regularly

Reliable tooling is essential for handling an operational incident. Teams often rely on scripts or automation that are only used during rare events, and because those tools are not exercised regularly, they tend to fail exactly when they are needed most. Broken tools during an incident delay mitigation further and push teams toward manual, untested workarounds, which in turn can introduce errors that widen the impact or delay recovery even more. Treat your operational tools with the same rigor and quality as your production code. One way to do this is to run unit tests and component-level tests in a pre-production environment every time the tools are updated or their dependencies change. Catching changes that break the tools immediately, rather than during a large-scale event, improves your team’s effectiveness and operational posture for handling service outages.
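As an illustration, here is what a minimal test for such a tool might look like, assuming a hypothetical `drain_host` helper that takes an unhealthy host out of rotation. The helper and its tests are a sketch, not a real library; the point is that even rarely used scripts get exercised on every change.

```python
# Illustrative pytest-style checks for an operational script. drain_host is a
# hypothetical helper that removes a host from rotation; running tests like
# these on every change keeps rarely-used tooling from rotting.

def drain_host(host: str, fleet: list[str]) -> list[str]:
    """Return the fleet with the unhealthy host removed (no-op if absent)."""
    return [h for h in fleet if h != host]


def test_drain_removes_only_target_host():
    fleet = ["host-a", "host-b", "host-c"]
    assert drain_host("host-b", fleet) == ["host-a", "host-c"]


def test_drain_is_noop_for_unknown_host():
    fleet = ["host-a", "host-b"]
    assert drain_host("host-z", fleet) == fleet
```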

4. Verify and Validate

When making production changes in response to an ongoing incident, it is better to be slow and safe than to rush and break things. There are plenty of examples of someone running the wrong command or making a manual change to a system, service, or database during an event and breaking production further. Always verify production changes through tests, additional reviews, and approvals before executing them. One way to enforce this is to mandate that every production change go through a formal peer review or an “over the shoulder” second pair of eyes. Taking an extra 30 seconds to verify your work can avert errors that would compound the impact during an incident. After executing a command or change, it is equally important to validate the result by querying or inspecting the system to confirm the change behaved as expected.
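A rough sketch of this verify-then-validate pattern is shown below: preview the change, require explicit confirmation, then check the end state afterwards. The `apply_change` and `read_state` hooks are hypothetical placeholders for whatever change-management interface your team uses.

```python
# Sketch of a verify-then-validate wrapper for production changes. The
# apply_change and read_state callables are hypothetical hooks into your own
# change-management system; the pattern is what matters.

def guarded_change(apply_change, read_state, expected_state, dry_run=True):
    """Preview a change, require explicit confirmation, then validate the result."""
    if dry_run:
        # Verification step: show what would happen and stop for review/approval.
        print(f"[dry run] would apply change; expected end state: {expected_state}")
        return False

    apply_change()

    # Validation step: confirm the system actually reached the intended state.
    actual = read_state()
    if actual != expected_state:
        raise RuntimeError(f"Validation failed: expected {expected_state}, got {actual}")
    return True
```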

5. Avoid the “Context Tax”

One of the most common anti-patterns in large-scale event handling is the lack of a shared understanding of the issue at hand. Every time a new person joins the bridge and asks the same questions about the event, operators have to switch context from mitigation to explaining what they already know. You can avoid this by keeping a clear, written summary of the event in the incident tracker and directing questions to it. A good summary includes the start time, the nature of the impact (latency vs. errors), the magnitude of the impact, the scope (partition vs. zonal vs. regional), the recovery metrics being tracked, and the active threads with owners and estimated completion times. This avoids losing valuable time and lets operators stay focused on mitigation. Another good habit is to always post a short note explaining the relevance of a graph rather than posting the graph without context, so everyone can see what the data is showing and support you more effectively.
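For illustration, the summary can even be captured as a small structured record so that none of the fields get forgotten. The fields below mirror the list above; the names are illustrative rather than any standard incident schema.

```python
# One possible shape for the shared event summary, kept in the incident tracker
# so new joiners can self-serve context instead of asking on the bridge.
# Field names are illustrative, not a standard schema.

from dataclasses import dataclass, field


@dataclass
class IncidentSummary:
    start_time_utc: str                 # e.g. "2024-05-01T14:32Z"
    impact_nature: str                  # "elevated latency" vs. "5xx errors"
    impact_magnitude: str               # e.g. "~8% of requests failing"
    scope: str                          # "partition" / "zonal" / "regional"
    recovery_metrics: list[str] = field(default_factory=list)
    active_threads: list[dict] = field(default_factory=list)  # owner, task, ETA
```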

6. Aggressively Filter Distractions

It is important to stay focused on mitigation during a large-scale event. Complex events have many participants, and many of them will arrive with their own theory of what the issue might be. Hearing different perspectives is valuable, but it can also be counter-productive and make the call go in circles for hours, usually because those theories are not backed by data or evidence. An incident manager must keep the discussion on a logical, data-driven path and track the associated investigation threads in a visible document. If a participant offers a new theory that is not backed by data, move it to a backlog of pending action items or investigate it separately from the main threads.

7. Drills Over Documentation

Teams typically use documents, training videos, and standard operating procedure (SOP) guides to onboard new members to on-call rotations. While this sounds reasonable, I have found that new members are far more effective when they get hands-on exposure. You can achieve this by extending your onboarding process with operational drills, simulated in a pre-production environment, alongside the training material. During these drills, new on-calls mitigate a simulated issue by using the tools, following the SOPs, and executing the escalation process, just as they would for a production event. Being well prepared through drills helps operators stay calm and be better equipped to handle real events.
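A drill can be as simple as injecting a fault in staging and checking whether the documented mitigation restores health. The sketch below assumes hypothetical `inject_fault`, `run_mitigation_sop`, and `health_check` hooks into your own pre-production tooling.

```python
# Toy sketch of a drill harness for a pre-production environment: inject a
# fault, then check that the on-call's mitigation restored the health signal.
# inject_fault, run_mitigation_sop, and health_check are hypothetical hooks
# into your own staging tooling.

def run_drill(inject_fault, run_mitigation_sop, health_check) -> bool:
    """Simulate an incident end-to-end and report whether mitigation worked."""
    inject_fault()               # e.g. take a staging host out of service
    run_mitigation_sop()         # the trainee follows the documented SOP
    recovered = health_check()   # did the service return to healthy?
    print("Drill passed" if recovered else "Drill failed: review SOP and tooling")
    return recovered
```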

Final Thoughts

Networking components and large-scale distributed systems built on cloud infrastructure have become critical, foundational pieces for many software companies, and ensuring their availability and resilience is essential. We have all read about cloud outages that disrupt day-to-day operations across entire sectors of the industry that depend on cloud providers. As these systems grow more complex, and as the rise of AI adoption multiplies the interdependencies between services, it becomes even more important to build discipline in operational hygiene. Failures cannot be avoided entirely, but with the right processes and culture in place you can recover quickly and keep disruption to a minimum for the use cases that depend on cloud technologies. By prioritizing the best practices above, we move from a reactive mode to a proactive, well-prepared, and more disciplined operational culture.
