The AWS US-East Outage: A Wake-Up Call For Cloud Resilience
What do my daughter waking me up because Alexa won’t play a song, a colleague being stuck at Charles de Gaulle Airport, and problems logging into the UK’s HMRC website have in common?
Answer: The fourth outage in five years for AWS’s US-East region, its oldest and largest. The problem was traced to DNS resolution failures that affected many core services, including DynamoDB, EC2, Lambda, IAM, and routing gateways. While AWS reported mitigation within hours, lingering impacts continued to affect consumer platforms, financial services, enterprise tools, and government portals, to name a few.
AWS powers millions of websites and applications, elevating a technical glitch from an inconvenience to a global disruption. This particular outage exposes core issues with cloud resilience that stem from overreliance on services such as DNS, which were not architected for cloud-era technology demands. It also highlights how concentration risk, dangerously powerful yet routinely overlooked, arises when so many companies across all industries depend on a single cloud provider and, more pertinently, on a single region operated by that vendor. But the problem goes beyond internal AWS regional dependencies to the logical dependencies across the platform. DynamoDB, the first service identified as impacted by the DNS issues, plays a central role in other AWS services for analytics, machine learning, search, and more.
The Shared Responsibility Model Shifts Blame To The Customer
There’s great appeal to using tech giants, but assuming they are too big to fail or inherently resilient is a mistake, as this outage and its predecessors show. As for resilience promises, AWS directs customers to its shared responsibility model to delineate where it takes ownership of service availability and what customers are responsible for. But when core services like DNS fail, even well-architected applications can become unstable. AWS works to fix its infrastructure, but many enterprises that followed the recommended design patterns are still left waiting until it does. This is not exclusively an AWS problem, but it has become a recurring one, specifically for the US-East region, with customers left holding the bag for the outage’s impact.
Concentration Risk And Cascading Issues Make Resilience Efforts Cloudy
Convenience often wins out over navigating the complex, nested dependencies of highly concentrated environments. Despite past outages, organizations that failed to address that complexity got a front-row seat as cascading issues disrupted systems, processes, and operations. The entrenchment of cloud, especially AWS, in modern enterprises, coupled with an interwoven ecosystem of SaaS services, outsourced software development, and virtually no visibility into dependencies, is not a bug; it is a feature of highly concentrated risk, where even small service outages can ripple through the global economy.
What You Can, And Should, Do Now
From a cloud resilience perspective, enterprise tech leaders have two lines of action they need to pursue now: Build the tools to increase technology systems’ reliability, and address contractual gray areas related to shared responsibility models with cloud (and SaaS) vendors.
On the technology side:
- Invest in infrastructure observability and analytics. This is the first line of defense for production systems, giving you early visibility into outages so that you can respond with workarounds or alternative infrastructure; a minimal probe sketch follows this list. Otherwise, you’re relying on a cloud provider’s blog post to describe the outage after it has already taken down key operations.
- Build an infrastructure automation platform. To fix problems as early as possible, connect your observability data and correlated analytics to automation that responds while issues are still small and manageable (see the failover hook sketched after this list). These capabilities converge in AIOps platforms, but each capability is independent and should be considered strategically. Third-party tools can give you a bird’s-eye view of your overall cloud estate, especially in multicloud environments.
- Use content delivery networks (CDNs) to cache static content at edge locations, shielding users and dependent applications from origin outages; the cache-header sketch after this list shows one approach. That won’t be cheap, but neither is an outage that knocks out critical IT operations and leaves you waiting helplessly.
- Develop application portability, and add regions or clouds for key workloads. If you have a critical application, be ready to move it on a dime; the region-fallback sketch after this list illustrates the idea. This might mean a disaster recovery (DR) architecture in another region, cloud, or datacenter. It may involve investment in data resilience tools or replication technologies. The details will depend on your specific application needs, and evaluating those needs should come out of a well-designed risk management process. Focus investment on functions that affect customers, drive critical infrastructure, or move money.
- Test your infrastructure and application resilience. Use chaos engineering tests, like the DNS-failure test sketched after this list, to learn how your applications fail and to design ways to avoid those failures. Test DR plans and backups to confirm that you aren’t missing key steps, that the processes are clear, and that security-related scenarios are coordinated with the colleagues responsible for securing enterprise systems and data. Supplement catastrophic ransomware-response tabletop exercises with workshops on how to withstand protracted outages or maintain transaction integrity during short-term disruptions.
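The sketches that follow are illustrative, not prescriptive. First, observability starts with knowing, before your provider’s status page tells you, whether the endpoints you depend on still resolve and respond. This minimal Python probe checks DNS resolution and HTTPS reachability; the endpoint names are examples, and a real deployment would ship results to your monitoring pipeline rather than print them:

```python
# Minimal availability probe: checks DNS resolution and HTTPS reachability.
import socket
import time
import urllib.error
import urllib.request

ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",  # example AWS service endpoint
    "api.example.com",                   # hypothetical: one of your own APIs
]

def probe(host: str, timeout: float = 3.0) -> dict:
    """Return DNS and HTTP reachability plus rough latency for one host."""
    result = {"host": host, "dns_ok": False, "http_ok": False, "latency_ms": None}
    start = time.monotonic()
    try:
        socket.getaddrinfo(host, 443)  # fails fast on DNS resolution problems
        result["dns_ok"] = True
        urllib.request.urlopen(f"https://{host}/", timeout=timeout)
        result["http_ok"] = True
    except urllib.error.HTTPError:
        result["http_ok"] = True       # an HTTP error still proves the host is reachable
    except OSError:
        pass                           # DNS failure, timeout, or connection refused
    result["latency_ms"] = round((time.monotonic() - start) * 1000)
    return result

if __name__ == "__main__":
    for host in ENDPOINTS:
        print(probe(host))
```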
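Second, detection pays off only when it is wired to a response. The hypothetical remediation hook below repoints a DNS record you control at a standby region, via the Route 53 API in boto3, after a monitor such as the probe above reports repeated failures. The hosted zone ID, record name, and standby target are placeholders, and real automation deserves guardrails such as approval gates and rate limits:

```python
# Hypothetical remediation hook: flip a DNS record to a standby region
# after repeated probe failures. Zone ID and hostnames are placeholders.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000EXAMPLE"             # hypothetical hosted zone
RECORD_NAME = "app.example.com."               # hypothetical record you control
STANDBY_TARGET = "app.us-west-2.example.com."  # hypothetical standby endpoint
FAILURE_THRESHOLD = 3

def fail_over_dns() -> None:
    """Repoint the app CNAME at the standby region with a short TTL."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "Automated failover after repeated probe failures",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "CNAME",
                    "TTL": 60,  # short TTL so the flip propagates quickly
                    "ResourceRecords": [{"Value": STANDBY_TARGET}],
                },
            }],
        },
    )

def on_probe_results(consecutive_failures: int) -> None:
    """Call this from your alerting pipeline; act while the problem is small."""
    if consecutive_failures >= FAILURE_THRESHOLD:
        fail_over_dns()
```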
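Third, much of the shielding a CDN can offer rests on standard HTTP caching semantics. The sketch below, using Flask purely for illustration with a hypothetical route and payload, sets the stale-while-revalidate and stale-if-error directives from RFC 5861, which permit a CDN to keep serving cached copies while the origin is down; CDN support for these directives varies, so verify with your provider:

```python
# Cache headers that let a CDN serve stale content during an origin outage.
from flask import Flask, make_response

app = Flask(__name__)

@app.route("/assets/config.json")  # hypothetical cacheable endpoint
def cached_asset():
    resp = make_response({"feature_flags": {"checkout": True}})  # example payload
    # Cache for 5 minutes; allow serving a stale copy while revalidating for
    # 10 minutes, or for up to 24 hours if the origin is erroring or unreachable.
    resp.headers["Cache-Control"] = (
        "public, max-age=300, stale-while-revalidate=600, stale-if-error=86400"
    )
    return resp
```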
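Fourth, portability can start small, at the client. Assuming a DynamoDB global table replicated to a second region (the table name, key, and regions here are hypothetical), this sketch tries the primary region with tight timeouts and falls back to the replica:

```python
# Region-fallback read for a hypothetical DynamoDB global table.
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "us-west-2"]  # primary first, standby second
TABLE_NAME = "orders"                 # hypothetical global table

def get_order(order_id: str):
    """Try the primary region; fall back to the replica on failure."""
    last_error = None
    for region in REGIONS:
        try:
            table = boto3.resource(
                "dynamodb",
                region_name=region,
                config=Config(connect_timeout=2, read_timeout=2,
                              retries={"max_attempts": 1}),
            ).Table(TABLE_NAME)
            return table.get_item(Key={"order_id": order_id}).get("Item")
        except (BotoCoreError, ClientError) as exc:
            last_error = exc  # note the failure and try the next region
    raise RuntimeError("all regions failed") from last_error
```

Replica reads are eventually consistent, so this pattern fits read paths that tolerate slightly stale data; writes need more care.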
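Finally, you can rehearse the exact failure mode behind this outage in a unit test. This pytest sketch monkeypatches socket.getaddrinfo so that every name lookup fails, then asserts that the fallback code above (imported from a hypothetical module path) degrades into a controlled error instead of hanging:

```python
# Chaos-style unit test: simulate regionwide DNS resolution failure.
import socket

import pytest

from myapp.orders import get_order  # hypothetical home of the fallback sketch

def test_survives_dns_failure(monkeypatch):
    def broken_getaddrinfo(*args, **kwargs):
        raise socket.gaierror("simulated DNS resolution failure")

    # All outbound connections now fail at name resolution, as in the outage.
    monkeypatch.setattr(socket, "getaddrinfo", broken_getaddrinfo)

    with pytest.raises(RuntimeError):  # controlled failure, not a hang or crash
        get_order("order-123")
```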
For managing third-party risk in cloud and SaaS suppliers:
- Understand the limitations of regulations. The EU’s Digital Operational Resilience Act (DORA) is an attempt to improve the resilience of critical infrastructure, but it has limitations: It applies only to the financial sector, and it focuses on the responsibilities of cloud customers while ignoring the hyperscalers’ role in improving the core resilience of their own systems. Don’t confuse being compliant with being resilient. Instead, identify your main sources of risk, model scenarios, and create mitigation plans to minimize the pain of disruption.
- Map critical dependencies. Identify and map all third-party and cloud service dependencies to the technology assets and business processes they support; a simple concentration-mapping sketch follows this list. Focus on customer-facing apps, single points of failure, and hidden connections that could amplify outage impacts. Don’t settle for web documentation: Insist that your cloud technical account managers walk you through the specifics of your environment.
- Reevaluate your third-party risk strategy and approach. If your third-party risk management (TPRM) program is overly focused on compliance, you’ll likely miss significant events like this one, which hit even compliant vendors. Tech leaders can’t afford to skip assessing vendors across multiple risk domains, such as business continuity and operational resilience, not just cybersecurity. They also need to map their third-party ecosystem to identify significant concentration risk among vendors, especially those that support critical systems or processes.
- Use the contract as a risk mitigation tool. With major technology outages becoming all too common, work with procurement and legal to update or add clauses that assign accountability during disruptive events and clearly outline time frames for vendors to patch and remediate. Consider using such incidents and their impacts as a basis for implementing measures in contracts or service-level agreements. If you want financial compensation or discounts for downtime, be prepared to bargain for it. If vendors push back, consider whether the price you negotiated still makes sense and, possibly, whether to do business with them at all.
- Prioritize continuous monitoring and corrective action. When companies identify third-party risks but don’t act on them, risk management efforts stagnate or fail. Your third parties are dynamic entities, and their risk, compliance, and resilience posture will change over time. Augment annual assessments with continuous monitoring tools that can identify vendor changes in near real time; the status-feed polling sketch after this list is a minimal starting point. Close the loop with TPRM platforms that automate issue creation, launch remediation plans, and trigger the notifications required to approve and verify that each risk has been addressed within your risk appetite and regulatory requirements.
- Test vendor resilience plans. Require validation of your vendors’ recovery and continuity plans through tabletop exercises and outage simulations.
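On the dependency-mapping point above, even a spreadsheet-grade model beats no model. This sketch inverts a hypothetical map of business processes to their third-party and cloud dependencies, surfacing the services whose failure would take down the most processes, which is exactly where concentration risk hides:

```python
# Invert a process-to-dependency map to expose concentration risk.
from collections import defaultdict

# Hypothetical map of business processes to third-party/cloud dependencies.
DEPENDENCIES = {
    "checkout":        ["payments-saas", "dynamodb.us-east-1", "auth-provider"],
    "customer-portal": ["dynamodb.us-east-1", "cdn-vendor", "auth-provider"],
    "reporting":       ["warehouse-saas", "dynamodb.us-east-1"],
}

def concentration_report(deps: dict) -> dict:
    """For each dependency, list the processes that break if it fails."""
    impact = defaultdict(list)
    for process, services in deps.items():
        for service in services:
            impact[service].append(process)
    return dict(impact)

if __name__ == "__main__":
    report = concentration_report(DEPENDENCIES)
    for service, processes in sorted(report.items(), key=lambda kv: -len(kv[1])):
        flag = "  <-- single point of failure" if len(processes) == len(DEPENDENCIES) else ""
        print(f"{service}: impacts {processes}{flag}")
```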
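And on continuous monitoring: many vendors publish machine-readable status feeds. This sketch polls Statuspage-style JSON endpoints, a common but not universal format, and opens an internal issue whenever a vendor reports degradation; the URLs and the create_issue() hook are hypothetical stand-ins for your vendors and your TPRM or ticketing integration:

```python
# Poll vendors' public status feeds and raise internal issues on degradation.
import json
import time
import urllib.request

VENDOR_STATUS_FEEDS = {  # hypothetical vendors and status-page URLs
    "payments-saas": "https://status.payments-vendor.example/api/v2/status.json",
    "cdn-vendor": "https://status.cdn-vendor.example/api/v2/status.json",
}

def create_issue(vendor: str, indicator: str) -> None:
    """Placeholder for your TPRM or ticketing integration."""
    print(f"ISSUE: {vendor} reports status '{indicator}'")

def poll_once() -> None:
    for vendor, url in VENDOR_STATUS_FEEDS.items():
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                indicator = json.load(resp)["status"]["indicator"]
        except (OSError, KeyError, ValueError):
            indicator = "unreachable"  # the status page itself may be down
        if indicator != "none":        # Statuspage-style all-clear is "none"
            create_issue(vendor, indicator)

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(300)  # every 5 minutes; tune to your risk appetite
```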
If you’re a Forrester client, request an inquiry or guidance session with us to build a stronger cloud resilience strategy for your future.