Mind the Cloud
Amazon Outage, Complexity and Interdependence
Thoran Rodrigues
OCT 25, 2012 14:47

If you follow any kind of news on cloud computing, or even technology in general, then you’ve probably heard about the recent outage that Amazon’s cloud services suffered on Monday, October 22. Why do we still see major sites and services being taken down by such outages?

Throughout the day, Amazon had availability, performance and connectivity issues with most services in one of their datacenters or, as they call them, availability zones. While some people may take this opportunity to decry the dangers of cloud computing, it should by now be well known that cloud services, like any other, suffer failures and outages.

Today, any service failure is much more noticeable. As we’ve come to depend on sites and web-based services for everything from reading the news and getting email to keeping our to-do lists in order, whenever a site with a large user base goes offline, it becomes instant news. While in the past small service interruptions could go unnoticed, as the user base grows, any disruption or reduction in service quality is quickly identified by users and must be addressed.

This visibility of failures, combined with the fact that quality-of-service issues can drive users away, means that all major websites and services should be well prepared to deal with issues originating with their underlying providers. They are, after all, major sites, so they should understand the need for contingencies, secondary hosting providers and so on. Why is it, then, that whenever a cloud provider such as Amazon goes down, so many sites go down with it?


Complexity and Interdependence

The main issues surrounding these failures are matters of complexity and interdependence of services. Gone are the days when a website (or web service) was simply a physical server somewhere connected to the internet. Today, most large sites and services rely heavily on cloud computing to optimize costs and improve the user experience: virtual machines for easy scalability and replication, virtual storage to simplify content management, content delivery networks to reach a wider audience, load balancers to improve performance, and so on. It’s easy to think that, just because your favorite website is hosted on Amazon, all it amounts to is a virtual machine instance that could be replicated in another datacenter. The truth, however, is that these environments are usually much more complex and harder to replicate than that.

This complexity and interdependence between systems creates “nightmare” scenarios, in which failures in one part of the system can quickly propagate to others in a runaway fashion. Let’s take Monday’s outage as an example: Amazon’s troubles seem to have started with an issue related to their virtual storage service (Elastic Block Store, or EBS). This issue, in turn, started impacting virtual machine performance and the performance of their database service, which then quickly propagated to other services, from cloud search to the management console.

It’s quite easy to imagine a situation where “naïve” automated recovery processes developed by Amazon customers detect a performance issue and simply reissue requests or allocate new storage, virtual machines or database instances. This could quickly flood Amazon’s servers with service requests, impacting other services in turn. The same goes for the management console: what happens when a large number of users experience problems and all try to log into the console and issue requests from there at once? While this cascade of failures may not be the true cause of the service outage, it is far from an impossible scenario.
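As an illustration of what a better-behaved recovery process might look like, here is a minimal retry sketch in Python (the function name and parameters are mine, not anything Amazon-specific): instead of immediately reissuing a failed request, it backs off exponentially and adds random jitter, so that thousands of clients recovering at the same moment don’t hammer the provider in lockstep.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Retry a failing operation with exponential backoff and jitter.

    Spacing retries out (and randomizing them) avoids the "retry storm"
    where every client re-sends its request at once and makes a
    degraded service even worse.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Exponential backoff: base, 2x base, 4x base, ... capped at max_delay.
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Full jitter: sleep a random amount within the window, so
            # simultaneous clients spread their retries over time.
            time.sleep(random.uniform(0, delay))
```

A recovery process built this way still retries, but each round of retries is both rarer and more spread out than the last, giving an overloaded service room to recover.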

The failure of multiple services at once, like the one that happened, creates still other issues. Even if a customer had set up an automated process to monitor their cloud services, the performance degradation of servers and the issues that plagued Amazon’s own monitoring service (CloudWatch) could render any such process useless. When we build software to run on the cloud, we need to add another layer to our thinking. It isn’t enough to decouple systems across multiple servers: we need to think in terms of multiple geographical locations or, ideally, multiple service providers. Does your software rely on provider-specific features, or can it run across providers? Is your data stored in a fashion that lets it be replicated on (and accessed from) multiple systems, locations and providers? Is your monitoring done independently of the monitored system, so that cascading failures can be detected and dealt with properly? These are all questions that demand an answer when looking at the cloud.
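As a sketch of that last question, monitoring done independently of the monitored system can be as simple as probing a public health endpoint from a machine that does not share the provider’s failure modes. The function name and endpoint below are hypothetical, assuming only the Python standard library:

```python
import urllib.request
import urllib.error


def check_health(url, timeout=5):
    """Probe a service's health endpoint from an outside vantage point.

    Because this check runs on independent infrastructure, a
    provider-wide failure (including a failure of the provider's own
    monitoring service) is still visible to us.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout all count as unhealthy.
        return False


# Hypothetical usage: run this periodically from a box outside the cloud
# being monitored, e.g. check_health("https://example.com/health")
```

The point is not the probe itself but where it runs: a watchdog hosted inside the same availability zone it watches can fail in exactly the same cascade it was meant to detect.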

To learn more about Amazon’s service outage, read my other post here, or go to their status dashboard.
