Tuesday, April 26, 2011

Amazon EC2 Outage and Cloud Strategy

Last Friday, Amazon experienced a partial outage of its cloud infrastructure. Here are the initial and closing updates from Amazon:


Event Issue
The problem started with a "networking event" that led to problems with how data is mirrored. Initial update from Amazon:

"We'd like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS [Elastic Block Store] volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them."

Closing update from Amazon:

As we posted last night, EBS (Elastic Block Store) is now operating normally for all APIs and recovered EBS volumes. The vast majority of affected volumes have now been recovered. We’re in the process of contacting a limited number of customers who have EBS volumes that have not yet recovered and will continue to work hard on restoring these remaining volumes…
We are digging deeply into the root causes of this event and will post a detailed post mortem.

One of the unfortunate realities of infrastructure and operations is that the goal will always be 100% uptime, but it cannot be achieved. SLAs for infrastructure and operations are very unlikely to promise 100%. The strategic question will always be what SLAs can be afforded, what impact the target SLAs have on business agility, and what can be improved from a people, process and technology perspective to achieve the business goals and minimize cost.
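To make the "what SLAs can be afforded" question concrete, it helps to translate an availability percentage into the downtime it actually permits. The following is a minimal sketch in Python; the function name and the list of targets are my own illustration, not figures taken from any Amazon SLA.

```python
# Illustrative only: converts an availability target into a yearly downtime budget.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct: float) -> float:
    """Return the hours of downtime per year permitted by an availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    # Hypothetical targets chosen for comparison.
    for target in (99.0, 99.9, 99.95, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

A 99.9% target, for example, still leaves roughly 8.76 hours of permitted downtime per year, which is why the business impact of each additional "nine" has to be weighed against its cost.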

Because the success of outsourced cloud infrastructure and operations is clearly tied to meeting performance, availability and security objectives, I believe that public cloud will outperform internal infrastructure over time. This does not lessen the requirement for internal roles in architecture, end-to-end management of performance, availability and security, and vendor management. These roles will increase in importance within organizations.

The current Amazon issue re-emphasizes that a cloud strategy needs to include:

  • Clear and continuous risk management program for IT
  • Enterprise change, incident, problem, release and configuration management process re-engineering
  • End-to-end SLA and systems management
  • Server provisioning process and technology
  • Patching process
  • Server configuration baselining and auditing
  • Repurposing of servers
  • Disaster recovery planning and testing
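The end-to-end SLA item in the list above can be illustrated with some simple arithmetic. This is a rough sketch that assumes component failures are independent, which is a simplifying assumption: the Amazon event shows that shared dependencies, such as a common control plane, can break it. The function names and the sample availability figures are hypothetical.

```python
from math import prod
from typing import Iterable


def serial_availability(components: Iterable[float]) -> float:
    """Availability of a chain in which every component must be up.

    Each value is a fraction between 0 and 1. Assumes independent failures.
    """
    return prod(components)


def redundant_availability(single: float, copies: int) -> float:
    """Availability of N redundant copies where any one copy is sufficient.

    Assumes the copies fail independently of each other.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Hypothetical stack: network, compute, storage, each at 99.9%.
    chain = serial_availability([0.999, 0.999, 0.999])
    print(f"Three 99.9% components in series: {chain:.6f}")

    # Hypothetical: the same service deployed in two independent zones at 99% each.
    pair = redundant_availability(0.99, 2)
    print(f"Two independent 99% deployments: {pair:.6f}")
```

The point of the sketch is that the end-to-end number is always lower than the weakest individual SLA in the chain, and that redundancy only helps to the extent the redundant copies really do fail independently.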
