Is Redundancy in the Cloud the New Standard?

There is no shortage of promises from cloud providers:
• lower costs
• more modularity and portability
• instant scalability and capacity
• greater application flexibility
• and higher availability

Higher availability in the public cloud is not predicated solely on traditional data center redundancy in utility power, emergency power and cooling. Cloud providers also claim the availability benefits of virtualization: virtual servers and applications can be moved from an unhealthy physical server to a healthy one within the same cloud. A more grandiose claim is "application portability," meaning applications can be moved from cloud to cloud as needed, should one cloud become overloaded with activity or suffer an outage. Computer systems sometimes fail. It's a fact. So added redundancy could be the "holy grail" of cloud computing availability.
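To make that first claim concrete, here is a minimal sketch of the kind of health-check-and-evacuate loop a hypervisor management layer might run. Everything in it is an illustrative assumption: `Host`, `probe_health` and `live_migrate` are hypothetical stand-ins, not any provider's actual API.

```python
import random
import time
from dataclasses import dataclass, field

@dataclass
class Host:
    """A physical server carrying virtual machines (illustrative model)."""
    name: str
    vms: list = field(default_factory=list)

def probe_health(host: Host) -> bool:
    """Stand-in for real telemetry (heartbeats, fan speeds, ECC error
    counts); here a host simply fails its probe at random now and then."""
    return random.random() > 0.1

def pick_target(hosts, failing):
    """Choose the least-loaded healthy host other than the failing one."""
    healthy = [h for h in hosts if h is not failing and probe_health(h)]
    return min(healthy, key=lambda h: len(h.vms)) if healthy else None

def live_migrate(vm, src, dst):
    """Placeholder for real live migration (pre-copy memory, switch over)."""
    src.vms.remove(vm)
    dst.vms.append(vm)
    print(f"migrated {vm}: {src.name} -> {dst.name}")

def monitor(hosts, cycles=5):
    """Evacuate every VM off any host that fails its health probe."""
    for _ in range(cycles):
        for host in hosts:
            if host.vms and not probe_health(host):
                target = pick_target(hosts, failing=host)
                if target is None:
                    print(f"no healthy target for {host.name}; VMs stranded")
                    continue
                for vm in list(host.vms):
                    live_migrate(vm, host, target)
        time.sleep(0.1)

if __name__ == "__main__":
    monitor([Host("rack1-a", ["web-1", "db-1"]),
             Host("rack1-b", ["web-2"]),
             Host("rack2-a", [])])
```

Note what the sketch takes for granted: a healthy target host must still exist, and the failing host must stay up long enough to hand off its state. Both assumptions matter in the story below.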

But, is application portability real? And here today? I would say not yet.

There are no formal standards describing cloud interoperability, and specifically none covering an application's migration requirements. It is also quite an undertaking to operate servers and networks in redundant, synchronized, geographically isolated data centers.

Most people in the know would say Amazon's data centers are some of the most technically advanced. But, Amazon Web Services recently suffered a long outage in one of its public cloud's availability zones. The outage started on the evening of 14 June and lasted until the next morning, US Pacific time. It began with a cable fault in the power distribution system of the electric utility serving the data center that hosts the cloud's US-East-1 region in Northern Virginia.

The entire facility was switched over to backup generator power, but one of the generators overheated and powered off because of a defective cooling fan. The virtual-machine instances and virtual-storage volumes powered by this generator were transferred to a secondary backup power system, provided by a separate power-distribution circuit with its own backup generator capacity.

But, one of the breakers on this backup circuit was configured incorrectly and opened as soon as the load was transferred onto it: the breaker had been set to trip at too low a power threshold.

This was an expertly designed emergency backup architecture that suffered from a lack of testing and, most likely, from human error in configuring the breakers.
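A breaker tripping at too low a threshold is precisely the kind of defect a routine configuration audit, or a full-load test, would surface before an emergency does. Below is a minimal sketch of such an audit, assuming we can read each breaker's configured trip point and know the worst-case load a transfer would place on its circuit. The names and figures are hypothetical, not drawn from Amazon's incident report.

```python
from dataclasses import dataclass

@dataclass
class Breaker:
    """One breaker's configured trip point (hypothetical model)."""
    circuit: str
    trip_kw: float  # power level (kW) at which the breaker opens

def audit_breakers(breakers, worst_case_kw, margin=1.25):
    """Return breakers whose trip point sits below the worst-case
    transfer load times a safety margin, i.e. breakers that would
    open the moment the load is switched onto their circuit."""
    required_kw = worst_case_kw * margin
    return [b for b in breakers if b.trip_kw < required_kw]

if __name__ == "__main__":
    # Hypothetical data hall: the full load may land on either circuit.
    fleet = [
        Breaker("primary-generator-feed", trip_kw=2500.0),
        Breaker("secondary-generator-feed", trip_kw=900.0),  # set too low
    ]
    for bad in audit_breakers(fleet, worst_case_kw=1800.0):
        print(f"MISCONFIGURED: {bad.circuit} trips at {bad.trip_kw:.0f} kW, "
              f"below the {1800.0 * 1.25:.0f} kW required")
```

The check itself is trivial; the hard part, as the incident shows, is institutionalizing the habit of running it, and of periodically proving the whole chain under real load.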

Even if application portability had been mature and available, it would not have been effective in this case. Moving an application, within a cloud or between clouds, presupposes a running source to move it from; an abrupt power loss leaves nothing to migrate.

The moral of the story is that all data centers, cloud and enterprise alike, should be built on proven emergency power systems, preferably standardized, repeatable ones. Why standardized and repeatable? For starters, performance can be optimized through continuous improvement, driving up efficiency and lowering operating costs. More importantly, the bugs will be worked out, and familiarity with the systems will assuredly speed up problem resolution and minimize downtime.
