When more than 600 million people lost power in India during the largest blackout in history, one question resonated for those of us who work in mission-critical systems: how did this happen? Many theories about the cause of the immense blackout arose, from the plausible (excessive demand on the grid) to the improbable (large solar flares triggering the failure). The confusion and blame game that ensued only further delayed fully restoring power. A similar scenario played out just weeks earlier when the Washington, D.C. area experienced a massive blackout, resulting in extended downtime for several data centers in one of the busiest data center hubs in the country.
When a power failure like the ones in India and Washington, D.C. occurs and mission-critical systems are in jeopardy, everyone rushes to point out who is at fault. However, deflecting blame only takes away from the issue at hand: finding the root cause, whether internal or external to the system. Only then can corrective measures be taken to ensure it doesn't happen again. Uncovering the underlying factors is far more straightforward if the infrastructure and event reconstruction technology are already in place.
These recent power failures are poignant examples of the need for event reconstruction technology. When it is in place ahead of time, you can unearth the event that triggered the cascading series of failures that eventually brought the system down. As may have been the case in India, a single event can push the system into a state that elicits multiple follow-on events. These secondary events add to the confusion because they are often not the root cause of the resulting system state, only consequences of the system being in the previous state. Further, external occurrences while a system is in an unstable state can cause additional complications.
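To illustrate that root-cause-versus-consequence distinction, here is a minimal sketch in Python, using entirely hypothetical event names, of how a reconstructed cascade can be traced backward from the observed outage to the event that started it. It is a conceptual sketch, not a description of any particular product.

```python
# Hypothetical cascade: each event optionally records which earlier event caused it.
cascade = {
    "breaker_trip":      None,               # triggering event (the root cause)
    "bus_undervoltage":  "breaker_trip",     # consequence of the trip
    "ups_on_battery":    "bus_undervoltage",
    "battery_depleted":  "ups_on_battery",
    "load_dropped":      "battery_depleted", # the observed outage
}

def trace_root_cause(event: str, caused_by: dict) -> list:
    """Walk the causal chain backward from an observed failure to the
    first event that has no recorded cause."""
    chain = [event]
    while caused_by.get(event) is not None:
        event = caused_by[event]
        chain.append(event)
    return chain

# Prints: load_dropped <- battery_depleted <- ups_on_battery <- bus_undervoltage <- breaker_trip
print(" <- ".join(trace_root_cause("load_dropped", cascade)))
```

Without the causal links that reconstruction provides, every event in that chain looks like a plausible culprit; with them, only the first one does.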
In our world of mission-critical data centers, we cannot escape a basic conclusion: even the best-designed system is in a constantly changing state, and ultimately the net result of these changes is a sequence of events with enough impact to cause an incident. Many of us have seen first-hand the need for event reconstruction if system reliability is to be optimized in the aftermath of an incident. But where should you start collecting the requisite data?
Unfortunately, there is no clear-cut laundry list of data point locations, but the first step is to identify the types of potential events to be recorded. It is crucial that the tools used to gather event data are able to time-synchronize with millisecond accuracy. Once the measurements are time-ordered, overarching familiarity with the system is required to make sense of the infrastructure changes and to arrive at a root cause for the incident.
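To make the time-ordering step concrete, here is a minimal Python sketch that merges timestamped records from several hypothetical monitoring sources (a switchgear log, a UPS log, and a PDU log; the device names, fields, and sample events are illustrative, not from the article) into one time-ordered timeline. It assumes every device's clock is already synchronized to millisecond accuracy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class EventRecord:
    timestamp: datetime   # must come from time-synchronized devices
    source: str           # hypothetical device names, e.g. "UPS-A"
    description: str

def build_timeline(*sources: List[EventRecord]) -> List[EventRecord]:
    """Merge event logs from multiple monitoring devices into a single
    time-ordered sequence for event reconstruction."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: e.timestamp)

# Illustrative data: a utility disturbance followed by downstream consequences.
t0 = datetime(2012, 7, 30, 13, 0, 0)
switchgear_log = [
    EventRecord(t0, "Switchgear-1", "Utility voltage sag detected"),
    EventRecord(t0 + timedelta(milliseconds=450), "Switchgear-1", "Generator start signal issued"),
]
ups_log = [
    EventRecord(t0 + timedelta(milliseconds=120), "UPS-A", "Transferred to battery"),
]
pdu_log = [
    EventRecord(t0 + timedelta(milliseconds=300), "PDU-3", "Branch circuit undervoltage alarm"),
]

for event in build_timeline(switchgear_log, ups_log, pdu_log):
    print(f"{event.timestamp.isoformat(timespec='milliseconds')}  {event.source:13s}  {event.description}")
```

Once the records sit on a single millisecond-resolution timeline, the earliest entry (the voltage sag, in this illustration) stands out as the candidate trigger, and the later entries read as consequences rather than causes.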
Reliability principles indicate that no power system can operate 100% of the time indefinitely. The key to incident recovery lies in the ability to quickly understand what went wrong and implement corrective actions. It is difficult to measure the ROI of event reconstruction technology, but when it is needed, it is invaluable. Perhaps by taking a lesson from the recent blackouts and employing event reconstruction technology, we can avoid epic power failures in the future.
If you would like detailed information on event reconstruction technology, please refer to the application theory, Power System Event Reconstruction Technologies for Modern Data Centers, at sedatacenters.com.