By: Eric Gallant
Thirty-four years ago, the Big Apple suffered a blackout of historic proportions. The blackout, combined with the stresses of a protracted economic depression and a brutal heat wave, tipped sections of New York City into chaos. Rioting, arson, vandalism and looting swept through sections of the city. By the time power was restored late the following day, 1,616 stores had been damaged, firefighters had responded to 1,037 fires and, in the largest mass arrest in the city’s history, 3,776 people had been arrested.
Following the restoration of power and civil order, the public and the politicians of NYC turned their anger on the city’s electrical power provider, Consolidated Edison (Con Ed). Mayor Beame repeatedly and roundly condemned the utility and accused Con Ed of “gross negligence”.
An analysis of the causes of the 1977 blackout reveals that the tragedy was caused by a combination of weather events, lapses in preventive maintenance, a lack of operator training and a lack of adequate operating procedures. A close look at key incidents in the blackout’s timeline can provide some valuable lessons to the operators of data centers and mission critical facilities of 2010.
- At 8:37 p.m. EDT on July 13, 1977, lightning struck a substation.
Lightning strikes are obviously very common. As a result, most facilities are protected from lightning damage using a combination of lightning rods (air terminals), grounding conductors, cable connectors and ground terminals. NFPA-780 and UL 96A are the technical standards that cover the design and installation of these systems. NFPA-780 is regularly updated, with the most recent update in 2008. If you have not examined your lightning protection system in a while, an inspection should be conducted to ensure that the system complies with the most recent requirements. In addition, as with any system that is exposed to the elements, your lightning protection system is vulnerable to physical damage and corrosion. Regular system inspections and testing are recommended to ensure that your protection system is intact when the next lightning storm passes your way.
- 8:37 p.m. EDT; lightning caused two breakers to trip. A loose locking nut and an unperformed equipment upgrade prevented the breakers from reclosing.
Maintenance of your electrical distribution system is vital to the operational continuity of your data center. Even the most robust and fault tolerant data center design can be rendered inoperable if proper maintenance practices are not followed. One of the most informative electrical system preventive maintenance procedures is the Infrared (IR) Scan or Thermography. This type of scan can detect loose electrical connections, faulty components and overloaded circuits by their heat signature. Regular IR Scans can prevent not only equipment malfunction but potentially catastrophic electrical fires.
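As a rough illustration of how IR scan results might be screened, the sketch below compares each phase of a connection against the coolest phase at the same location and flags any that run noticeably hotter. The function name, the sample readings and the 10 °C threshold are all hypothetical placeholders, not values taken from any standard or from the 1977 event; a thermographer's own acceptance criteria should be used in practice.

```python
def flag_hot_connections(readings_c, max_delta_c=10.0):
    """Return (location, phase, delta) for phases hotter than the coolest phase.

    readings_c maps a location name to {phase: temperature in deg C}.
    max_delta_c is a placeholder threshold, not a value from any standard.
    """
    flagged = []
    for location, phases in readings_c.items():
        coolest = min(phases.values())
        for phase, temp in phases.items():
            delta = temp - coolest
            if delta > max_delta_c:
                flagged.append((location, phase, round(delta, 1)))
    return flagged


# Hypothetical scan data for two breakers on the same panel.
scan = {
    "Panel A, breaker 3": {"A": 41.0, "B": 42.5, "C": 58.0},
    "Panel A, breaker 4": {"A": 39.5, "B": 40.0, "C": 40.5},
}
print(flag_hot_connections(scan))
```

Comparing phases at the same location, rather than checking absolute temperatures, keeps the check meaningful when ambient conditions or load levels differ between scans.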
Data center maintenance best practices dictate that the OEM be used for critical system maintenance whenever practical. One of the reasons that this is a best practice is that the manufacturer’s authorized field personnel are the first to know when an equipment upgrade, recall or technical advisory is issued. Cutting costs by utilizing third-party maintenance providers can separate you from the vital communication loop with the manufacturer.
- 8:45 p.m. EDT; Con Ed attempted to start generators remotely. However, no one was manning the station, and the remote start failed.
In the event of an emergency, there is no substitute for having properly trained personnel on site to take immediate and appropriate action. As The Uptime Institute points out in its recent white paper, Data Center Site Infrastructure Tier Standard: Operational Sustainability, “The right number of qualified people on appropriate shifts is critical to meeting long term performance objectives.” This paper goes on to explain precisely the number and qualification level of the personnel required based on facility criticality and infrastructure profile.
Additionally, regular and rigorous testing of backup systems is crucial to their reliable operation. The time to discover that your emergency power generation system will not start is definitely not during an actual emergency.
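One simple way to keep backup system testing from slipping is to track the date of each system's last successful test and flag anything that is overdue. The following is a minimal sketch of that idea; the system names, dates and the 30-day interval are assumptions chosen for illustration, and the right interval depends on your equipment and the manufacturer's guidance.

```python
from datetime import date, timedelta


def overdue_tests(last_tested, today, max_interval=timedelta(days=30)):
    """Return names of backup systems whose last test is older than max_interval.

    The 30-day default is a placeholder, not a recommended interval.
    """
    return sorted(
        name
        for name, tested_on in last_tested.items()
        if today - tested_on > max_interval
    )


# Hypothetical test log: system name -> date of last successful test.
last_tested = {
    "Generator 1": date(2010, 6, 1),
    "Generator 2": date(2010, 7, 1),
    "UPS battery string": date(2010, 4, 15),
}
print(overdue_tests(last_tested, today=date(2010, 7, 13)))
```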
- 9:14 p.m. EDT, New York Power Pool operators called for Con Ed operators to “shed load.” In response, Con Ed operators initiated two time-consuming system-wide voltage reductions. The Power Pool operators had in mind opening feeders to drop about 1500 MW of load immediately, not reducing voltage to trim load by a few hundred MW.
In a mission critical environment, no procedure should be started without a detailed, step-by-step, tested Standard Operating Procedure (SOP) or Method of Procedure (MOP). These documents should be developed for all preventive maintenance activities and all procedures to change electrical and mechanical system configurations. In addition, all involved parties should be required to review each SOP/MOP and certify that they will comply with the procedure. These documents greatly reduce the risk of human error, communication breakdowns and preventable failures. An SOP for load shedding that Con Ed and NY Power Pool operators had been trained on and agreed to might have averted the complete collapse of NYC’s power grid that followed. Similarly, the correct documented procedures can mean the difference between continued operation and a business-killing outage in your data center.
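The review-and-certify requirement described above can be enforced in software as well as on paper. The sketch below models a procedure that cannot be marked ready until every required party has signed off. The `Procedure` class, its field names and the example steps are hypothetical and are not drawn from any Con Ed or Power Pool document.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Procedure:
    """A hypothetical MOP record: ordered steps plus required reviewers."""

    title: str
    steps: list[str]
    required_reviewers: set[str]
    signed_off: set[str] = field(default_factory=set)

    def sign_off(self, reviewer: str) -> None:
        """Record that a required reviewer has read and agreed to the steps."""
        if reviewer not in self.required_reviewers:
            raise ValueError(f"{reviewer} is not a required reviewer")
        self.signed_off.add(reviewer)

    def missing_sign_offs(self) -> set[str]:
        """Return the reviewers who have not yet certified the procedure."""
        return self.required_reviewers - self.signed_off

    def ready_to_execute(self) -> bool:
        """A procedure is ready only if it has steps and no missing sign-offs."""
        return bool(self.steps) and not self.missing_sign_offs()


# Illustrative load-shedding procedure with two parties who must agree.
mop = Procedure(
    title="Shed load on request",
    steps=[
        "Confirm the requested load reduction in MW with the requester",
        "Open the designated feeders",
        "Report the load dropped back to the requester",
    ],
    required_reviewers={"utility operator", "power pool operator"},
)
mop.sign_off("utility operator")
print(mop.ready_to_execute(), mop.missing_sign_offs())
```

Writing the first step as an explicit confirmation of what is being requested targets the kind of ambiguity that surrounded the phrase “shed load” in 1977.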
- 9:36 p.m. EDT, the entire Con Edison power system shut down. By 10:26 p.m. EDT, operators had started a restoration procedure. Power was not restored until late the following day.
In 1977, as today, Con Ed had the least interrupted electrical service of all utilities in the nation. Nevertheless, it failed catastrophically in 1977 and then failed again in 2003. The lesson here is twofold:
a) The utility power to your data center can (and probably will) fail at some point. That power outage could last for a day or more. Policies, personnel training programs, procedures and documents that minimize the effect of these outages are vital to the operational sustainability of your facility.
b) There is room for improvement in every organization. Despite Con Ed’s track record for reliable performance, nature revealed the gaps in their maintenance, operation and training procedures. Reliable operation of a mission critical facility depends on constant improvement of policies and procedures.
The 1977 NYC blackout resulted in widespread changes to the operational procedures and maintenance practices used by Con Ed and other electric utilities. What is the impact on your business if your data center fails because of a lapse in critical system maintenance? What is the consequence of missing an OEM upgrade? What is the impact on your business if your data center staff fails to take correct action in the event of a system failure? I suspect that the impact on your business would be catastrophic and analogous to the 1977 NYC blackout.
Interested in learning more about natural threats to your data center? Download Eric’s free white paper, The Threat of Space Weather to Data Centers