Data centers and nuclear submarines may seem like completely different beasts, but the similarities between a mission-critical data center and a mission-critical nuclear propulsion plant are many and striking. Human error continues to be cited as a leading cause of data center downtime, and the goal of eradicating this blight from the data center industry can be advanced by studying the US Nuclear Navy. Let me draw some comparisons between nuclear subs and data centers and show how Nuclear Navy methodologies can be applied to modern-day data centers to reduce human error.
Nuclear Subs are doing something right:
Stuffing a nuclear reactor plant, a steam plant, an electrical plant, cooling plants, auxiliary systems, and more (that’s just the back half of the sub) into a submerged vessel is an extremely difficult task and leads to very complex systems. The fact that human error has been successfully minimized in this environment is a truly phenomenal accomplishment. The processes and policies for operating a nuclear sub are understandably tightly regulated at many levels. In addition, there are multiple levels of system redundancy and interlocks, in many cases with a back-up system to the back-up system. But no matter how regulated a process is or how automated a system is, humans are the master overseers, and eventually something always seems to go wrong. Therefore, an intense focus is placed on the people who operate these submarines. It starts with a very competitive selection process to ensure you have the personnel you need. Once selected, candidates complete approximately 15 months of training before arriving on board. Once on board, the ongoing training and qualification process continues indefinitely with that very same intensity. The skills used on these subs are so highly regarded that Lee Technologies, a data center services company, seeks out ex-“Navy Nukes” for hire. So what lessons can be applied to the world of mission-critical data centers?
Lessons to apply to data centers:
First, there is no substitute for hiring the right people to run your submarine, and the same is true for your data center. As data center owners, we need to hire the best and provide the leadership that instills the desire to perform at one’s top potential. Secondly, non-deviation procedural compliance is critical. Implementing Standard Operating Procedures (SOPs) for everyday operations and Methods of Procedure (MOPs) for maintenance is very doable and should be mandatory. Having an Emergency Operating Procedure (EOP) that is easy to memorize and readily available can prove priceless if the need to use it ever arises. Imagine knowing exactly what to do to stabilize the data center should a generator not start or a breaker unexpectedly trip. Other non-deviation initiatives might include the use of status boards, change control processes, and methodical documentation of all maintenance. Sharing knowledge and lessons learned from actual incidents can prevent much future downtime, and the US Navy has developed a formal program around this sharing. There is no good reason the data center industry can’t do the same. Lastly, my Navy experience taught me that continuing education via on-the-job training, surprise performance drills, and additional formal schooling is imperative to the long-term minimization of human error and to fostering continued process improvements. You never stop learning. Admittedly, commercial reality dictates that you can’t have individuals away at training for long periods of time without contributing to the bottom line. Other real-world restrictions, such as HR policies and the inability to make people “follow orders,” may also pose a challenge. Still, there is much that can and should be done. And if this all sounds a bit daunting, it’s important to understand that data center operations can be outsourced. I would simply recommend that you choose a provider that adheres to these philosophies.
Can’t live with ’em, can’t live without ’em
The goal is to remove the “error” in “human error,” not the “human.” Although much needs to be put in place to take out human error, having the human in the process is critical to success. The simple fact is that no machine can reason like a human, and hard decisions often have to be made. Challenge yourself to find what might apply in your specific situation. The goal is to apply risk mitigation wherever possible. Good luck in the fight against this blight, and let me know what has been working for you by leaving a comment below. I’ll leave you with this quote describing a Nuclear Propulsion Plant Operator:
“Tough Minded, Skeptical, Sometimes Even Cantankerous, But Always Technically Competent, ALWAYS THINKING – What If? A SPECIAL BREED. He Makes the Difference Between Safe and Effective Operation and Unacceptable Risk.”
About Domenic Alcaro:
Domenic Alcaro is Vice President of Enterprise Sales for Schneider Electric’s Data Center Solutions team. Prior to his current role, Domenic held technical, sales, and management roles during his more than 14 years at APC, including Customer Service Team, Inside Sales Manager, District Manager, EAM, Business Development, Director of the Availability Science Center, and Enterprise Regional Manager. In his most recent role as Director of the NYC and Philadelphia Metro Region, he was responsible for helping large corporations improve their enterprise IT infrastructure availability. Domenic is a frequent speaker at industry conferences on topics such as business continuity, physical infrastructure of information technology, and data center design. Domenic holds a Bachelor of Science degree with honors in electrical engineering from the University of Rochester and is a member of the Tau Beta Pi Engineering Honor Society.