As we mentioned in a previous post, human error, rather than design flaws or system failures, is the cause of most data center downtime. Taking steps to avoid human error is therefore crucial to avoiding downtime. The best way to accomplish that goal is to implement an Operations and Maintenance (O&M) program that is purposefully designed for the data center environment and tailored to your business requirements and specific facility design.
Drawing on its long data center experience, Schneider Electric has come up with a white paper that lists a number of mistakes to avoid when deploying a data center O&M program. In our previous post, we looked at some of the people-related mistakes and, as promised, this time we’ll cover some of the mistakes that are process and procedure related.
The first is a failure to exercise the procedures that are to be followed in the case of an emergency. As the Schneider Electric white paper says:
Military personnel, firemen, and EMTs repeat drills over and over until the right responses become “second nature,” even in the most extreme conditions. The same should apply to data center technicians, who operate in an environment where every second counts in emergency situations – for both safety and financial reasons. Actual emergencies are not the time to reference an unfamiliar procedure that may be misapplied in the heat of the event.
Companies need to set aside time to conduct drills to practice emergency response procedures, and to do so routinely. Such drills should be part of a broad curriculum that also includes training on operating and maintenance procedures, system theory and all other aspects of the data center operation.
A second mistake is a failure to document all of the processes and procedures needed for proper site operations and maintenance, documentation that is a necessity for training and process improvement. This includes not only the emergency procedures, but all operating, maintenance and even administrative activities, such as preventative and corrective maintenance, facility walkthroughs, and shift turnover communications. In addition, accurate facility drawings, maintenance records, parts inventories and escalation protocols are needed both to prevent failures and to respond to them. Collectively, these documents form the basis for effective operations and create a foundation for promoting proactive, continuous improvement.
Many failures result from errors made during installation and maintenance activities, so a third mistake is not having a robust change control process in place. This process should follow accepted guidelines for change and configuration management, comprehensively addressing topics such as:
- Operating procedure creation, review and approval
- Risk analysis and hazard mitigation (both safety and operational)
- Proper communications before, during and after the change procedure
- Exception handling, including back-out procedures
- Vendor supervision and management
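To make the checklist above a little more concrete, here is a minimal sketch of how a team might confirm that a change request covers each of those topics before it goes for approval. It is only an illustration: the `ChangeRequest` class, the section names and the sample request are hypothetical, not something taken from the Schneider Electric white paper or from any particular change management tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical section names, one per topic in the list above.
REQUIRED_SECTIONS = (
    "procedure_review_and_approval",
    "risk_analysis",
    "communications_plan",
    "back_out_procedure",
    "vendor_supervision",
)


@dataclass
class ChangeRequest:
    """Illustrative record for one planned change to the facility."""
    title: str
    # Maps a section name to the text written for that section.
    sections: Dict[str, str] = field(default_factory=dict)

    def missing_sections(self) -> List[str]:
        """Return the required sections that are absent or left blank."""
        return [
            name for name in REQUIRED_SECTIONS
            if not self.sections.get(name, "").strip()
        ]

    def ready_for_approval(self) -> bool:
        """A request is ready only when every required section is filled in."""
        return not self.missing_sections()


# Example: a request that so far covers only the risk analysis.
request = ChangeRequest(
    title="Replace UPS battery string",
    sections={"risk_analysis": "Load transferred to bypass during the work."},
)
print(request.missing_sections())
print(request.ready_for_approval())
```

A check like this does not replace human review. It only flags a request that skips a topic entirely, so reviewers can spend their time on the quality of the content rather than on spotting gaps.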
Finally, many companies put processes in place and then leave it at that. They err in not regularly revisiting their processes to improve on them, thereby missing the opportunity to continually make their operation more efficient, reliable and cost-effective. It’s crucial to have a plan for continuous process improvement. One useful technique is to include a feedback section in every operating procedure for recording variances and suggestions for improvement, which can then be incorporated into the next version of the procedure.
For more tips on developing an effective operations program, check out the Schneider Electric white paper, “Top 10 Mistakes in Data Center Operations: Operating Efficient and Effective Data Centers.”