Optimizing data center cost and reliability should be the goal of every data center owner. Failure to do so could erode competitiveness in the marketplace.
In a video interview, “Maintenance Strategy – Driving Cost Down while Managing Risk,” part of the Schneider Electric series on “7 Winning Strategies for Building a Successful Data Center Business,” John Sheputis, President, InfoMart Data Centers, outlines three tactics to support these mandates.
- Simplicity in design
- Predictive maintenance technology
- Maintenance best practices
Simplicity in design
Studies have shown that 75 percent of all downtime is caused by human error, but the simpler the design, the less chance there is for oversight. Mistakes often occur during operation and maintenance switching activity, emergency break-fix work, or capital projects.
Simple design also reduces construction cost. However, oversimplification could reduce reliability. The trick is to find the right balance between design simplification and design resiliency.
The amount of resiliency and redundancy in your design should be based on the needs and risk tolerance of your clients. Two design simplifications that can achieve the balance are:
- Distributed redundant UPS systems: provide 2N critical power distribution to the server at an N+1 system cost.
- Elimination of raised floor: APC White Paper 19 makes the case that the raised floor in a data center is no longer needed, but most data center designs today still employ raised-floor environments.
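To make the "2N distribution at an N+1 system cost" claim concrete, here is a minimal sketch of the capacity arithmetic. It assumes equal-sized UPS modules and an illustrative 1,000 kW critical load; the function name and the figures are hypothetical and do not come from the interview.

```python
def installed_capacity_kw(load_kw, modules_for_load, scheme):
    """Total UPS capacity to install for a critical load.

    modules_for_load is N, the number of equal-sized modules needed to
    carry the load with no redundancy (an assumption of this sketch).
    """
    module_kw = load_kw / modules_for_load
    if scheme == "2N":
        # Two complete, independent systems.
        return 2 * modules_for_load * module_kw
    if scheme == "N+1":
        # One spare module; distributed redundant designs install this much.
        return (modules_for_load + 1) * module_kw
    raise ValueError(f"unknown scheme: {scheme}")


load_kw = 1000  # illustrative load, not a figure from the article
n = 2
full_2n = installed_capacity_kw(load_kw, n, "2N")
distributed = installed_capacity_kw(load_kw, n, "N+1")
# In normal operation each module may carry at most N/(N+1) of its rating,
# so the survivors can absorb the load of a failed module.
max_utilization = n / (n + 1)
```

Under these assumptions a full 2N system installs 2,000 kW of UPS capacity, while the distributed redundant layout installs 1,500 kW and still gives each server two independent power paths.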
Predictive maintenance technology
Deployment of a UPS battery monitoring system minimizes human interaction with the battery plant and reduces maintenance costs by performing hundreds of measurements in seconds versus the hours it would take a technician to manually take readings with instruments.
This technology also helps to ensure critical power availability by providing 24/7 monitoring of the health of your battery plant and warning you when batteries are failing. Without it, you have only a point-in-time measurement of battery health. However, batteries can deteriorate rapidly as they near end of life, and they may fail between periodic maintenance checks.
Infrared (IR) scans also play a role in a predictive maintenance strategy. They are non-invasive, don’t involve switching activity, and are good predictors of electrical system performance and the need for maintenance.
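As a rough illustration of how continuous monitoring can warn of a failing battery between maintenance visits, the sketch below compares each cell's latest reading against its baseline. It assumes the monitoring system reports per-cell internal resistance; the article does not say which quantities are measured, and the 25 percent rise limit is a placeholder rather than a recommended value.

```python
def flag_failing_cells(baseline_ohms, latest_ohms, rise_limit=0.25):
    """Return IDs of cells whose resistance rose more than rise_limit
    (as a fraction) above baseline. The threshold is hypothetical."""
    flagged = []
    for cell_id, base in baseline_ohms.items():
        latest = latest_ohms.get(cell_id)
        if latest is None:
            continue  # no new reading for this cell
        rise = (latest - base) / base
        if rise > rise_limit:
            flagged.append(cell_id)
    return sorted(flagged)


# Illustrative readings only, in milliohms.
baseline = {"cell-01": 1.0, "cell-02": 1.0, "cell-03": 1.0}
latest = {"cell-01": 1.1, "cell-02": 1.4}
alerts = flag_failing_cells(baseline, latest)
```

Running a check like this on every polling cycle, rather than once per maintenance visit, is what turns a point-in-time reading into a trend.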
Maintenance best practices
Maintenance best practices should be evaluated from two perspectives: the quality of people and processes, and the scope of maintenance performed on equipment.
If your staff does not possess the right skill sets, the risk of an incident due to operator error greatly increases. It’s important to invest in people and establish change management processes.
A long-standing practice in the industry has been to perform maintenance on critical systems and equipment after hours and on weekends so that, if an incident occurs, it has the least intrusive impact.
I think that we should question this practice.
If incidents are most likely to occur due to human error, when is human error most likely to occur? During the middle of the night when the individual is out of sync with a normal sleep pattern? Or during the day when rested and alert? Consideration should be given to performing maintenance during the day, on Saturday or Sunday, instead of in the middle of the night.
Another school of thought says increasing the amount of maintenance activity lowers operational risk. However, if human error is the number one driver of incidents in data centers, then perhaps less maintenance will result in greater reliability, especially in the highly redundant system designs we see today.
Plus, all too often, maintenance is performed at fixed frequencies, which are sometimes excessive. Frequency-based maintenance should be replaced with reliability-centered maintenance.
For example, instead of changing the oil in a chiller or generator based on frequency, take an oil sample and have it analyzed to determine if the oil should be replaced.
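The oil example can be sketched as a simple condition check: replace the oil only when a lab result exceeds its limit. The parameter names and limits below are placeholders for illustration; real limits would come from the equipment manufacturer or the analysis lab.

```python
def exceeded_limits(sample, limits):
    """Return the names of analyzed properties that exceed their limit.

    An empty list means the oil can stay in service under this sketch's
    assumptions. Properties without a defined limit are ignored.
    """
    return sorted(
        name for name, value in sample.items()
        if name in limits and value > limits[name]
    )


# Hypothetical limits and lab results, not real specifications.
limits = {"water_ppm": 100, "acid_number": 0.5, "iron_ppm": 50}
sample = {"water_ppm": 40, "acid_number": 0.7, "iron_ppm": 12}

problems = exceeded_limits(sample, limits)
replace_oil = bool(problems)
```

With these sample values only the acid number is over its limit, so the oil would be scheduled for replacement because of its condition rather than the calendar.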
Every time we perform maintenance and engage in switching activity, we run the risk of causing an incident. Ultimately, it’s about finding the right balance among your infrastructure design, the technologies you apply, and the maintenance best practices you establish.
Simplicity should be our goal, as it can achieve the same reliability for less cost.