Power and cooling systems like UPSs, PDUs, air handlers, and condensers are essential for ensuring the critical IT loads of a data center remain operational. But if those physical infrastructure systems aren’t maintained, they may not do the job they were intended to do. Maintenance programs are therefore an essential part of a data center operations plan. In addition to minimizing downtime, these programs help the systems run efficiently and maximize their life expectancy, ultimately reducing the opex of the data center over its lifetime.
But not all maintenance programs are created equal. We believe there are five key elements to consider when selecting a service partner for your maintenance. Take time to investigate these for any potential provider before purchasing a maintenance contract. Our new white paper, Attributes of an Effective Maintenance Program for Data Center Physical Infrastructure, discusses why they’re so important to the outcome.
- Expertise of maintenance personnel
- Quality assurance
- Onsite response time
- Remote monitoring capability
- Comprehensive onsite inspection
1. Expertise of maintenance personnel
Human error is one of the leading causes of downtime in data centers. If maintenance activities are not done with the right expertise, they may unintentionally introduce defects into the systems. Service personnel should be experts on the systems they maintain, receive ongoing training, be qualified and certified to perform the work (for example, through OSHA and EPA programs and any required state or local licensure), and be trained on all safety protocols. They should have access to vendor tools and software that enable better diagnostics, as well as the latest field service bulletins that system vendors issue to alert them to trending issues. It’s also important that they can escalate a problem, when necessary, to a team of sustaining engineers, application engineers, and design technical support staff.
2. Quality assurance
Your service provider should have well-established, well-documented processes and procedures. Check that they have standard operating procedures (SOPs), methods of procedure (MOPs), and change management processes. They should also document their training program and keep records of personnel training so you can validate that the people coming onsite are qualified.
3. Onsite response time
A big consideration when choosing a service provider is how quickly they can get onsite and how quickly they can repair the system if an unexpected problem occurs. Some service contracts include service level agreements (SLAs), such as a guaranteed 4-hour response time for getting someone onsite. Find a provider whose SLA aligns with the availability and criticality needs of your data center. Smaller third-party service providers may not be able to offer as fast a response time, due to their limited resources and lack of global field service coverage. Getting the system back up and running as fast as possible also depends on the technician diagnosing the problem effectively and then making the correct repair with the right spare parts. This ties back to #1 above. Without the right expertise and resources for escalating problems, the service tech may fall back on trial and error, which leads to longer-than-necessary resolution times and can make the situation worse before it gets better.
4. Remote monitoring capability
Preventive maintenance is greatly improved when digital monitoring of your assets is possible. Remote connectivity enables quicker diagnosis of problems, some of which turn out to be fully addressable remotely, avoiding the in-person service visit altogether. With connected systems and large volumes of data, algorithms can be created to predict when downtime is likely to occur. This is enabling the industry to shift from calendar-based preventive maintenance to condition-based preventive maintenance, where you minimize onsite visits and bring a service tech onsite only when the data indicate a failure is approaching.
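To make the idea of condition-based scheduling concrete, here is a minimal sketch in Python. It is an illustration only, not how any vendor's monitoring service works: the `Reading` type, the function names, the example metric (a battery temperature), the limit, and the lead time are all hypothetical, and a simple straight-line extrapolation stands in for the far richer predictive algorithms described above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Reading:
    day: float    # days since monitoring began
    value: float  # hypothetical metric, e.g. battery temperature in deg C


def linear_trend(readings: Sequence[Reading]) -> Tuple[float, float]:
    """Return the least-squares (slope, intercept) of value over time."""
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings")
    mean_x = sum(r.day for r in readings) / n
    mean_y = sum(r.value for r in readings) / n
    sxx = sum((r.day - mean_x) ** 2 for r in readings)
    if sxx == 0:
        raise ValueError("readings must cover more than one point in time")
    sxy = sum((r.day - mean_x) * (r.value - mean_y) for r in readings)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(readings: Sequence[Reading], limit: float) -> Optional[float]:
    """Days from the latest reading until the trend line reaches `limit`.

    Returns 0.0 if the latest reading is already at or above the limit,
    and None if the metric is flat or falling (no crossing projected).
    """
    slope, intercept = linear_trend(readings)
    latest = max(readings, key=lambda r: r.day)
    if latest.value >= limit:
        return 0.0
    if slope <= 0:
        return None
    crossing_day = (limit - intercept) / slope
    return max(crossing_day - latest.day, 0.0)


def needs_visit(readings: Sequence[Reading], limit: float, lead_days: float) -> bool:
    """Condition-based rule: dispatch only if the limit is projected within lead_days."""
    remaining = days_until_limit(readings, limit)
    return remaining is not None and remaining <= lead_days


# Hypothetical data: the metric rises by one unit per day.
rising = [Reading(day=d, value=25.0 + d) for d in range(5)]
stable = [Reading(day=d, value=25.0) for d in range(5)]
```

With these made-up numbers, `needs_visit(rising, limit=40.0, lead_days=14)` flags a visit because the trend reaches the limit within the lead window, while `needs_visit(stable, limit=40.0, lead_days=14)` does not. A calendar-based program would have sent a technician to both systems on the same schedule.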
5. Comprehensive onsite inspection
When a service tech comes onsite to conduct preventive maintenance, a holistic approach to the system is crucial. A non-invasive visual inspection is important for identifying potential problems like dust or debris on a condenser, dirty condenser coils, bloated or leaky batteries, etc. An environmental inspection is also vital because the space surrounding the system can have a direct impact on its function and life. This includes things like the temperature and humidity of the space, outdoor air quality, water quality, etc. For example, northern climates may have to contend with snow and ice on outdoor units, and humidifiers may require additional maintenance if the water is hard and leaves deposits that could clog the system. The third element of a holistic inspection is the electrical/mechanical inspection, which verifies that the systems are performing according to the defined specs. Look for a service provider that focuses on non-invasive approaches to inspecting systems. Infrared thermography, for example, is effective in detecting loose connections.
Evolution of maintenance
As technologies like data analytics and artificial intelligence (AI) become more widely adopted, and data center systems become more “connected” and “smart”, we expect to see a shift from calendar-based maintenance to condition-based maintenance. A service vendor should be your trusted advisor. Make sure you choose one that you feel confident will (1) help ensure your systems operate as expected today and (2) have the foresight and capabilities to evolve as technologies enable more advanced, lower-risk forms of maintenance. Check out the white paper for more on this topic.
Leave a comment below or check out other blog posts from the Data Center Science Center team.