As Annie and I pulled out of our driveway to head up to New Hampshire to join her extended family for vacation, I was overcome with stomach pain and a sense of doom. No, it wasn’t because I have an aversion to being stuck in a cabin for a week with my in-laws…I don’t…although I (half) jokingly like to refer to the resort we go to as “Jonestown”, much to the chagrin of my wife. Anyway, the pain and waves of nausea got worse and worse during our four-hour drive north. Annie suggested turning back a few times, but I kept thinking (hoping) it would pass. It never did. After a fun night of fever and indigestion, I got up the next day to intense pain and pressure in my lower right abdomen…yup, appendicitis. So on my second day of vacation I had emergency surgery to remove my massively inflamed and apparently useless organ. The surgeon said it was the worst case of an appendix that hadn’t yet burst he had ever seen. Had I waited any longer, I would’ve been subjected to a host of horrors brought on by the dangerous infection that results from wishful thinking and procrastination. I was fortunate I hadn’t tried to tough it out any longer. Just one week post-surgery, I’m already near 100%. Had it burst, I’d still be very sick.
Naturally, this whole experience makes me think about the parallels to managing a data center. Successfully managing appendicitis – or health in general really – requires continuous monitoring, knowledge & training, patient-doctor cooperation, as well as proactive & decisive action. In these ways, the needs of a data center are very similar. Here is a list of lessons that come to mind…
- Detect problems BEFORE they become problems by using good DCIM and BMS monitoring tools.
- If you’re unsure of the status and overall health of your physical infrastructure systems, hire someone to come in and do an assessment, particularly if the systems are around 10 years old. An assessment can determine current capacities and levels of redundancy, map out dependencies, and calculate PUE (power usage effectiveness), as well as uncover problems that could undermine availability.
- Perform regular preventative maintenance checks on critical systems such as generators, switchgear, breakers, UPSs, batteries, chiller plants, and air handler units; know when components in these systems tend to wear out and need replacement. And make sure you’re managing your spare parts inventory properly.
- Understand from equipment vendors what upgrade or modernization options are available for aging equipment.
- Establish Emergency Operating Procedures (EOPs) for all high-risk failure scenarios such as the loss of a chiller plant, failure of the genset to start up, and so on.
- Create clear, peer-reviewed Methods of Procedure (MOPs) for all maintenance activities. MOPs should not only be used to control all work activities, but should also serve as a tool for operational procedure development & review, risk analysis and communication, work practice standardization, and vendor/contractor supervision.
- Ensure facilities and IT personnel work together and are well trained to safely and quickly respond to events that threaten safety, security, or system availability. Perform regular drills and simulations to ensure training is working and has sunk in.
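Since PUE comes up in the assessment lesson above, here’s a minimal sketch of the calculation for anyone who hasn’t worked with the metric (a hypothetical illustration in Python, not part of any DCIM product): PUE is simply total facility energy divided by the energy consumed by the IT equipment alone, so a lower value means less overhead going to cooling, power conversion, and lighting.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness = total facility energy / IT equipment energy.

    1.0 is the theoretical ideal (every kWh goes to IT load);
    real facilities always run above that.
    """
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh

# Example: 1,500,000 kWh drawn by the whole facility over a year,
# of which 1,000,000 kWh went to the IT equipment itself.
print(round(pue(1_500_000, 1_000_000), 2))  # 1.5
```

A PUE of 1.5 means that for every kWh delivered to the IT gear, another half kWh is consumed by the supporting infrastructure.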
By definition, a mission-critical facility has to be kept in good health. To be unaware of its condition…to allow it to fall into disrepair…to be lackadaisical about maintenance and unprepared to respond decisively to emergencies…is a recipe for letting your data center’s appendix burst. The cost and impact of downtime will quickly make you regret not taking better care of the facility.
To learn in much more detail about the lessons listed above, check out these white papers I authored…
White Paper 217, “How to Prepare and Respond to Data Center Emergencies”
White Paper 196, “Essential Elements of Data Center Facility Operations”
White Paper 214, “Guidance on What to Do with an Older UPS”