Reports and Procedures: Crucial Elements in Reducing Human Error in Data Center Operations


It’s no secret that the single greatest contributor to data center downtime is human error, a fact we’ve mentioned in previous posts. You can’t spend enough money to eliminate human error, but you can mitigate the risk with proper tools, one of which is documentation.

Your data center likely has stacks of documentation from the vendors that supply your various data center systems, and that can certainly be vital information. But many data centers are missing the detailed procedures that their teams need to perform everyday tasks, as well as the reports that help them keep on top of the condition of the data center. In this post, we’ll walk through some of the various types of procedures and reports and why you need them.

1. Procedures

Virtually every task that takes place in the data center should have a written procedure attached to it. The most common types of procedures are:

  • Standard Operating Procedure (SOP): An SOP details a common operating procedure and can be referenced whenever needed. Examples include how to rotate equipment using the Building Management System or how to create a work ticket.
  • Method of Procedure (MOP): A MOP is the detailed, step-by-step procedure that is used when working on or around any piece of equipment that can directly or indirectly affect the critical load. MOPs may reference an SOP that needs to be performed in the course of the procedure.
  • Emergency Operating Procedure (EOP): An EOP is an emergency response procedure for a predicted or previously experienced failure mode. It covers how to get to a safe condition, restore redundancy, and isolate the trouble. EOPs may also cover disaster recovery scenarios.

Procedures help you achieve several objectives and benefits, including:

  • Process formalization: Writing down a procedure forces the writer to examine it in a level of detail and logic that may not otherwise occur, and to cover aspects such as safety, tools, material inventory and a back-out plan.
  • Peer review: Having a written procedure facilitates peer review and other types of oversight, creating an opportunity for process improvement.
  • Proper implementation: A well-constructed procedure document provides a framework for performing activities in the proper sequence, empowers individuals to stop work when events deviate from expectations, and creates a written record of who did what and when.
  • Training: Having written procedures saves time in training material development, helps ensure appropriate topics are thoroughly covered and provides a framework for testing.
  • Record keeping: Completed procedures are an important record of activities performed, which is of value to the technical team and provides an auditable record of compliance with internal and external regulations.
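
For teams that keep procedures in electronic form, the points above about sequence, stopping work, and record keeping can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Step` and `Procedure` classes, their field names, and the rule that an observation must match the expected result exactly are all hypothetical, not part of any standard or product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Step:
    description: str
    expected: str                        # what the technician should observe
    done_by: Optional[str] = None        # filled in at sign-off
    done_at: Optional[datetime] = None   # filled in at sign-off


@dataclass
class Procedure:
    """Hypothetical electronic form of a written procedure such as a MOP."""
    title: str
    steps: List[Step]
    back_out_plan: str
    stopped: bool = False
    log: List[str] = field(default_factory=list)

    def next_step(self) -> Optional[Step]:
        """Return the first step not yet signed off, or None if all are done."""
        for step in self.steps:
            if step.done_at is None:
                return step
        return None

    def sign_off(self, technician: str, observed: str) -> bool:
        """Sign off the next step in sequence.

        If the observation deviates from the expected result, work stops
        and the back-out plan is written to the log instead.
        """
        step = self.next_step()
        if self.stopped or step is None:
            return False
        if observed != step.expected:
            self.stopped = True
            self.log.append(
                f"STOP at '{step.description}': expected {step.expected!r}, "
                f"observed {observed!r}. Back-out plan: {self.back_out_plan}"
            )
            return False
        step.done_by = technician
        step.done_at = datetime.now(timezone.utc)
        self.log.append(f"{step.done_at.isoformat()} {technician}: {step.description}")
        return True
```

Because `sign_off` always acts on the next unsigned step, steps cannot be completed out of order, and the `log` list doubles as the record of who did what and when.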

2. Reports

Among the various types of reports that you need to track the status and condition of the data center are:

  • Site Walkthrough Report: A checklist filled out each shift that verifies the walkthrough was performed and documents equipment status.
  • Shift Report: A shift-by-shift report of all significant activity that occurs in the facility. It forms a continuous narrative that the incoming crew can use to determine everything of consequence that occurred since the last time they were on duty.
  • Deficiency Report: A detailed account of a specific deficiency or problem along with any available metrics, risk assessment, suggested remediation and cost estimates. This report is used to document issues and is useful in justifying any related expenditures to decision makers.
  • Incident Report: A detailed account of a specific incident with a step-by-step timeline that tracks what occurred, who was involved, when notifications were sent, what immediate actions were taken and where changes in status took place.
  • Failure Analysis Report: A root cause analysis, typically following up on an incident report, to determine the underlying cause(s) of the event in order to prevent further occurrences.
  • Lessons Learned Report: A method of documenting important lessons learned in the course of operating or maintaining a facility that allows technicians and operators to benefit from the experience of others.
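
As one possible way to capture the step-by-step timeline an Incident Report calls for, the sketch below stores each entry with a timestamp, a type, a person or system, and a detail, then returns the entries in chronological order. The class names, the four entry kinds, and the field names are assumptions made for illustration; adapt them to whatever your site's report template actually requires.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

# Hypothetical entry kinds, loosely following the report description:
# what occurred, notifications sent, immediate actions, status changes.
KINDS = {"event", "notification", "action", "status_change"}


@dataclass(order=True)
class TimelineEntry:
    at: datetime                          # the only field used for ordering
    kind: str = field(compare=False)
    who: str = field(compare=False)       # person or system involved
    detail: str = field(compare=False)


@dataclass
class IncidentReport:
    summary: str
    entries: List[TimelineEntry] = field(default_factory=list)

    def add(self, at: datetime, kind: str, who: str, detail: str) -> None:
        """Append an entry; entries may be recorded out of order."""
        if kind not in KINDS:
            raise ValueError(f"unknown entry kind: {kind}")
        self.entries.append(TimelineEntry(at, kind, who, detail))

    def timeline(self) -> List[TimelineEntry]:
        """Return all entries sorted by timestamp."""
        return sorted(self.entries)

    def people_involved(self) -> List[str]:
        """Return the distinct people or systems named in the report."""
        return sorted({entry.who for entry in self.entries})
```

Sorting at read time rather than at entry time lets operators add details as they are recalled after the event, while the report still presents a consistent timeline for the follow-up Failure Analysis Report.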

To learn more about data center documentation, read the Schneider Electric white paper number 4, The Importance of Critical Site Documentation and Training.
