How to Assess Failure Rate Claims for Data Center Infrastructure

Mean Time Between Failures (MTBF) is the most common criterion used to compare data center physical infrastructure. MTBF can be calculated with a formula (itself based on a formula for annual failure rate, or AFR), but it’s dangerous to assume that all MTBF numbers mean the same thing.
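
As a rough illustration of how the two formulas fit together, the sketch below annualizes the failures seen in a sample period, divides by the number of units in service to get AFR, and then divides the hours in a year by AFR to get MTBF. It assumes a constant failure rate, continuous operation (8,760 hours per year) and a sample period measured in weeks; the function names and all figures are hypothetical, not any vendor’s actual method.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours, assuming continuous operation


def annual_failure_rate(failures, units_in_service, sample_weeks):
    """Fraction of the population expected to fail per year."""
    # Scale the failures observed in the sample period up to a full year.
    failures_per_year = failures * 52 / sample_weeks
    return failures_per_year / units_in_service


def mtbf_hours(afr):
    """MTBF in hours implied by an annual failure rate."""
    return HOURS_PER_YEAR / afr


# Hypothetical figures: 5 failures in 26 weeks across 2,000 installed units.
afr = annual_failure_rate(failures=5, units_in_service=2000, sample_weeks=26)
print(f"AFR:  {afr:.2%}")
print(f"MTBF: {mtbf_hours(afr):,.0f} hours")
```

With these made-up inputs the AFR is 0.5 percent, which corresponds to an MTBF of 1,752,000 hours. Every input to the calculation is a choice the vendor makes, which is why the variables discussed below matter.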

MTBF can be defined in different ways, depending on what data is chosen for measurement and inclusion in the formula. This can make comparisons between vendor systems difficult, if not meaningless.

However, one method of determining MTBF — the “field data” method — uses actual field failure data, thus making it a more accurate predictor of failure than simulation, especially for products with healthy field populations.

Even so, the process of collecting and analyzing field data – defining a product population, determining sample time ranges for data collection and, perhaps most critically, defining “failure” – can vary widely, making system comparisons challenging.

To make meaningful cross-system comparisons, data center owners must recognize and understand the underlying assumptions of the vendor and other variables. Below are some of the most important variables affecting MTBF and AFR.

Product function, application and boundaries

Are the products you’re comparing identical? If not, how do you account for the differences?

Let’s say two vendors sell UPS systems and supply MTBF values for their products. One vendor’s system includes batteries, while the other vendor considers batteries to be “outside” the boundaries of its system. All well and good, but when each vendor has a failure associated with external batteries, only one will count that data toward its system MTBF number.
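
The effect of a boundary choice can be sketched with a small calculation. The example below takes one hypothetical failure log for a year of operation and computes MTBF twice: once counting battery failures and once treating batteries as outside the system. The component names, the log and the population size are invented for illustration.

```python
HOURS_PER_YEAR = 8760

# Hypothetical one-year failure log for 1,000 identical UPS units,
# recorded by the component that failed.
FAILURE_LOG = ["inverter", "battery", "battery", "fan", "battery", "rectifier"]
UNITS = 1000


def mtbf_hours(failure_log, units, outside_boundary=()):
    """MTBF in hours, ignoring failures of components outside the boundary."""
    counted = [c for c in failure_log if c not in outside_boundary]
    afr = len(counted) / units  # the sample period is one year
    return HOURS_PER_YEAR / afr


batteries_inside = mtbf_hours(FAILURE_LOG, UNITS)
batteries_outside = mtbf_hours(FAILURE_LOG, UNITS, outside_boundary={"battery"})

print(f"Batteries inside the boundary:  {batteries_inside:,.0f} hours")
print(f"Batteries outside the boundary: {batteries_outside:,.0f} hours")
```

In this made-up case, the same hardware with the same field history shows an MTBF of 1,460,000 hours under one boundary and 2,920,000 hours under the other.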

Population size

It’s important to base an MTBF estimate on data from a sufficiently large population of products. When the sample is too small, one or two failures more or fewer swing the result so widely that the MTBF estimate becomes meaningless.
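
One way to see the effect of population size is to put an uncertainty range around the estimate. The sketch below is an assumption-laden illustration, not a method taken from any vendor: it models the failure count as a Poisson variable, computes an exact two-sided 90 percent interval for it by bisection, and converts that interval into an MTBF range for a one-year sample. All function names and figures are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed in log space to avoid overflow."""
    return sum(math.exp(i * math.log(lam) - lam - math.lgamma(i + 1))
               for i in range(k + 1))


def bisect_decreasing(f, target, lo, hi, steps=200):
    """Find x in [lo, hi] where f(x) == target, for f decreasing in x."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def failure_count_interval(k, confidence=0.90):
    """Exact two-sided Poisson interval for the expected number of failures."""
    alpha = 1 - confidence
    ceiling = k + 10 * math.sqrt(k + 1) + 20  # generous upper search limit
    upper = bisect_decreasing(lambda lam: poisson_cdf(k, lam),
                              alpha / 2, 1e-12, ceiling)
    lower = 0.0 if k == 0 else bisect_decreasing(
        lambda lam: poisson_cdf(k - 1, lam), 1 - alpha / 2, 1e-12, ceiling)
    return lower, upper


def mtbf_range(failures, units, confidence=0.90):
    """(low, high) MTBF in hours for a one-year sample of `units` products."""
    lower, upper = failure_count_interval(failures, confidence)
    unit_hours = units * HOURS_PER_YEAR
    high = unit_hours / lower if lower > 0 else math.inf
    return unit_hours / upper, high


# Two hypothetical samples with the same 2 percent annual failure rate.
for failures, units in [(2, 100), (200, 10_000)]:
    low, high = mtbf_range(failures, units)
    print(f"{failures} failures in {units} units: "
          f"{low:,.0f} to {high:,.0f} hours")
```

Both hypothetical samples give the same point estimate of 438,000 hours, but under these assumptions the plausible range for the 100-unit sample spans more than an order of magnitude, while the range for the 10,000-unit sample is far tighter.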

Definition of failure

It’s crucial to the validity of an MTBF comparison that the specific definition of failure for each system is spelled out. What one vendor might consider a product failure, another may define as a “customer misapplication.”

One vendor might consider damages from shipping to be a “failure” of design, while another excludes shipping damages from failure data. A vendor might consider a failure during installation to be a product failure, while another vendor attributes the failure to the technician, and thus not part of the MTBF estimate. A vendor might count a “cascading” failure as one failure, while another will count each system brought down as a separate failure.
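
To make the effect of these definitions concrete, the sketch below applies two counting policies to the same set of hypothetical field reports. One policy counts every report and every system brought down by a cascading event; the other excludes shipping damage, installation errors and customer misapplication, and counts a cascade as a single failure. The cause labels, reports and population size are all invented for illustration.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8760
UNITS = 500  # hypothetical installed population, observed for one year


@dataclass
class FailureReport:
    cause: str             # hypothetical category label
    systems_down: int = 1  # systems taken down by this single event


REPORTS = [
    FailureReport("component"),
    FailureReport("shipping_damage"),
    FailureReport("installation_error"),
    FailureReport("component", systems_down=3),  # a cascading failure
    FailureReport("customer_misapplication"),
]


def count_failures(reports, excluded_causes=frozenset(), count_each_system=False):
    """Number of failures under a given vendor's definition of failure."""
    total = 0
    for report in reports:
        if report.cause in excluded_causes:
            continue
        total += report.systems_down if count_each_system else 1
    return total


def mtbf_hours(failures, units=UNITS):
    """MTBF in hours for a one-year sample."""
    return HOURS_PER_YEAR * units / failures


broad = count_failures(REPORTS, count_each_system=True)
narrow = count_failures(
    REPORTS,
    excluded_causes={"shipping_damage", "installation_error",
                     "customer_misapplication"},
)

print(f"Broad definition:  {broad} failures, {mtbf_hours(broad):,.0f} hours")
print(f"Narrow definition: {narrow} failures, {mtbf_hours(narrow):,.0f} hours")
```

Here the broad definition yields seven failures and the narrow one yields two, so the same field history produces MTBF figures that differ by a factor of 3.5.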

Time between end of sample period and AFR calculation date

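For an MTBF comparison of similar product types, it is important that the delay between the end of the sample period and the calculation date for the annual failure rate (AFR) be similar. Failures that occur late in the sample period can take time to be reported, returned and diagnosed, so a vendor that calculates AFR soon after the period ends may capture fewer of them than a vendor that waits longer.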

Data collection documentation

A vendor should be able to supply data center owners with a clearly defined and documented data collection process for its MTBF estimate. Three process problems in particular, if evident, should raise questions about the vendor’s MTBF calculation:

  • The vendor doesn’t have uniform global tracking and storage systems for failure and repair data.
  • The vendor has poorly defined processes for categorizing returns.
  • The tracking system is primarily manual. Automated processes are generally more accurate.

To learn more about failure-rate comparisons, read the APC by Schneider Electric white paper, Performing Effective MTBF Comparisons for Data Center Infrastructure.
