Understanding Mean Time between Failure in the Data Center – Part 1


Lately, I’ve been seeing data center RFPs come through with requests for the MTBF (Mean Time Between Failure) of a product.  Nothing much else is specified other than “what is the MTBF?”  This is very problematic because MTBF calculations run the gamut from statistically impractical to just plain black magic.  Since so much emphasis is placed on MTBF at times, it is critical to understand the true meaning of this value.

MTBF is typically expressed in hours and is defined as the average number of hours the product will operate in service before experiencing a “failure.”  I place “failure” in quotes because the definition of what constitutes an actual failure is critical.  As an example, in the data center UPS world, some manufacturers may define anything other than the output inverter being on line and supplying the load as a failure.  Other manufacturers may accept going to bypass as satisfactory operation and not a true failure.  I would hope that all manufacturers consider a UPS-induced load drop a significant failure.  Regardless, right at the outset, there is the possibility for an “apples to oranges” scenario.

There are numerous ways to predict MTBF, including procedures and calculations based on military standards.  Linking the actual measured field failure rate to MTBF is another popular methodology.  There are more I could list, but the key point here is that each of these methods has its own pitfalls to avoid.  Therefore, it is easy to see how, depending on the definition of failure, the method chosen, the assumptions made, and the extent to which pitfalls are avoided, very different MTBFs can be calculated for the exact same product.  The only completely accurate way to calculate MTBF for a product or system is to wait until each and every unit ever placed into operation has failed and then do the calculations.  This is obviously impractical, so we are left with estimating MTBF to the best of our ability.  This can lead to numbers that are clearly nonsensical but that may have some value in a relative sense.  That is, the reliability of two products or systems can be compared IF calculated in EXACTLY the same way and IF ALL the same assumptions are made.
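To make the “apples to oranges” risk concrete, here is a minimal sketch of the field-failure-rate approach.  The conversion itself (hours per year divided by annual failure rate, assuming a constant failure rate and 24×7 operation) is standard; the two UPS products, their AFR values, and the failure definitions attached to them are entirely hypothetical, invented only to illustrate how the definition of “failure” swings the result.

```python
# Hypothetical illustration: converting an observed Annual Failure Rate (AFR)
# into an MTBF estimate, assuming a constant failure rate and 24x7 operation.
HOURS_PER_YEAR = 8760  # 365 days x 24 hours

def mtbf_from_afr(afr: float) -> float:
    """Estimate MTBF in hours from a fractional annual failure rate.

    afr: fraction of the installed base that fails per year (0.02 = 2%).
    With a constant failure rate, MTBF = operating hours per year / AFR.
    """
    if not 0 < afr <= 1:
        raise ValueError("AFR must be a fraction between 0 and 1")
    return HOURS_PER_YEAR / afr

# Two made-up readings for the SAME hypothetical UPS fleet, differing only
# in the failure definition: the strict count treats any transfer to bypass
# as a failure; the loose count treats only dropped loads as failures.
afr_strict = 0.04  # 4% of units "fail" per year under the strict definition
afr_loose = 0.01   # 1% per year under the loose definition

print(f"Strict definition: {mtbf_from_afr(afr_strict):,.0f} hours MTBF")
print(f"Loose definition:  {mtbf_from_afr(afr_loose):,.0f} hours MTBF")
```

Same hardware, same field data, a fourfold difference in the headline MTBF, which is exactly why two vendors’ numbers are only comparable when the definitions and assumptions behind them match.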

I’m going to give an example of this in next week’s blog.  In order to do so, I will also explain a common method for calculating MTBF using the Annual Failure Rate, or AFR.  Hopefully, by the end of next week’s blog you will have a good grasp of MTBF and be able to avoid its pitfalls in your data center. If you would like to look into this topic a little deeper, please check out white paper 78, “Mean Time Between Failure: Explanations and Standards”.


Please follow me on Twitter @DomenicAlcaro


About Domenic Alcaro:

Domenic Alcaro is the Vice President of Mission Critical Services and Software. Prior to his current role, Domenic held technical, sales, and management roles during his more than 14 years at Schneider Electric. In his most recent role as Vice President, Enterprise Sales, he was responsible for helping large corporations improve their enterprise IT infrastructure availability.





