Understanding Mean Time between Failure in the Data Center – Part 2


In this blog I am going to finish the topic I started in last week’s blog on understanding MTBF. If you recall, I stated towards the end of last week’s blog that MTBF calculations can lead to numbers that are clearly nonsensical but that may have some value in a relative sense. That is, the reliability of two products or systems can be compared IF calculated in EXACTLY the same way and IF ALL the same assumptions are made. I’m going to give an example of this, but first let me explain a common method for calculating MTBF using the Annual Failure Rate, or AFR. The AFR is the percentage of a population of units that are expected to “fail” in a calendar year. Again, note that “fail” is in quotes because a precise definition of what constitutes a failure is required. For a product that is in use continuously (i.e., turned on and left on until failure), the AFR and MTBF are mathematically linked according to:

MTBF(hrs) = 876,000 / AFR(%)

(The 876,000 figure comes from 8,760 hours per year multiplied by 100 to convert from a percentage.)
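As a quick illustration, the conversion can be sketched in a few lines of Python. The function name is hypothetical, chosen here only for illustration, and the sketch assumes continuous operation exactly as the formula does.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def mtbf_hours_from_afr(afr_percent):
    """Estimate MTBF in hours from an AFR given in percent.

    Assumes the unit runs continuously (turned on and left on),
    which is the condition the formula above is stated for.
    """
    if afr_percent <= 0:
        raise ValueError("AFR must be greater than zero")
    return HOURS_PER_YEAR * 100 / afr_percent


print(mtbf_hours_from_afr(5))  # 175200.0
```

Halving the AFR doubles the MTBF, which is worth keeping in mind when small differences in measured AFR turn into large differences in quoted MTBF.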

Therefore, a unit with an AFR of 10% has an MTBF of 87,600 hours. Because field returns are tracked on a calendar basis and not an operating-hour basis, AFR is the only quantity that can actually be measured directly. MTBF is then estimated from the AFR by applying the above formula. There is a vast library of assumptions and sometimes imprecisely known factors that affect AFR, but we’ll leave that discussion for a later time. Now, let’s move on to my example.

If 100 standard incandescent light bulbs (forgive me, Green Gods, for not using CFLs) are all placed into continuous operation (turned on and left that way) on day 1 and exactly one month later 1 has failed, we have an AFR of (1/100) × 12 = 12% at that point in time. Using the above formula, this gives us an MTBF of 73,000 hours, which equates to 8.3 years. Now, we know from experience that ALL the light bulbs will likely have failed by the end of 1 year, so an 8.3-year MTBF is meaningless in an absolute sense. However, when comparing a light bulb from manufacturer A with one from manufacturer B, it is valid to consider the MTBFs of each manufacturer IF AND ONLY IF they were calculated the exact same way with all the same assumptions. Also note that light bulb manufacturers have the luxury of manufacturing a product whose failure is not usually “mission critical,” whose lifespan is relatively short, and whose volume is very large, so they can calculate MTBF with better accuracy. As a result, you’ll commonly see an “MTBF” for continuous operation of 2,000 hours, or roughly 3 months, quoted for a standard incandescent bulb.
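The light bulb arithmetic can be reproduced with a short sketch. The helper name and its parameters are hypothetical, introduced only for this illustration; the numbers are the ones from the example above.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def annualized_afr_percent(failures, population, months_observed):
    """Scale failures seen over a short window up to a 12-month AFR (%)."""
    return 100 * failures * 12 / (population * months_observed)


# 100 bulbs, 1 failure after exactly one month of continuous operation.
afr = annualized_afr_percent(failures=1, population=100, months_observed=1)
mtbf_hours = HOURS_PER_YEAR * 100 / afr
mtbf_years = mtbf_hours / HOURS_PER_YEAR

print(f"AFR:  {afr:.0f}%")                                        # AFR:  12%
print(f"MTBF: {mtbf_hours:,.0f} hours ({mtbf_years:.1f} years)")  # MTBF: 73,000 hours (8.3 years)
```

The annualizing step assumes the failure rate seen in the first month holds for the whole year. That assumption is what produces an 8.3-year figure for a product that rarely lasts one year, and it is the kind of assumption that has to match before two MTBF numbers can be compared.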

I’ve seen MTBFs for UPSs quoted in the range of 100 years. This is not a practical number, but if you can assume that another UPS manufacturer’s MTBF was calculated in the EXACT same way with all the same EXACT assumptions, then you might be able to use the comparison in making some decisions. That is not likely to be the case, so please proceed with great caution when utilizing this metric. If you are interested in getting a more technical feel for MTBF, I recommend white paper 112, “Performing Effective MTBF Comparisons for Data Center Infrastructure”.
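To put the 100-year UPS figure in perspective, the AFR formula from earlier can be run in reverse. This is only a sketch under the same continuous-operation assumption, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def afr_percent_from_mtbf_years(mtbf_years):
    """Invert MTBF(hrs) = 876,000 / AFR(%) for an MTBF quoted in years."""
    if mtbf_years <= 0:
        raise ValueError("MTBF must be greater than zero")
    mtbf_hours = mtbf_years * HOURS_PER_YEAR
    return HOURS_PER_YEAR * 100 / mtbf_hours


print(afr_percent_from_mtbf_years(100))  # 1.0
```

Read this way, a 100-year MTBF says that roughly 1 unit in 100 would be expected to fail in a given year, not that any one UPS will run for a century.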

 

Please follow me on Twitter @DomenicAlcaro

 

About Domenic Alcaro:

Domenic Alcaro is the Vice President of Mission Critical Services and Software. Prior to his current role, Domenic held technical, sales, and management roles during his more than 14 years at Schneider Electric. In his most recent role as Vice President, Enterprise Sales, he was responsible for helping large corporations improve their enterprise IT infrastructure availability.

 

 

 

 


Conversation

  • Hello Domenic,

    Thanks for those two articles, it really helped with understanding MTBF, as some other scientific papers on the matter were a bit over my head. I read some, and I just didn’t GET how I am supposed to use the number given its highly relative nature.

    So there are no industry-wide standards and/or certifications for this after all. I think there should be!

    I just tried to make some sense out of hard drive MTBFs, but the companies specifying those numbers don’t even tell us how they determine them and how to use them. See the Hitachi Global Storage “definition”:

    “MTBF target is based on a sample, aggregate population of a drive family and is estimated by statistical measurements and acceleration algorithms under nominal operating conditions. MTBF ratings are not intended to predict an individual drive’s reliability. MTBF does not constitute a warranty”

    Yeah, so what exact measurements? How many drives? For how long have they been tested? Nobody tells you that. It’s truly utterly useless if we can’t even know how the MTBF was calculated!

    In that sense, I think AFR is the only thing I’ll ever pay attention to, unless I’m an early adopter of any specific technology and have no access to any AFR numbers yet…

    Best,
    Michael
