MTBF, short for “mean time between failures”, is a common IT buzzword used across many industries. Unfortunately, in some of them it has become widely abused.
The white paper Mean Time Between Failure: Explanation and Standards, published by energy management specialist Schneider Electric, says that MTBF numbers are thrown around without an understanding of what they truly represent. The paper further argues that while MTBF is an indication of reliability, it does not represent the expected service life of a product.
Ultimately, the paper concludes, an MTBF value is meaningless if failure is undefined and the assumptions behind it are unrealistic or missing altogether. This is because MTBF rests largely on assumptions and on a definition of failure, and attention to these details is paramount to proper interpretation. Importantly, a business may not achieve its reliability target without a solid understanding of MTBF.
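A short worked example may help show why an MTBF figure is not a service life. It assumes a constant failure rate (written λ below), which is a common textbook simplification and an assumption made here for illustration, not a statement taken from the paper. R(t) denotes the probability that a unit is still working at time t.

```latex
\lambda = \frac{1}{\mathrm{MTBF}}, \qquad
R(t) = e^{-\lambda t} = e^{-t/\mathrm{MTBF}}, \qquad
R(\mathrm{MTBF}) = e^{-1} \approx 0.37
```

Under that assumption, only about 37% of units would be expected to run for a full MTBF interval without failing. An MTBF of 200,000 hours (a made-up figure) therefore does not mean that a typical unit lasts 200,000 hours.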
So what is a failure? According to the paper, MTBF is often quoted without a definition of failure. Schneider Electric believes that this practice is not only misleading but also completely useless.
A similar practice would be to advertise the fuel efficiency of an automobile as “kilometres per tank” without defining the capacity of the tank in litres.
To address this ambiguity, there are two basic definitions of a failure in line with IEC-50:
* The termination of the ability of the product as a whole to perform its required function.
* The termination of the ability of any individual component to perform its required function but not the termination of the ability of the product as a whole to perform.
The following two examples published in the paper illustrate how a particular failure mode in a product may or may not be classified as a failure, depending on the definition chosen:
* Example one – if a redundant disk in a RAID array fails, the failure does not prevent the RAID array from performing its required function of supplying critical data at any time. However, the disk failure does prevent a component of the disk array from performing its required function of supplying storage capacity. Therefore, according to the first definition, this is not a failure, but according to the second definition, it is.
* Example two – if the inverter of an uninterruptible power supply (UPS) fails and the UPS switches to static bypass, the failure does not prevent the UPS from performing its required function, which is supplying power to the critical load. However, the inverter failure does prevent a component of the UPS from performing its required function of supplying conditioned power. Similar to the previous example, this is only a failure by the second definition.
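The effect of the chosen definition on a quoted MTBF can be sketched in a few lines of Python. This is only an illustration: the event log, the 120,000 operating hours, and the names `FailureEvent` and `mtbf_hours` are all hypothetical, and MTBF is computed here simply as operating hours divided by the number of counted failures.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FailureEvent:
    """One logged event; every field and value here is hypothetical."""
    description: str
    product_function_lost: bool  # True if the product as a whole stopped performing


def mtbf_hours(operating_hours: float, events: list[FailureEvent], definition: int) -> float:
    """Operating hours divided by the number of events that count as failures.

    definition=1: the product as a whole lost its required function.
    definition=2: a component failed but the product kept performing.
    """
    if definition == 1:
        counted = [e for e in events if e.product_function_lost]
    elif definition == 2:
        counted = [e for e in events if not e.product_function_lost]
    else:
        raise ValueError("definition must be 1 or 2")
    if not counted:
        return float("inf")  # no countable failure observed in the period
    return operating_hours / len(counted)


# Hypothetical log covering 120,000 accumulated operating hours.
events = [
    FailureEvent("redundant RAID disk failed", product_function_lost=False),
    FailureEvent("UPS inverter failed, static bypass engaged", product_function_lost=False),
    FailureEvent("UPS dropped the critical load", product_function_lost=True),
]

print(mtbf_hours(120_000.0, events, definition=1))  # 120000.0
print(mtbf_hours(120_000.0, events, definition=2))  # 60000.0
```

The same log yields figures that differ by a factor of two, which is why an MTBF quoted without its definition of failure cannot be compared with another.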
If only two definitions existed, defining a failure would seem rather simple. Unfortunately, when product reputation is on the line, the matter becomes almost as complicated as MTBF itself. In reality, there are more than two definitions of failure; in fact, there are infinitely many.
The paper goes on to note that definitions of failure depend on the type of product. Quality-driven manufacturers track all modes of failure for the purpose of process control, which, among other benefits, drives out product defects.
Additional questions are therefore needed to define a failure accurately. Is customer misapplication considered a failure? Are load drops caused by a vendor’s service technician counted as failures? Is it possible that the product design itself increases the failure probability of an already risky procedure? Is the wear-out of a consumable item such as a battery considered a failure if it occurs prematurely? Is shipping damage considered a failure?
Clearly, a failure must be defined, and that definition understood, before attempting to interpret any MTBF value.