Quote of the Day
A theory has to be simpler than the data it explains.
I was in an interminable meeting the other day where we were discussing the MTBF and availability of a system. The trouble was that each person in the room preferred to think about these terms in a different way. In this post, I will show that the four people in the meeting were actually in violent agreement and simply did not understand that their arguments were mathematically equivalent.
I wish I could say that this was the first time in my career that this had happened, but that would not be true. It happens all the time.
I will try to summarize the argument as simply as I can:
| Person | Requirement as stated |
|---|---|
| Person 1 | The system must conform to GR-909 – a telecommunications specification that specifies system availability. |
| Person 2 | The system must have an availability of at least 99.999%. |
| Person 3 | The system must have a downtime (i.e. unavailability) of less than 5 minutes per year. |
| Person 4 | The system must have a Mean Time Between Failures (MTBF) of 68.4 years. |
- Availability
- The ratio of (a) the total time a functional unit is capable of being used during a given interval to (b) the length of the interval. For example, a unit that is capable of being used 100 hours out of a 168-hour week would have an availability of 100/168. In high availability applications, a metric known as "nines", corresponding to the number of nines following the decimal point, is used. With this convention, "five nines" equals 0.99999 (or 99.999%) availability (Source).
- Mean Time Between Failures (MTBF)
- MTBF describes the expected time between two failures for a repairable system (Source).
- Mean Time To Repair (MTTR)
- MTTR represents the average time required to repair a failed component or device (Source).
- Mean Time to Failure (MTTF)
- MTTF denotes the expected time to failure for a system that requires a repair with an MTTR of a given value. For our purposes here, MTBF = MTTF + MTTR.
- Failure Rate (FR)
- Failure rate is the frequency with which an engineered system or component fails, expressed in failures per unit of time (e.g. failures per 1E9 hours).
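The definitions above can be turned into a few small conversion functions. This is only a minimal sketch: the function names are my own, the year is assumed to be 365 days long, MTBF is taken to equal MTTF plus MTTR, and the failure rate is assumed to be constant and equal to 1/MTBF.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a 365-day year (8760 hours)


def availability_from_uptime(uptime_hours, interval_hours):
    """Availability as the ratio of usable time to the interval length."""
    return uptime_hours / interval_hours


def availability_from_mtbf(mtbf_hours, mttr_hours):
    """Availability as MTTF / MTBF, assuming MTBF = MTTF + MTTR."""
    mttf_hours = mtbf_hours - mttr_hours
    return mttf_hours / mtbf_hours


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR * 60


def failure_rate_per_1e9_hours(mtbf_hours):
    """Failures per 1E9 hours, assuming a constant failure rate of 1/MTBF."""
    return 1e9 / mtbf_hours


if __name__ == "__main__":
    # The 100-hours-per-week example from the availability definition.
    print(availability_from_uptime(100, 168))
    # "Five nines" expressed as minutes of downtime per year (a bit over 5).
    print(downtime_minutes_per_year(0.99999))
```

Keeping each conversion in its own function mirrors how the different departments talk: one person can call the availability function while another calls the downtime function, and both are looking at the same underlying number.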
Figure 2 summarizes my demonstration of the equality of each person's argument.
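For readers without the figure at hand, the chain of reasoning can be sketched roughly as follows. This is my own reconstruction and not necessarily the steps shown in the figure. It assumes a 365-day year (525,600 minutes) and, importantly, an MTTR of 6 hours, a value that is not stated in the text but is approximately what reconciles the MTBF figure with five nines. Here U is shorthand for unavailability.

```latex
\begin{align*}
U &= 1 - A = 1 - 0.99999 = 10^{-5} \\
\text{downtime per year} &= U \times 525{,}600\ \text{min} \approx 5.26\ \text{min} \\
A &= \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}
   = \frac{\text{MTBF} - \text{MTTR}}{\text{MTBF}}
   \quad\Longrightarrow\quad
   \text{MTBF} = \frac{\text{MTTR}}{U} \\
\text{MTBF} &= \frac{6\ \text{h}}{10^{-5}} = 600{,}000\ \text{h}
   \approx 68.5\ \text{years}
\end{align*}
```

Under these assumptions, five nines, roughly 5 minutes of downtime per year, and an MTBF of about 68 years are the same requirement written in three different units.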
It took about 30 minutes to get everyone in the meeting to understand that they were all stating the same requirement. The problem arises because different departments work in different units. System engineers and industry specifications speak in terms of availability. Hardware engineers speak in terms of MTBF. Customer Service people speak in terms of downtime per year.
The "elephant in the room" was the fact that most systems fail because of software bugs, and these reliability calculations ignore software bugs.