Quote of the Day
If I had to give credit to the instruments and machines that won us the war in the Pacific, I would rate them in this order: submarines first, radar second, planes third, bulldozers fourth.
— Admiral Bull Halsey. I have always found the performance of US submarines during WW2 amazing considering the challenges that they faced with faulty torpedoes.
One of the more distasteful tasks I need to do is make estimates of annual product failure rates using MTBF predictions based on part count methods. I find this task distasteful because I have never seen any indication that MTBF predictions are correlated in any way with field failure rates. This is not solely my observation – the US Army has cancelled its use of part count method MTBF predictions (i.e. based on MIL-HDBK-217). However, the telecommunications industry has continued to use these predictions through their use of Telcordia SR-332, which is similar to MIL-HDBK-217. If you want a simple example of an SR-332-based reliability prediction, see this very clear example from Avago. The parts count method assumes that components fail at a constant rate (green line in Figure 1).
While this calculation is simple, it is useful to discuss why the results generated are so useless – in fact, I would argue that they drive incorrect business decisions for things like required spare parts inventories.
The basic math here is shown in Equation 1.
- λ_Annualized is the failure rate per year.
- λ is the failure rate (usually expressed in failures per billion hours, i.e. FITs).
- T_Year is the number of hours in a year (8760).
- MTBF is the Mean Time Between Failures.
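Written out from these definitions, the relationship presumably takes the following form. This is a reconstruction under the constant-failure-rate assumption, with λ converted to failures per hour; the 10^-9 conversion assumes λ is quoted in failures per billion hours.

```latex
\lambda_{\text{Annualized}}
  = \lambda \cdot T_{\text{Year}}
  = \frac{T_{\text{Year}}}{\text{MTBF}},
\qquad
\text{MTBF} = \frac{1}{\lambda},
\qquad
\lambda\,[\text{per hour}] = \lambda\,[\text{per } 10^{9}\ \text{hours}] \times 10^{-9}
```

Strictly speaking, the fraction of units failing in a year under this model is 1 - exp(-λ · T_Year); the product λ · T_Year is the usual small-rate approximation.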
The shortcomings of the part count method are many:
- It assumes a constant, memoryless failure rate.
  - A new part fails at the same rate as an old one.
  - Total operating hours is all that matters.
    - This means 1000 parts operating for one hour are expected to produce the same number of failures as one part operating for 1000 hours.
- It assumes that a part's reliability is predictable based on some simple mathematical function.
  - I see wide variations in part failure rates that depend on the part's application and how the vendor builds it.
  - I frequently see lot-dependent component failures.
- Most part failures are not random.
  - They are caused by manufacturing issues, misapplication, environmental issues (e.g. lightning), etc.
  - In some cases, they are caused by wear-out (e.g. I just dealt with a rash of dried-out, ten-year-old electrolytic capacitor failures).
- It assumes that all vendors have the same quality level.
- It assumes that a system's failure rate is the sum of all the individual component failure rates.
  - Many issues are related to interaction problems.
  - This ignores the fact that how you hook up the parts matters.
- Installation issues are a major source of equipment problems.
  - I frequently see installations where there is contamination or wind-generated motion that causes device failure.
  - I have reported on this blog numerous cases of insect infestation.
  - These issues drive field failure rates far more than random part failures.
- The "elephant in the reliability room" is that software failures tend to dominate over hardware failures.
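The first assumption in the list above can be made concrete with a few lines of Python. This is only a sketch of the exponential (constant-rate) model; the failure rate and the function names are hypothetical values chosen for illustration, not taken from any standard.

```python
import math

HOURS_PER_YEAR = 8760
RATE = 100e-9  # hypothetical part: 100 failures per billion hours, in failures/hour


def survival(hours, rate_per_hour):
    """Probability that a part survives `hours` under a constant failure rate."""
    return math.exp(-rate_per_hour * hours)


def conditional_survival(age_hours, extra_hours, rate_per_hour):
    """Probability that a part already `age_hours` old survives `extra_hours` more."""
    return (survival(age_hours + extra_hours, rate_per_hour)
            / survival(age_hours, rate_per_hour))


def expected_failures(part_count, hours_each, rate_per_hour):
    """Expected failure count; only the product part_count * hours_each matters."""
    return rate_per_hour * part_count * hours_each


# Memorylessness: a brand-new part and a ten-year-old part have the same
# chance of surviving the next year under this model.
new_part = survival(HOURS_PER_YEAR, RATE)
old_part = conditional_survival(10 * HOURS_PER_YEAR, HOURS_PER_YEAR, RATE)
print(f"new part: {new_part:.6f}, ten-year-old part: {old_part:.6f}")

# 1000 parts for one hour versus one part for 1000 hours.
print(expected_failures(1000, 1, RATE), expected_failures(1, 1000, RATE))
```

Both comparisons come out equal, which is exactly why the model cannot represent wear-out mechanisms such as the dried-out electrolytic capacitors mentioned above.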
Figure 2 shows my calculations for a made-up example.
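The same kind of calculation can be sketched in Python. The bill of materials below is entirely made up (part names, quantities, and FIT values are hypothetical and are not the numbers from Figure 2); the point is only to show the mechanics of summing part failure rates and annualizing the result.

```python
HOURS_PER_YEAR = 8760
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per billion hours

# Hypothetical bill of materials: (part, quantity, FIT per part).
BOM = [
    ("resistor", 100, 1.0),
    ("ceramic capacitor", 80, 2.0),
    ("electrolytic capacitor", 10, 20.0),
    ("integrated circuit", 12, 50.0),
    ("connector", 4, 10.0),
    ("laser module", 1, 500.0),
]


def system_fit(bom):
    """Part count method: the system rate is the sum of the part rates."""
    return sum(quantity * fit for _, quantity, fit in bom)


def mtbf_hours(fit):
    """MTBF is the reciprocal of the (constant) failure rate."""
    return HOURS_PER_FIT_UNIT / fit


def annualized_failure_rate(fit):
    """Failures per unit per year: rate in failures/hour times hours per year."""
    return fit * HOURS_PER_YEAR / HOURS_PER_FIT_UNIT


fit = system_fit(BOM)
print(f"System failure rate: {fit:.0f} FIT")                        # 1600 FIT
print(f"MTBF: {mtbf_hours(fit):,.0f} hours")                        # 625,000 hours
print(f"Annualized failure rate: {annualized_failure_rate(fit):.2%}")  # 1.40%
```

Note that every objection in the list above applies to this sketch: the sum treats each part as independent, vendor-agnostic, and ageless.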
In general, I find all formal procedures distasteful. In this case, people want a calculation done in a specific manner, and I dutifully comply. However, I know the answer does not reflect reality. Typically, these computed annualized failure rates are roughly 10x what I would consider acceptable annual failure rates for actual products.
I recently had a conversation with an Australian service provider who was having trouble predicting the number of spare parts he needed to have in inventory. The problem he was having traced directly back to this calculation.
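To see how an inflated prediction can distort a spares estimate, here is a minimal sketch. I do not know how this provider actually sized his inventory; the Poisson model, the fleet size, the two failure rates, and the 95% confidence target are all assumptions chosen for illustration.

```python
import math


def spares_needed(fleet_size, annual_failure_rate, confidence=0.95):
    """Smallest spares count k such that P(failures in a year <= k) >= confidence.

    Assumes yearly failures are Poisson-distributed with mean
    fleet_size * annual_failure_rate (an assumption, not a field result).
    """
    mean = fleet_size * annual_failure_rate
    if mean > 700:
        # exp(-mean) would underflow to zero and the loop below would never end.
        raise ValueError("mean too large for this simple sketch")
    term = math.exp(-mean)  # P(exactly 0 failures)
    cdf = term
    k = 0
    while cdf < confidence:
        k += 1
        term *= mean / k  # P(exactly k failures) from P(exactly k - 1)
        cdf += term
    return k


FLEET = 10_000           # hypothetical number of deployed units
PREDICTED_AFR = 0.014    # hypothetical part count prediction (1.4% per year)
OBSERVED_AFR = 0.0014    # hypothetical field rate, 10x lower

print("Spares from prediction:", spares_needed(FLEET, PREDICTED_AFR))
print("Spares from field data:", spares_needed(FLEET, OBSERVED_AFR))
```

With these made-up inputs, the predicted rate calls for several times as many spares as the field rate would justify, and real field failures (installation damage, bad lots, wear-out) may not follow either number.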