Quote of the Day
Innovation has nothing to do with how many R & D dollars you have. When Apple came up with the Mac, IBM was spending at least 100 times more on R & D. It's not about money. It's about the people you have, how you're led, and how much you get it.
— Steve Jobs
Both manufacturers and their customers are very frustrated when customers receive products that fail "out of the box". I recently worked on an "out of the box" failure issue involving units that passed our acceptance tests at the factory but then failed in the field. I thought it might be useful to discuss how this happens and what manufacturers do to minimize the field failure rate (Figure 1 shows a modern electronic assembly facility [source]).
When products are released to production, we generally have a pretty good idea as to what our manufacturing yields and our defect level (aka out of the box failure rate) will be. We have mathematical models for how things fail in manufacturing and generally those models work well. However, all modeling involves making approximations and assumptions. My focus in this note is to show how a minor process problem that our tests could not detect caused some issues for our customers.
We work hard to ensure that our products have a very low defect level, but we cannot make our defect level zero because it is impossible to perform a test for every possible failure mode. All tests have the potential for "false negatives" -- your test fails to find a defect where one exists -- and false negatives are the primary source of "out of box" failures.
In the particular case I am going to address here, a process problem produced a defect in ~0.8% of the units, and our manufacturing tests missed it. This meant that our defect level rose to 0.8% for a while, yet our models said that we should have a defect level of 1 in 10,000 units. I will discuss how we do this modeling and why it failed in this case.
Manufacturing Failure Modes
Classes of Manufacturing Defects
There is an endless variety of ways to categorize manufacturing defects. Here is the categorization I use:
- connection failures
Electronic assemblies often have thousands of interconnects between hundreds or thousands of parts. Every interconnection must be correct. It turns out that connection failures are easy to test for using shorts and opens tests. We actually have good metrics for our ability to detect this type of defect, which we call test coverage.
- environmental failures
Some failures only occur when the assembly is operated at specific temperatures. These failures can be found by testing the hardware at various temperatures. This can be done, but is expensive for high-volume products and it is often not done (we always do it).
- speed failures
Some failures only occur when a specific path in a circuit needs to operate at a certain speed and it cannot. Most of the time, the culprit is a circuit path that has so much delay in it that it cannot switch fast enough (called a "long path"). There are also circuit paths that are so fast that they do not have enough delay in them to switch properly (called a "short path"). These defects are difficult to test for and really must be eliminated by using proper design practices. To provide some level of empirical assurance, we generally must test our systems over temperature using software that operates as similarly as possible to how the customer will use the hardware. This test will not catch every short path or long path problem, but it does ensure that most of the paths are tested. This testing is expensive, but it is the only way to catch most of these problems.
As electronic systems become more complex, finding manufacturing defects becomes much more difficult -- this growth in test difficulty is often referred to as the "combinatorial explosion". Every electronic system can be modeled as a state machine with a large number of state variables, inputs, and outputs. The complexity of this state machine is so great that you cannot possibly test every possible combination of state, input, and output variable in an economically justifiable amount of time.
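To get a feel for the size of the combinatorial explosion, here is a rough Python sketch. The bit counts and the test rate are hypothetical numbers I picked for illustration, not figures from any real product, and the function name is my own.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def exhaustive_test_years(state_bits, input_bits, vectors_per_second):
    """Years needed to apply every (state, input) combination exactly once."""
    combinations = 2 ** (state_bits + input_bits)
    return combinations / vectors_per_second / SECONDS_PER_YEAR

# Hypothetical (and very small) design: 64 state bits, 32 input bits,
# tested at an assumed one billion combinations per second.
years = exhaustive_test_years(64, 32, 1e9)
print(f"{years:.2e} years")
```

Even for this toy-sized design, the answer comes out in the trillions of years, which is why exhaustive testing is never an option.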
So much of manufacturing test focuses on the most likely failure modes, which normally involve interconnect failures. Interconnect failures would typically be caused by:
- soldering problem
Just today I saw a PCB that had two pins that were not soldered. It happens. We do use optical and x-ray inspection gear to minimize this type of defect.
- contamination
Most electronics manufacturing sites are very clean (Figure 1). However, contamination still happens.
- Printed Circuit Board (PCB) defect
PCBs do have flaws, but fortunately continuity testing by the PCB fabricators can minimize their occurrence. Later in this post, I relate a personal story about a particularly difficult PCB problem I faced many years ago.
Stuck-At-Value Failure Modes
The most common failure modes tend to be of the type referred to as "stuck at" failures. There are a number of "stuck at" failure types:
- stuck-at-0 (SA∅)
A node is stuck at a logic "0" value.
- stuck-at-1 (SA1)
A node is stuck at a logic "1" value.
- stuck-at-neighbor (SAN)
A node is erroneously connected to a nearby node.
- stuck-at-open (SAO)
A failure mode usually seen with CMOS circuits. When a transistor fails in a way that it does not conduct current when it is supposed to, it is said to be stuck-at-open. This fault manifests itself as a high impedance state at the output node for a logic state, and under certain conditions the node voltage stays "stuck" at its previous logic state. Because the circuit "remembers" its previous state, these failures are often referred to as memory failures.
The most economical failure modes to test for are SA∅ and SA1. The following discussion focuses on these two.
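To make stuck-at testing concrete, here is a minimal Python sketch of a single stuck-at fault simulation. The three-input circuit, the node names, and the function names are all made up for illustration. The sketch forces each node to 0 or 1 in turn and checks whether any test vector produces an output that differs from the fault-free circuit.

```python
from itertools import product

# Hypothetical circuit: out = (a AND b) OR (NOT c)
NODES = ["a", "b", "c", "n1", "n2", "out"]

def simulate(a, b, c, fault=None):
    """Evaluate the circuit; fault is (node_name, stuck_value) or None."""
    def drive(name, value):
        # A stuck node ignores its driver and holds the stuck value.
        if fault is not None and fault[0] == name:
            return fault[1]
        return value

    a = drive("a", a)
    b = drive("b", b)
    c = drive("c", c)
    n1 = drive("n1", a & b)   # AND gate output
    n2 = drive("n2", 1 - c)   # inverter output
    return drive("out", n1 | n2)

def coverage(test_vectors):
    """Fraction of single stuck-at faults detected by the given vectors."""
    faults = [(node, value) for node in NODES for value in (0, 1)]
    detected = [
        f for f in faults
        if any(simulate(*v) != simulate(*v, fault=f) for v in test_vectors)
    ]
    return len(detected) / len(faults)

all_vectors = list(product((0, 1), repeat=3))
print(coverage(all_vectors))   # every input combination applied
print(coverage([(1, 1, 1)]))   # a single test vector
```

For this toy circuit, applying all eight input combinations detects all 12 single stuck-at faults, while the single vector (1, 1, 1) detects only 4 of them. Real assemblies have far too many nodes for this brute-force approach, but the idea of counting detected faults against possible faults is the same.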
The following definitions will be used in the analysis to follow.
- Yield
- For the discussion here, yield is defined as the number of units passing our manufacturing test process divided by the number of units going into that process over a specified period of time.
- Defect Level
- Defect level is the percentage of shipped units that customers find defective. Estimating your defect level requires monitoring your customer return database very carefully and filtering out bogus failure reports. Many reports of customer failures actually have nothing to do with a problem with product manufacturing. I would argue that well over half of reported customer failures are actually customer training issues, which we call No-Trouble-Founds (NTFs). This rate of NTFs has been much the same at all the companies I have worked for (five companies at this point).
- Test Coverage
- The percentage of defects that can be found during manufacturing test versus the total number of possible defects. In many ways, test coverage is a bit of a fantasy. We have certain types of potential defects that we know and understand well enough that we can actually count them. For example, we can do very thorough shorts and opens tests on PCBs.
However, I have seen many PCBs fail in use even though they passed a shorts and opens test. One of the most difficult troubleshooting tasks I have dealt with occurred at HP and involved PCB artwork with a tiny crack in a trace. This trace would become open under certain environmental conditions in the field. That took forever to track down because the problem occurred during a rare software operation and in a seemingly random manner. When we looked at that trace on the film, I remember wondering if something so small could cause a problem -- it turns out it did. This really taught me the importance of attention to detail.
My objective here is to show you how sensitive this type of analysis is to the validity of the assumptions. I have plenty of evidence that supports the model I am about to present when all interconnect failure modes have roughly the same probability of occurrence. I will then discuss how the model completely fell apart when this assumption was violated. This is just a warning about modeling. To quote George E.P. Box, a famed statistician, “Essentially, all models are wrong, but some are useful.” We must always remind ourselves that models are used to provide us insight, but they do not necessarily reflect reality.
Defect Level Formula
I am going to apply a standard defect model for integrated circuits to electronic assemblies. The basic assumptions are the same and the resulting formula has modeled defect level for assemblies well for me in the past. I am not the only one to use this approach (example).
Equation 1 shows a commonly used model for chip-level defect level as a function of yield and test coverage. You often see companies modify Equation 1 to make it fit their particular circumstances (e.g. Toshiba). In this post, I will be working with Equation 1 unmodified because I want to illustrate in general terms what happened in my particular situation.
- DL is the defect level of our process.
- Y is the yield of our process.
- TC is the test coverage of our process.
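In terms of the symbols just defined, and assuming Equation 1 is the widely used Williams-Brown defect level model (an assumption on my part, though it is the standard chip-level model relating these three quantities), it can be written as:

```latex
\[
  DL = 1 - Y^{\,(1 - TC)}
\]
```

Here DL, Y, and TC are all expressed as fractions between 0 and 1. Note that the formula gives DL = 0 when either the yield or the test coverage is perfect.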
Figure 2 shows my derivation of Equation 1, which closely follows the derivation from this reference.
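For readers who want the gist in text form, here is a sketch of the usual argument behind this kind of model. It is not necessarily identical to the derivation in Figure 2. The symbols n, m, and p are my own: I assume each unit has n possible faults, each occurring independently with the same probability p, and that the tests can detect m of them. In this sketch, Y is the fraction of units with no faults at all.

```latex
\begin{align*}
  Y  &= (1 - p)^{n}
     && \text{probability a unit has no faults} \\
  TC &= \frac{m}{n}
     && \text{fraction of possible faults the tests detect} \\
  P(\text{pass}) &= (1 - p)^{m}
     && \text{no fault among the $m$ tested ones} \\
  DL &= 1 - \frac{Y}{P(\text{pass})}
      = 1 - (1 - p)^{\,n - m}
      = 1 - Y^{\,(1 - TC)}
\end{align*}
```

The equal-probability assumption is the key one. It is exactly the assumption that was violated in the case I describe in this post.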
Figure 3 shows a plot of Equation 1 for various yield and test coverage values. Observe how the defect level really starts to drop when you get your test coverage above 97%.
We routinely achieve test coverage levels above 99%, so I expect my defect level to be something around 1 in 10,000 units -- assuming all faults are equally likely. Equation 1 is often plotted on different scales and it can look quite a bit different (see Appendix A).
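As a quick numerical check, here is a small Python sketch that evaluates the model. It assumes Equation 1 has the standard Williams-Brown form, DL = 1 - Y^(1 - TC), and it uses a hypothetical yield of 99%; neither the function name nor that yield figure comes from our actual process data.

```python
def defect_level(process_yield, test_coverage):
    """Assumed form of Equation 1: DL = 1 - Y**(1 - TC), all as fractions."""
    return 1.0 - process_yield ** (1.0 - test_coverage)

# Hypothetical yield of 99%, swept over several test coverage values.
for tc in (0.90, 0.97, 0.99, 0.999):
    dl = defect_level(0.99, tc)
    print(f"TC = {tc:.1%}  DL = {dl:.6f}")
```

With these assumed numbers, a test coverage of 99% gives a defect level very close to 1 in 10,000, which is in line with what I expect to see.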
When the process issue occurred and our tests did not catch it, our defect level rose to approximately the rate at which the process issue itself occurred. All of a sudden, I saw my defect level grow.
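The arithmetic behind this jump can be sketched in a few lines of Python. The function name is my own, and I am assuming the untested defect occurs independently of the faults covered by the model. Because the tests cannot see the new defect, it passes through to shipped units at the same rate it occurs in the factory.

```python
def shipped_defect_level(modeled_dl, untested_defect_rate):
    """Defect level when an extra, independent defect is invisible to test."""
    # A shipped unit is good only if it avoids both the modeled escapes
    # and the untested process defect.
    return 1.0 - (1.0 - modeled_dl) * (1.0 - untested_defect_rate)

# Modeled defect level of 1 in 10,000 plus a 0.8% untested process defect.
dl = shipped_defect_level(1e-4, 0.008)
print(f"{dl:.4%}")
```

The result is about 0.81%, so the untested process defect completely swamps the 1 in 10,000 that the model predicted.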
Fortunately, we found the problem and I now see my defect level down where I expect it to be.
Because we had a process problem that occurred much more often than any other, and that problem was not detected by our tests, it went out to customers at the same rate it occurred in the factory. We discovered the issue and put in a test that would catch this case, but I am left feeling a little uncomfortable by the whole affair. I am reminded of a quote from one of the Bernoulli boys (Jacob, Daniel, or Johann -- I forget which) about finding an error in a proof, "If there is one tiger in the forest, might not there be more?"
Appendix A: Example Use of Equation 1
You will often see Equation 1 graphed on different scales, which makes it hard to recognize. The following figures illustrate what I mean. Figure 4 is from the Toshiba ASIC Design Guide. Using the same scale, I plotted the same function in Figure 5 using Mathcad. The two plots are the same.