An expert is a man who has made all the mistakes which can be made in a very narrow field.
— Niels Bohr, Nobel Prize winning physicist
I was reading an article this morning about a bug discovered in Boeing's 787 software that will occur after 248 days of continuous operation. The moment I read "248 days", I let out a sigh – I have seen that number before.
Nearly every system I have worked on has encountered a bug that appears when the system is left on longer than a given period of time. In my case, the number was 248.551 days. The cause was the overflow of a 32-bit integer that counted system running time in hundredths of a second; the number matches a signed counter, which tops out at 2^31 - 1 ticks. Figure 2 shows how you can calculate that number.
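The arithmetic behind 248.551 days can be sketched in a few lines of Python. This is only an illustration: it assumes the counter was a signed 32-bit integer (which is what the number implies), and the function and constant names are made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TICKS_PER_SECOND = 100  # the counter ticks in hundredths of a second


def days_until_overflow(bits: int, signed: bool = True) -> float:
    """Days of uptime before a tick counter of the given width overflows."""
    # A signed integer spends one bit on the sign, leaving bits - 1 for the count.
    value_bits = bits - 1 if signed else bits
    ticks = 2 ** value_bits
    return ticks / TICKS_PER_SECOND / SECONDS_PER_DAY


print(round(days_until_overflow(32), 3))                # 248.551
print(round(days_until_overflow(32, signed=False), 3))  # 497.103
```

An unsigned counter of the same width would last twice as long, which delays the bug rather than removing it.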
I read about this type of bug regularly. For example, during the Gulf War, the Patriot missile system had a bug that rendered it ineffective after 100 hours of continuous operation. While the details of this failure are different (24-bit integer, counting tenths of a second, failure actually caused by an inaccurate type conversion), the outcome was the same: system failure. An Ariane 5 rocket was also lost because of an integer overflow.
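To give a feel for how a small conversion error grows with uptime, here is a rough Python sketch of the Patriot case. It leans on the commonly reported account of that failure, which is an assumption here and not something taken from this post: the constant 0.1 was truncated to 23 fractional binary digits in a 24-bit register. All names are hypothetical.

```python
from fractions import Fraction

FRACTION_BITS = 23             # assumption: fractional bits kept for the constant 0.1
TICKS_PER_HOUR = 10 * 60 * 60  # the clock counts tenths of a second


def truncated_tenth(fraction_bits: int) -> Fraction:
    """The value 0.1 chopped (not rounded) to the given number of binary digits."""
    scale = 2 ** fraction_bits
    return Fraction(scale // 10, scale)


def clock_drift_seconds(hours: int, fraction_bits: int = FRACTION_BITS) -> float:
    """Accumulated timing error after the given hours of continuous operation."""
    error_per_tick = Fraction(1, 10) - truncated_tenth(fraction_bits)
    return float(error_per_tick * hours * TICKS_PER_HOUR)


print(f"{clock_drift_seconds(100):.3f} s")  # 0.343 s
```

Under these assumptions the clock is off by about a third of a second after 100 hours. No counter overflows here; the error simply accumulates until it matters.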
Boeing is encountering problems that many other engineers have experienced. For example, like Boeing, I have also dealt with lithium battery issues (see this post). While engineers generally do not make the same mistakes over and over, systems differ enough that previous lessons must be adapted before they apply correctly to a new situation. This is where the problems occur.
I remember reading years ago that NASA would have independent teams develop the same software and run it on different computers, and then have the computers vote on the correct answer. They soon found out that people writing software tend to make similar errors. The same is true for hardware engineers.