Feedback: Software Development Paradigm Trap

Wayne Albin - 8/1/2006

(This message was written to the Editors of Embedded Systems Design magazine, but CCed to me.)

Mark Bereit (May 2006) proposed good solutions to many of the problems of reliable embedded software. What was missing was a way to add robust redundancy. About 40 years ago at the University of Washington, a physics PhD candidate, assisted by a department staff programmer, was using a method well suited for combining with Mr. Bereit's multiple-processor approach. The student's IBM 7094 FORTRAN IV program had many functions calling functions calling functions, and he knew that many calls would produce range errors. He thus wrote three versions of each function, using different algorithms. I remember his once asking me for a third way to compute factorials. I suggested table lookup.

All function calls went through a common routine that was called with the names of three functions and the calling parameters to the functions (the same for all three). The routine would return the median value and was also instrumented to print diagnostic information, used to find bad algorithms.
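
The common routine described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the student's FORTRAN IV code: the names median_call, factorial_loop, factorial_gamma and factorial_table are hypothetical, each version is assumed to return a plain number rather than raise an error, and the diagnostic output is a simple message for any version that disagrees with the median.

```python
import math
import statistics

# Three independent ways to compute n! (all names here are hypothetical).

def factorial_loop(n):
    """Iterative product 1 * 2 * ... * n."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

def factorial_gamma(n):
    """Uses the identity n! = gamma(n + 1), rounded to the nearest integer."""
    return round(math.gamma(n + 1))

_FACTORIAL_TABLE = [1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800]

def factorial_table(n):
    """Table lookup, valid only for 0 <= n <= 10."""
    return _FACTORIAL_TABLE[n]

def median_call(f1, f2, f3, *args, log=print):
    """Call three versions with the same arguments and return the median.

    Assumption: each version returns a number. Any version whose result
    differs from the median is reported through `log`, which is how a bad
    algorithm would be spotted.
    """
    versions = (f1, f2, f3)
    results = [f(*args) for f in versions]
    median = statistics.median(results)
    for f, value in zip(versions, results):
        if value != median:
            log(f"{f.__name__}{args} returned {value}, median is {median}")
    return median

if __name__ == "__main__":
    print(median_call(factorial_loop, factorial_gamma, factorial_table, 5))
```

With three results, the median is correct whenever at least two versions agree on the right answer, so a single wild value from one version cannot reach the caller.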

Today, with Mr. Bereit's approach, three or more algorithms could be used, and sometimes multiple processors could execute each algorithm. This would closely match the stranded-wire or multiple-fastener approach used in automobiles, as well as the limp-home feature of truck engines.

Mark Bereit - 8/2/2006

Wayne,

Thanks for including me as a recipient of your e-mail. You're right, I certainly did not touch on redundancy in my article, and using alternate algorithm implementations would not have occurred to me. (I am familiar with the "space shuttle computer" model of several worker CPUs running the same algorithms on the same inputs, and a supervisor CPU monitoring the results and holding a vote in the event of disagreement; I seem to also recall hearing that this once failed significantly when the supervisor CPU went into the weeds.)

Wayne Albin - 8/2/2006

...I read most of what you had on the [article discussion] page, and have a few additional comments.

Using static CMOS processors that have several low-power modes, including one that sleeps until an input transition occurs, can greatly reduce the power consumption of a multiple-CPU approach compared with an FPGA approach.

For reliable operation, you must not have ANY single-point failure mechanism. A jet fighter, for example, has at least two such failure mechanisms: the pilot and the engine. Multiple fighters provide redundancy for those failures. Inside a fighter, attention has been paid to surviving combat damage. For example, after the first Gulf War I was at McClellan Air Force Base (since closed) and was in the fuel bladder repair room. I saw one bladder that had many patches; the wing it was in had been hit by many bullets, but the plane and pilot survived, and the plane was repaired at McClellan.

For a high-reliability system, it should be possible, during a demonstration to a potential customer, to pull and reinstall each circuit board in the system other than passive backplanes. Likewise, it should be possible to disconnect and reconnect each cable, including power cables. It should also be possible to disconnect every cable to any one physical box in the system without causing anything else in the system to fail.

I once examined a redundant system nearing completion for communicating with ATM machines. The first problem I noticed was that a single blower failure would cause a total system failure.

For your many-CPU designs, eliminating single-point failure mechanisms is particularly important. That includes never using the same mask version for redundant processors. (Yes, I have encountered bad masks: in TTL, in memories, and in processors.)