Redundancy is not everything

Even the aviation industry still has something to learn

05 10 10 - 20:08

Years ago, when I was still maintaining highly available server clusters and thinking about how to improve them, I quickly learned that server redundancy by itself only brings complications. The key to a meaningful redundant server setup is the sensing that monitors the health of each server and the logic that acts on those health states. One of the lessons I learned was that when you monitor a parameter via different methods and get different outputs, it's usually the method that's at fault, be it a timing issue or some simple text parsing error (<sarcasm>everyone loves to play with float numbers in bash</sarcasm>).
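To make that lesson concrete, here is a minimal sketch of such a cross-check in Python rather than bash. The function names, the status strings and the 5% tolerance are all made up for illustration; they are not taken from any real monitoring tool. The idea is simply to parse both readings defensively and to report a disagreement as a problem with the measurement, not with the server.

```python
import math
from typing import Optional


def parse_reading(text: str) -> Optional[float]:
    """Parse a numeric reading from probe output; return None if unusable."""
    try:
        value = float(text.strip())
    except ValueError:
        return None
    # float() accepts "nan" and "inf", which are useless as health data.
    return value if math.isfinite(value) else None


def assess(method_a: str, method_b: str, tolerance: float = 0.05) -> str:
    """Compare the same parameter as reported by two different methods.

    The status strings and the default tolerance are hypothetical.
    """
    a = parse_reading(method_a)
    b = parse_reading(method_b)
    if a is None or b is None:
        return "probe_error"       # parsing failed: suspect the probe
    if math.isclose(a, b, rel_tol=tolerance, abs_tol=tolerance):
        return "consistent"        # both methods tell the same story
    return "methods_disagree"      # suspect the measurement, not the server
```

A failover decision would then only be taken on a "consistent" reading, while the other two outcomes would send someone to look at the monitoring itself.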

Now I just read an excellent report from the Dutch Safety Board about the crash of a Turkish Airlines 737-800 near Amsterdam Schiphol Airport last year. I was particularly interested in this accident because I know aircraft usually carry two radio altimeters, and I wondered what chain of events allowed a wrong reading from a single one to lead the plane to crash. Let me present my own view of this report and some thoughts I had about the state of aviation software in general.

Let's begin with the details:

I could expand on each point here, but let me just warn about two.

There's a saying in the IT world: "garbage in, garbage out". If you feed useless data to a system, you'll get useless results on output. I am noticing an increasingly common and worrisome trend of this in the aviation industry. Each sensor that provides input data has more than two states (working, not working), and this fact is often overlooked. In the case of the Turkish accident, we have a radio altimeter feeding an incorrect but perfectly valid measurement to the rest of the avionics. In the case of this 777 the same happened with the accelerometers; one of them was not totally dead but not totally alive either. It fed its garbage to the autopilot, which caused the plane to start jumping around mid-flight. A similar thing happened on this A330 with the inertial reference system: garbage in, garbage out, and people were hitting the cabin ceiling. The same thing seems to have happened in the AF447 accident, but this time with the airspeed indicator, which apparently created confusion in the cockpit that seems to have led to an in-flight breakup.

What's the solution for this? Simple: gather data for the same parameter from different sources. Airbus even filed a patent for such a system for airspeed. None of the instruments should be considered an absolute authority. And this is the other thing I want to warn about: redundancy exists to be used. Avionics software should compare readings from all available sources for a single parameter, not treat the "left" and "right" systems as two separate entities. Two 757s have crashed (both available for viewing in the NG Air Crash Investigations series) because of a clogged pitot-static system, and neither needed to crash. Both GPS and inertial systems were available that could have been used to provide information about the basic aircraft parameters, but the system wasn't designed with this in mind.
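As a rough illustration of what "compare readings from all available sources" could look like, here is a small median-voting sketch in Python. It is emphatically not how any real avionics suite works; the function name `vote`, the source names and the numbers are invented for the example, and it assumes every source has already been converted to the same quantity (height above ground, in feet).

```python
from statistics import median
from typing import Dict, List, Tuple


def vote(readings: Dict[str, float],
         max_deviation: float) -> Tuple[float, List[str]]:
    """Return (consensus value, names of sources that disagree with it).

    The consensus is the median of all sources, so a single source that
    reports a valid-looking but wrong number cannot drag it away.
    """
    if len(readings) < 3:
        raise ValueError("need at least three sources to outvote a bad one")
    consensus = median(readings.values())
    suspects = sorted(
        name for name, value in readings.items()
        if abs(value - consensus) > max_deviation
    )
    return consensus, suspects


# Illustrative values only: one altimeter claims the plane is on the ground.
height_ft = {
    "radio_alt_left": -8.0,
    "radio_alt_right": 1950.0,
    "baro_minus_terrain": 1930.0,   # hypothetical derived source
    "gps_minus_terrain": 1965.0,    # hypothetical derived source
}
consensus, suspects = vote(height_ft, max_deviation=100.0)
```

With these numbers the consensus stays near the three sources that agree, and the left radio altimeter is flagged as suspect instead of being trusted as the single authority for whatever system happens to be wired to it.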

Having learned about the software design and engineering principles behind aviation software, I have a feeling that we'll see more dead bodies caused by these two issues. I'd be happy if someone can prove me wrong.

