Redundancy is not everything
Even the aviation industry still has things to learn
05 10 10 - 20:08 Used tags: aviation, correctnes, design, ha, reliability
Years ago, when I was still maintaining highly available server clusters and thinking about how to improve them, I quickly learned that redundant servers by themselves only bring you complications. The key to a meaningful redundant server setup is the set of sensing methods that monitor the health of each server, together with the logic that acts upon those health states. One of the lessons I learned was that when you monitor some parameter via different methods and get different outputs, it's usually the method that's at fault, be it a timing issue or some simple text parsing error (everyone loves to play with float numbers in bash</sarcasm>).
Now I have just read an excellent report from the Dutch Safety Board about the crash of a Turkish 737-800 near Amsterdam Schiphol Airport last year. I was particularly interested in this accident because I know aircraft usually carry two radio altimeters, and I wondered what chain of events allowed a wrong reading from a single one to lead the plane to crash. Let me present my own view of this report and some thoughts about the state of aviation software in general.
Let's begin with the details:
- The report states that readings like those seen on the accident airplane were recreated in the lab by direct coupling between the transmitter and receiver antennas. If this was indeed the reason for the wrong reading, then there is a serious oversight in the placement of these antennas or in their grounding. FAIL
- The radio altimeter is calibrated to display a height of 0 when the aircraft touches the ground on landing, and a height between -2 and -6 feet when the aircraft is parked at the gate. Therefore, a reading of -8 should be treated as erroneous. But we read that readings from -20 to 2500 feet are accepted as correct. To me this looks like a too generous boundary buffer applied in the wrong place, since it lets a known defect scenario fall within the "correct" measurement range. FAIL
- The report states that the radio altimeter computer has three operating modes: normal, where measurements are accepted; "fail warn", where the altimeter detects an internal error, stops providing measurements and warns about its condition; and "non computed data", where measurements are outside the design specs and errors are silently ignored. So we have a condition where the altimeter can pass on data that is bogus but falls within the "correct" range, and there is nothing above it that would verify its readings. FAIL
- Since the radio altimeter measures height above ground (roughly what a QFE setting gives you in aviation terms), any logic that acts upon its input should figure out that negative values combined with an air speed above stall speed are nonsense, and disregard them. But apparently the auto throttle blindly trusts everything a single radio altimeter serves it, thus creating a single point of failure. FAIL
- Even though there are two radio altimeters on board the aircraft, the values they provide do not seem to be checked against each other at the system level. It seems this is left to the end user of the system, the pilot. The report states that the captain's display showed -8 and that the copilot's display showed the proper height, but this seems to have gone unnoticed. Why this is not done at lower levels is beyond my imagination. FAIL
- In digital environments such as computers, everything (and I mean everything) happens for a reason. The report states that the aircraft produced numerous unusual and unexpected warnings (gear warning while still high, speed brakes warning, captain's roll and pitch bars disappearing), but none were perceived by the crew as a problem. My interpretation of this is based on observing average users: they treat their PCs as persons and "you know, each person can have a bad day. It will behave better next time." Well, wrong. It's a machine, working as instructed by its program. If it does unexpected things, either the program is wrong, the input to the program is wrong, or something is really wrong with the situation you're in. If the machine cannot communicate this fact to its user, we have a design problem with the user interface. There's also a remark in the report that pilots consider radio altimeters to be among the least reliable instruments, ones that fail frequently. Therefore, they're used to ignoring anything that has a radio altimeter at its source. It turned out that their understanding of the possible consequences was lacking. FAIL
- The report states that after this accident the software company behind the autopilot started installing improved versions, in which the autopilot actually compares radio altimeter data with the pressure altimeter. Too late, but let's give them a thumbs up for this.
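The kind of checking argued for in the list above could be sketched roughly like this. It is a toy illustration in Python, not how any real avionics works: the function names and the thresholds are my own assumptions, except for the -6 feet figure, which is the lowest reading expected at the gate as quoted above.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds are illustrative assumptions, not values from the report,
# apart from -6 ft (lowest reading expected when parked at the gate).
MIN_GROUND_READING_FT = -6.0
FLYING_AIRSPEED_KT = 100.0    # hypothetical "clearly airborne" airspeed
MAX_DISAGREEMENT_FT = 20.0    # hypothetical left/right tolerance


@dataclass
class HeightDecision:
    height_ft: Optional[float]  # None means "no trustworthy height"
    reason: str


def plausible(reading_ft: float, airspeed_kt: float) -> bool:
    """Reject readings that cannot be true for the current flight state."""
    if reading_ft < MIN_GROUND_READING_FT:
        return False  # lower than anything seen on the ground
    if reading_ft <= 0.0 and airspeed_kt > FLYING_AIRSPEED_KT:
        return False  # "on the ground" while clearly flying
    return True


def select_height(left_ft: float, right_ft: float,
                  airspeed_kt: float) -> HeightDecision:
    """Use both radio altimeters instead of trusting a single one."""
    left_ok = plausible(left_ft, airspeed_kt)
    right_ok = plausible(right_ft, airspeed_kt)
    if left_ok and right_ok:
        if abs(left_ft - right_ft) > MAX_DISAGREEMENT_FT:
            return HeightDecision(None, "altimeters disagree")
        return HeightDecision((left_ft + right_ft) / 2.0, "both agree")
    if right_ok:
        return HeightDecision(right_ft, "left altimeter rejected")
    if left_ok:
        return HeightDecision(left_ft, "right altimeter rejected")
    return HeightDecision(None, "both altimeters rejected")
```

With the accident's left reading of -8 and a sane right reading, say `select_height(-8.0, 1950.0, 150.0)` (the airspeed here is made up), the sketch drops the left altimeter and falls back to the right one. When it returns `None`, the consumer (the auto throttle, for example) would have to refuse to act on height at all and tell the crew why.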
I could expand on each point here, but let me just warn about two.
There's a saying in the IT world: "garbage in, garbage out". If you feed useless data to a system, you'll get useless results on output. I am noticing an increasingly common and worrisome trend of this in the aviation industry. Every sensor that provides input data has more than two states (working, not working), and this fact is often overlooked. In the case of the Turkish accident, we have a radio altimeter feeding an incorrect but formally valid measurement to the rest of the avionics. In the case of this 777 the same happened with accelerometers; one of them was not totally dead but not totally alive either. It fed its garbage to the autopilot, which caused the plane to start jumping around mid flight. A similar thing happened on this A330 with the inertial reference system: garbage in, garbage out, people were hitting the cabin ceiling. The same thing seems to have happened in the AF447 accident, but this time with the airspeed indicator, which apparently created confusion in the cockpit that seems to have led to an in-flight breakup.
What's the solution for this? Simple: gather data for the same parameter from different sources. Airbus even submitted a patent for such a system for air speed. None of the instruments should be considered an absolute authority.
And this is the other thing I want to warn about - redundancy exists to be used. Avionics software should compare readings from all available sources for a single parameter, not treat the "left" and "right" systems as two separate entities. Two 757s have crashed (both available for viewing in the NG Air Crash Investigations series) because of a clogged pitot-static system - neither needed to crash. Both GPS and inertial systems were available and could have been used to provide information about the basic aircraft parameters, but the system wasn't designed with this in mind.
Learning about the software design and engineering principles behind aviation software, I have a feeling that we'll see more dead bodies caused by these two issues. I'd be happy if someone can prove me wrong.
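To make the two warnings a bit more concrete, here is a small Python sketch of a sensor model with more than two states, plus a median vote across independent sources for the same parameter. Everything in it is hypothetical: the names (`SensorStatus`, `Reading`, `vote`), the source labels and the tolerance are invented for illustration. It also glosses over the real work of turning GPS or inertial data into something comparable with airspeed.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import median
from typing import Dict, List, Optional, Tuple


class SensorStatus(Enum):
    OK = "ok"                  # alive and agrees with the other sources
    FAILED = "failed"          # sensor itself reports a failure
    SUSPECT = "suspect"        # alive, but disagrees with the others
    UNVERIFIED = "unverified"  # alive, but too few sources to cross-check


@dataclass
class Reading:
    source: str
    value: float
    self_reported_ok: bool = True


def vote(readings: List[Reading], tolerance: float
         ) -> Tuple[Optional[float], Dict[str, SensorStatus]]:
    """Median vote; returns (value or None, status per source)."""
    statuses = {r.source: SensorStatus.FAILED
                for r in readings if not r.self_reported_ok}
    alive = [r for r in readings if r.self_reported_ok]
    if len(alive) < 3:
        # With fewer than three sources a liar cannot be outvoted.
        for r in alive:
            statuses[r.source] = SensorStatus.UNVERIFIED
        return None, statuses
    mid = median([r.value for r in alive])
    for r in alive:
        close = abs(r.value - mid) <= tolerance
        statuses[r.source] = SensorStatus.OK if close else SensorStatus.SUSPECT
    trusted = [r.value for r in alive
               if statuses[r.source] is SensorStatus.OK]
    if not trusted:
        return None, statuses
    return median(trusted), statuses
```

Fed with a clogged pitot reading and two other estimates, for example `vote([Reading("pitot", 30.0), Reading("gps", 250.0), Reading("inertial", 255.0)], tolerance=15.0)`, the sketch marks the pitot source as suspect and votes on the remaining two. The point is not the median itself but that a sensor which is "not totally dead but not totally alive" gets a state of its own instead of being trusted blindly.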