Comment by lxgr

Comment by lxgr 4 days ago

15 replies

> This does not have EDAC. I was as surprised as you but it doesn't seems to be an MCU but a composition of several distinct chips.

Wasn't the philosophy back then to run multiple independent (and often even designed and manufactured by different teams) computers and run a quorum algorithm at a very high level?

Maybe ECC was seen as redundant in that model?

RealityVoid 4 days ago

> Wasn't the philosophy back then to run multiple independent (and often even designed and manufactured by different teams) computers and run a quorum algorithm at a very high level?

It was, and they did (well, same design, but they were independent). I quote from the report:

"To provide redundancy, the ADIRS included three air data inertial reference units (ADIRU 1, ADIRU 2, and ADIRU 3). Each was of the same design, provided the same information, and operated independently of the other two"

> Maybe ECC was seen as redundant in that model?

I personally would not eschew any level of redundancy when it can improve safety, even in remote cases. It seems at the moment of the module's creation, EDAC was not required, and it probably was quite more expensive. The new variant apparently has EDAC. They retrofitted all units with the newer variants whenever one broke down. Overall, ECC is an extra layer of protection. The _presumably_ bit flip would be plausible to blame for data spikes. But even so, the data spikes should not have caused the controls issue. The controls issue is a separate problem, and it's highly likely THAT is what they are going to address, in another compute unit.

"There was a limitation in the algorithm used by the A330/A340 flight control primary computers for processing angle of attack (AOA) data. This limitation meant that, in a very specific situation, multiple AOA spikes from only one of the three air data inertial reference units could result in a nose-down elevator command. [Significant safety issue]"

This is most likely what they will address. The other reports confirm that the fix will be in the ELAC produced by Thales and the issue with the spikes detailed in the report was in an ADIRU module produced by Northrop Gruman.

JorgeGT 4 days ago

I don't know about the A320 but this was certainly the model for the Eurofighter. One of my university professors was in one of the teams, they were given the specs and not allowed to communicate with the other teams in any way during the hw and sw development.

  • RealityVoid 4 days ago

    > they were given the specs and not allowed to communicate with the other teams in any way during the hw and sw development.

    Jeez, it would drive me _up the wall_. Let's say I could somewhat justify the security concerns, but this seems like it severely hampers the ability to design the system. And it seems like a safety concern.

    • lambdaone 4 days ago

      What you are trying to minimize here is the error rate of the composite system, not the error rate of the individual modules. You take it as a given that all the teams are doing their human best to eliminate mistakes from their design. The idea of this is to make it likely that the mistakes that remain are different mistakes from those made by the other teams.

      Providing errors are independent, it's better to have three subsystems with 99% reliability in a voting arrangement than one system with 99.9% reliability.

      • saltcured 4 days ago

        This seems like it would need some referees who watch over the teams and intrude with, "no, that method is already claimed by the other team, do something else"!

        Otherwise, I can easily see teams doing parallel construction of the same techniques. So many developments seem to happen like this, due to everyone being primed by the same socio-technical environment...

    • K0balt 4 days ago

      The idea was to build three completely different systems to produce the same data, so that an error or problem in one could not be reasonably replicated in the others. In a case such as this, even ideas about algorithms could result in undesirable similarities that could end up propagating an attack surface, a logic error, or a hardware weakness or vulnerability. The desired result is that the teas solve the problem separately using distinct approaches and resources.

      • Filligree 4 days ago

        And did they?

        Sometimes the solution is obvious, such that if you ask three engineers to solve it you’ll get three copies of the same solution, whereas that might not happen if they’re able to communicate.

        I’m sure they knew what they were doing, but I wonder how they avoided that scenario.

    • lxgr 4 days ago

      How so? It’s a safety measure at least as much as a security one.

      It’s essentially a very intentional trade-off between groupthink and the wisdom of crowds, but it lands on a very different point on that scale than most other systems.

      Arguably the track record of Airbus’s fly-by-wire does them some justice for that decision.

p_l 3 days ago

One thing to remember is that unless you explicitly made a device to operate in space and even there beyond LEO, you quite probably might have seen no requirement for ECC memory for radiation reasons.

ECC memory usage in the past was heavily correlated with, well, way lower quality of the hardware from chips to assembly, electromagnetic interference from unexpected sources, or even customer/field technician errors. Remember an early 1980s single user workstation might require extensive check & fix cycle just from moving it around.

An aircraft component would eliminate all major parts of that, including both through more thorough self-testing, careful sealed design, selection of high grade parts, etc.

The possibility of space radiation causing considerable issues came up as fully digital fly by wire became more common in civilian usage and has led over time to retrofitting with EDAC, but radiation-triggered SEU was deemed low enough risk due to design of the system.

  • addaon 3 days ago

    > One thing to remember is that unless you explicitly made a device to operate in space and even there beyond LEO, you quite probably might have seen no requirement for ECC memory for radiation reasons.

    This does not match my experience (although, admittedly, I've been in the field only a couple decades -- the hardware under discussion predates that). The problem with SEU-induced bit flips is not that errors happen, but that errors with unbounded behavior happen -- consider a bit flip in the program counter, especially in an architecture with variable sized instructions. This drives requirements around error detection, not correction -- but the three main tools here are lockstep processor cores, parity on small memories, and SECDED on large memories. SECDED ECC here is important both because it can catch double errors that happen close together in time, and because memory scrubbing with single error correction allows multiple errors spaced in time to be considered separately. At the system level, the key insight is that detectable failures of a single ECU have to be handled anyway, because of non-transient statistical failures -- connector failures, tin whiskers, etc. The goal, then, is to convert substantially all failures to detectable failures, and then have defined failure behavior (often fail-silent). This leads to dual-dual redundancy architectures and similar, instead of triplex; each channel consists of two units that cross-check each other, and downstream units can assume that commands received from either channel are either correct or absent.

    • p_l 3 days ago

      The incident report on the 2008 case specifically mentions SEU for memory - certain level of EDAC techniques is applied on the whole unit, but one area that was not covered was the possibility of non-catastrophic (in terms of operation) failure of memory module with a bitflip.

      An under-appreciated thing is also that the devices in question used to be rebooted pretty often which triggered self-test routines in addition to the run-time tests - something that didn't trigger anything in case of A330 in 2008, but was impactful in risk assessments missing certain things with 787 some years later (and newer A380/A350 recently).