RealityVoid 4 days ago

See my other comments in the other threads. This does not have EDAC. I was as surprised as you, but it doesn't seem to be an MCU but rather a composition of several distinct chips. That flight computer was designed in the 1990s and updated in 2002 with a new hardware variant that does have EDAC. So yes, for this kind of thing, I can buy that a bit flip happened.

You can see much more data in the report:

https://www.atsb.gov.au/sites/default/files/media/3532398/ao...

lxgr 4 days ago

> This does not have EDAC. I was as surprised as you, but it doesn't seem to be an MCU but rather a composition of several distinct chips.

Wasn't the philosophy back then to run multiple independent (and often even designed and manufactured by different teams) computers and run a quorum algorithm at a very high level?

Maybe ECC was seen as redundant in that model?

  • RealityVoid 4 days ago

    > Wasn't the philosophy back then to run multiple independent (and often even designed and manufactured by different teams) computers and run a quorum algorithm at a very high level?

    It was, and they did (well, same design, but they were independent). I quote from the report:

    "To provide redundancy, the ADIRS included three air data inertial reference units (ADIRU 1, ADIRU 2, and ADIRU 3). Each was of the same design, provided the same information, and operated independently of the other two"

    > Maybe ECC was seen as redundant in that model?

    I personally would not eschew any level of redundancy when it can improve safety, even in remote cases. It seems that at the time of the module's creation, EDAC was not required, and it was probably considerably more expensive. The new variant apparently has EDAC, and they retrofitted units with the newer variant whenever one broke down. Overall, ECC is an extra layer of protection. The _presumed_ bit flip is a plausible culprit for the data spikes. But even so, the data spikes should not have caused the controls issue. The controls issue is a separate problem, and it's highly likely THAT is what they are going to address, in another compute unit.

    "There was a limitation in the algorithm used by the A330/A340 flight control primary computers for processing angle of attack (AOA) data. This limitation meant that, in a very specific situation, multiple AOA spikes from only one of the three air data inertial reference units could result in a nose-down elevator command. [Significant safety issue]"

    This is most likely what they will address. The other reports confirm that the fix will be in the ELAC produced by Thales, while the issue with the spikes detailed in the report was in an ADIRU module produced by Northrop Grumman.

  • JorgeGT 4 days ago

    I don't know about the A320, but this was certainly the model for the Eurofighter. One of my university professors was on one of the teams; they were given the specs and not allowed to communicate with the other teams in any way during the hw and sw development.

    • RealityVoid 4 days ago

      > they were given the specs and not allowed to communicate with the other teams in any way during the hw and sw development.

      Jeez, that would drive me _up the wall_. Let's say I could somewhat justify the security concerns, but this seems like it would severely hamper the ability to design the system. And it seems like a safety concern.

      • lambdaone 4 days ago

        What you are trying to minimize here is the error rate of the composite system, not the error rate of the individual modules. You take it as a given that all the teams are doing their human best to eliminate mistakes from their designs. The idea is to make it likely that the mistakes that remain are different from those made by the other teams.

        Providing errors are independent, it's better to have three subsystems with 99% reliability in a voting arrangement than one system with 99.9% reliability.
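
        As a quick sanity check on that claim (a back-of-the-envelope sketch in Python, assuming failures really are independent and a 2-of-3 majority vote; the numbers are only illustrative):

          # 2-of-3 voting fails only when at least two modules fail at once
          p = 0.01                               # each module fails 1% of the time
          p_triplex = 3 * p**2 * (1 - p) + p**3  # exactly two fail, or all three
          print(p_triplex)                       # ~0.000298, vs 0.001 for one 99.9% module

        So the voting arrangement is roughly 3x more reliable than the single better module, and the gap widens as the per-module error rate shrinks.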

        • saltcured 4 days ago

          This seems like it would need some referees who watch over the teams and intrude with, "no, that method is already claimed by the other team, do something else"!

          Otherwise, I can easily see teams doing parallel construction of the same techniques. So many developments seem to happen like this, due to everyone being primed by the same socio-technical environment...

      • K0balt 4 days ago

        The idea was to build three completely different systems to produce the same data, so that an error or problem in one could not reasonably be replicated in the others. In a case such as this, even shared ideas about algorithms could result in undesirable similarities that end up propagating an attack surface, a logic error, or a hardware weakness or vulnerability. The desired result is that the teams solve the problem separately, using distinct approaches and resources.

      • lxgr 4 days ago

        How so? It’s a safety measure at least as much as a security one.

        It’s essentially a very intentional trade-off between groupthink and the wisdom of crowds, but it lands on a very different point on that scale than most other systems.

        Arguably the track record of Airbus’s fly-by-wire does them some justice for that decision.

  • p_l 3 days ago

    One thing to remember is that unless you explicitly made a device to operate in space (and even then, beyond LEO), you quite probably saw no requirement for ECC memory for radiation reasons.

    ECC memory usage in the past was heavily correlated with, well, much lower quality of the hardware from chips to assembly, electromagnetic interference from unexpected sources, or even customer/field technician errors. Remember, an early-1980s single-user workstation might require an extensive check-and-fix cycle just from being moved around.

    An aircraft component would eliminate all the major parts of that through more thorough self-testing, careful sealed design, selection of high-grade parts, etc.

    The possibility of space radiation causing considerable issues came up as fully digital fly-by-wire became more common in civilian usage, and over time this led to retrofitting with EDAC, but radiation-triggered SEUs were deemed a low enough risk due to the design of the system.

    • addaon 3 days ago

      > One thing to remember is that unless you explicitly made a device to operate in space (and even then, beyond LEO), you quite probably saw no requirement for ECC memory for radiation reasons.

      This does not match my experience (although, admittedly, I've been in the field only a couple of decades -- the hardware under discussion predates that). The problem with SEU-induced bit flips is not that errors happen, but that errors with unbounded behavior happen -- consider a bit flip in the program counter, especially in an architecture with variable-sized instructions.

      This drives requirements around error detection, not correction -- and the three main tools here are lockstep processor cores, parity on small memories, and SECDED on large memories. SECDED ECC is important both because it can catch double errors that happen close together in time, and because memory scrubbing with single-error correction allows multiple errors spaced in time to be considered separately.

      At the system level, the key insight is that detectable failures of a single ECU have to be handled anyway, because of non-transient statistical failures -- connector failures, tin whiskers, etc. The goal, then, is to convert substantially all failures to detectable failures, and then have defined failure behavior (often fail-silent). This leads to dual-dual redundancy architectures and similar, instead of triplex; each channel consists of two units that cross-check each other, and downstream units can assume that commands received from either channel are either correct or absent.
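
      To make the SECDED mechanism concrete, here is a minimal toy sketch in Python: a Hamming(7,4) code extended with an overall parity bit. Real memories use wider codes such as (72,64), and the function names and layout here are mine, not anything from actual flight hardware:

        # Toy SECDED: Hamming(7,4) plus an overall parity bit (8 bits total).
        # Corrects any single-bit flip; detects (but cannot correct) double flips.
        def encode(d):                      # d: list of four data bits
            p1 = d[0] ^ d[1] ^ d[3]         # covers codeword positions 1, 3, 5, 7
            p2 = d[0] ^ d[2] ^ d[3]         # covers codeword positions 2, 3, 6, 7
            p4 = d[1] ^ d[2] ^ d[3]         # covers codeword positions 4, 5, 6, 7
            code = [p1, p2, d[0], p4, d[1], d[2], d[3]]
            overall = 0
            for b in code:
                overall ^= b
            return [overall] + code         # overall parity sits at index 0

        def decode(c):                      # c: 8-bit codeword, possibly corrupted
            syndrome = 0
            for pos in range(1, 8):         # XOR the indices of all set bits
                if c[pos]:
                    syndrome ^= pos
            overall = 0
            for b in c:
                overall ^= b
            if syndrome == 0 and overall == 0:
                return "ok", c
            if overall == 1:                # odd number of flips: single error
                fixed = c[:]
                fixed[syndrome] ^= 1        # syndrome 0 -> the parity bit itself flipped
                return "corrected", fixed
            return "double error detected", None

      A scrubber that periodically decodes every word and writes back the corrected value is what keeps independent single errors from pairing up into uncorrectable double errors -- the "errors spaced in time" point above.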

      • p_l 3 days ago

        The incident report on the 2008 case specifically mentions SEU for memory - a certain level of EDAC techniques is applied across the whole unit, but one area that was not covered was the possibility of an operationally non-catastrophic failure of a memory module via a bit flip.

        An under-appreciated point is also that the devices in question used to be rebooted pretty often, which triggered self-test routines in addition to the run-time tests - something that didn't trigger anything in the case of the A330 in 2008, but was impactful when risk assessments missed certain things with the 787 some years later (and the newer A380/A350 recently).

Reason077 4 days ago

The recalled aircraft include the latest A320neo model, some of which are basically brand new. Why would they be using flight computers from before 2002? Why is an old report from 2008, relating to a completely different aircraft type (A330), relevant to the A320 issue today?

  • t0mas88 4 days ago

    > Why would they be using flight computers from before 2002?

    Because getting a new one certified is extremely expensive. And designing an aircraft with a new type certificate is unpopular with the airlines: since pilots are locked into a single type at a time, a mixed fleet is less efficient.

    Having a pilot switch type is very expensive, in the 50-100k per pilot range. And it comes with operational restrictions: you can't pair a newly trained (on type) captain with a newly trained first officer, so you need to manage all of this.

    • Reason077 4 days ago

      I think you're confusing a type certificate (certifying the airworthiness of the aircraft type) with a type rating, which certifies the pilot is qualified to operate that type.

      Significant internal hardware changes might indeed require re-certification, but it generally wouldn't mean that pilots need to re-qualify or get a new type rating.

      • t0mas88 4 days ago

        No, I meant designing a new aircraft with a new type certificate instead of creating the A320neo generation on the same type certificate. The parent comment wondered why Airbus would keep the old computers around; I tried to explain why they keep so many things the same and only incrementally add variants. Adding a variant allows the aircraft to be flown with the same type rating, or with only differences training (that's what EASA calls it; not sure about the US term), which is much less costly.

    • RealityVoid 4 days ago

      Since the new versions of the same ADIRU have had EDAC since 2002, have been flying on planes since then, and have been swapped in whenever an old unit was returned for repairs, I don't think this is the reason. I think the reason is that there were three ADIRUs, and even if one got wonky, the algorithm on the ELAC flight computer should have made the correct decision. It did not. The ELAC is the one being updated in this case.

  • LiamPowell 4 days ago

    > Why would they be using flight computers from before 2002?

    Why would you assume they're not? I don't know about aircraft specifically, but there's plenty of hardware that uses components older than that. Microchip still makes 8051 clones 45 years after the 8051 was released.

    • K0balt 4 days ago

      That’s just wild to think about. We should all strive to build solutions that plague our descendants with their persistent utility.

      • hylaride 4 days ago

        From a pure safety point of view, it's easier to deal with older but well-understood products, only updating them when there's an actual safety issue. The alternative is having to deal with many generations of tech, as well as permutations with other components, which could get infinitely complicated. On top of that, it's extremely time-consuming and expensive to certify new components.

        There's a reason airlines and manufacturers hem and haw about new models until the economics make it overwhelmingly worthwhile, and even then it can still be a shitshow. The MCAS issue is a case in point of how introducing new tech can cause unexpected problems (made worse by Boeing's internal culture).

        The 787 Dreamliner is also a good example of how hard it is. By all accounts it is a success, but it had some serious teething problems and there are still concerns about long-term wear and tear of the composite materials (though a lot of its problems weren't necessarily from applying new tech, but from Boeing's simultaneous desire to overcomplicate the manufacturing pipeline by outsourcing and spreading out manufacturing).

  • RealityVoid 4 days ago

    The linked report details why the spike happened in the first place in the ADIRU (produced by Northrop Grumman). The recalled controller is the ELAC, which comes from Thales. The problem chain was that, despite the ADIRU spiking, the ELAC should not have reacted the way it did. So they are fixing it in the ELAC.

  • 4ndrewl 4 days ago

    The neo is not brand new - it's an incremental update to the A320; "neo" refers to New Engine Option.

    • rkomorn 4 days ago

      They wrote "some of which are basically brand new", which is technically correct.

      They didn't say the design was brand new.

  • p_l 3 days ago

    For the same reasons that Honeywell is building new devices with AMD 29050 CPUs today[1] - by sticking with the same CPU ISA, they can avoid recertifying a portion of the software stack.

    [1] Honeywell actually bought a full license from AMD and operates a fabless team that ensures they have stock and, if necessary, can update the chip.

  • Havoc 4 days ago

    > Why would they be using flight computers from before 2002?

    Guessing that using previously certified hardware is an advantage.

  • RealityVoid 4 days ago

    Because the problem isn't just this. It's also that the flight computer did not properly decide what to do when the data spiked as a result of this issue.

Liftyee 4 days ago

What does EDAC mean here? I wasn't able to find a definition. My guess is "error detection and correction"?

Difference between it and ECC?

  • K0balt 4 days ago

    EDAC is the concept; ECC is a family of algorithmic solutions in service of that concept. Specific implementations of ECC are the engineering solutions that realize a particular code in specific devices, at the hardware or software level.

    It's confusing because EDAC and ECC seem to mean the same thing, but ECC is a term primarily used for memory integrity, whereas EDAC is a system-level concept.

  • RealityVoid 4 days ago

    That was my initial confusion as well. It means exactly what you guessed: "error detection and correction". The term is also spelled out in the report. I asked Claude about it (caveat emptor), and it said EDAC is the correct name for the circuitry and implementation itself, whereas ECC is the algorithm. Gemini said that EDAC is the general technique and ECC is one implementation variant. So, at this point, I'm not sure. They are used interchangeably (maybe wrongly so), and in this case we're referring to essentially the same thing, with maybe some small differences in the details. In my professional life I have almost always said ECC, but the report uses only EDAC, so I tried to stay consistent with it.

    • Normal_gaussian 4 days ago

      Large portions of this comment provide zero to negative value. You've quoted two LLMs and couched it all in "caveat emptor" and "so I'm not sure". The rest of your comment then muses over this data you do not trust, using generalities ("my professional life": are you a JS software engineer? A chip design specialist at ARM? A security researcher?).

      All of the value of your comment comes from the first sentence and the last two.

      • RealityVoid 4 days ago

        Sheesh, tough crowd.

        • ImPostingOnHN 4 days ago

          Feel free to consult LLMs, with all their downsides (like you needing to verify what they say, because it could be totally wrong).

          What you're doing here is half the job: consulting an LLM and sharing the output without verifying whether it is true. You're then saying 'okay everyone else, finish my job for me, specifically the hard part of it (the verification), while I did the easy part (asking a magic 8 ball)'.

          From this perspective, your comment could be viewed as disrespectful of others by asking them to finish your job, and of negative value because it could be totally hallucinated and false, and you didn't care enough about others to find out before posting it.

          tl;dr: 'I asked an LLM and it said X' will likely, for the near future, be downvoted just like 'I flipped a coin and it said X'. You should be pretty confident that what you post is not false before posting it, regardless of how you came up with it.

    • dgacmu 4 days ago

      The more correct and general answer is that:

      - EDAC is a term that encompasses anything used to detect and correct errors. While this almost always involves redundancy of some sort, _how_ it is done is unspecified.

      - The term ECC, used stand-alone, refers specifically to adding redundancy to data in the form of an error-correcting code. But it is not a single algorithm - there are many ECC / FEC codes, from Hamming codes used on small chunks of data such as data stored in RAM, to block codes like Reed-Solomon more commonly used on file storage.

      - The term ECC memory could really just mean "EDAC memory", but in practice error-correcting codes are _the_ way you'd do this from a cost perspective, so it works out. I don't think most systems would do triple redundancy on just the RAM -- at that point you'd run an independent microcontroller with the RAM to get higher-level TMR.
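
      For contrast, here's what a bit-level TMR voter would look like (a toy Python sketch of my own; the avionics discussed above vote at a much higher level, between whole units):

        # Word-level TMR voter: each bit takes the majority value of three copies.
        def vote(a: int, b: int, c: int) -> int:
            return (a & b) | (a & c) | (b & c)

        assert vote(0b1010, 0b1010, 0b0010) == 0b1010  # one corrupted copy is outvoted

      The cost asymmetry is why ECC wins for memory: TMR stores every bit three times, while a SECDED code like (72,64) adds only 8 check bits per 64 data bits.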

asdefghyk 3 days ago

Specifically, in the above document, APPENDIX G: ELECTROMAGNETIC RADIATION describes in detail how radiation (possibly from the sun) can cause flipped bits and other errors in electronic circuits. We are also at the peak of the 11-year sunspot cycle...