Comment by addaon 5 days ago

89 replies

I’d really, really like to know what microcontroller family this was found on. Assuming that this is a safety processor (lockstep, ECC, etc.), it suggests that ECC was insufficient for the level of bit flips they’re seeing — and if the concern is data corruption, not unintended restart, it means there were enough flips in one word to be undetectable. The environment they’re operating in isn’t that different from everyone else’s, so unless they ate some margin elsewhere (bad voltage corner or something), this can definitely be relevant to others. Also would be interesting to know if it’s NVM or SRAM that’s affected.

RealityVoid 4 days ago

See my other comments in the other threads. This does not have EDAC. I was as surprised as you, but it doesn't seem to be an MCU but rather a composition of several distinct chips. That flight computer was designed in the '90s and updated in 2002 with a new hardware variant that does have EDAC. So yes, for this kind of thing, I can buy that a bit flip happened.

You can see much more data in the report:

https://www.atsb.gov.au/sites/default/files/media/3532398/ao...

  • lxgr 4 days ago

    > This does not have EDAC. I was as surprised as you, but it doesn't seem to be an MCU but rather a composition of several distinct chips.

    Wasn't the philosophy back then to run multiple independent (and often even designed and manufactured by different teams) computers and run a quorum algorithm at a very high level?

    Maybe ECC was seen as redundant in that model?

    • RealityVoid 4 days ago

      > Wasn't the philosophy back then to run multiple independent (and often even designed and manufactured by different teams) computers and run a quorum algorithm at a very high level?

      It was, and they did (well, same design, but they were independent). I quote from the report:

      "To provide redundancy, the ADIRS included three air data inertial reference units (ADIRU 1, ADIRU 2, and ADIRU 3). Each was of the same design, provided the same information, and operated independently of the other two"

      > Maybe ECC was seen as redundant in that model?

      I personally would not eschew any level of redundancy when it can improve safety, even in remote cases. It seems that at the moment of the module's creation, EDAC was not required, and it was probably also quite a bit more expensive. The new variant apparently has EDAC. They retrofitted units with the newer variant whenever one broke down. Overall, ECC is an extra layer of protection. The _presumed_ bit flip is a plausible culprit for the data spikes. But even so, the data spikes should not have caused the controls issue. The controls issue is a separate problem, and it's highly likely THAT is what they are going to address, in another compute unit.

      "There was a limitation in the algorithm used by the A330/A340 flight control primary computers for processing angle of attack (AOA) data. This limitation meant that, in a very specific situation, multiple AOA spikes from only one of the three air data inertial reference units could result in a nose-down elevator command. [Significant safety issue]"

      This is most likely what they will address. The other reports confirm that the fix will be in the ELAC produced by Thales, and that the issue with the spikes detailed in the report was in an ADIRU module produced by Northrop Grumman.

    • JorgeGT 4 days ago

      I don't know about the A320, but this was certainly the model for the Eurofighter. One of my university professors was on one of the teams; they were given the specs and were not allowed to communicate with the other teams in any way during the hardware and software development.

      • RealityVoid 4 days ago

        > they were given the specs and were not allowed to communicate with the other teams in any way during the hardware and software development.

        Jeez, it would drive me _up the wall_. Let's say I could somewhat justify the security concerns, but this seems like it severely hampers the ability to design the system. And it seems like a safety concern.

    • p_l 3 days ago

      One thing to remember is that unless you explicitly made a device to operate in space (and even then, beyond LEO), there was quite probably no requirement for ECC memory for radiation reasons.

      ECC memory usage in the past was heavily correlated with, well, way lower quality of the hardware from chips to assembly, electromagnetic interference from unexpected sources, or even customer/field technician errors. Remember that an early-1980s single-user workstation might require an extensive check-and-fix cycle just from being moved around.

      An aircraft component would eliminate the major sources of all that through more thorough self-testing, careful sealed design, selection of high-grade parts, etc.

      The possibility of space radiation causing considerable issues came up as fully digital fly-by-wire became more common in civilian use, and over time it has led to retrofitting with EDAC, but radiation-triggered SEUs were deemed a low enough risk given the design of the system.

      • addaon 3 days ago

        > One thing to remember is that unless you explicitly made a device to operate in space (and even then, beyond LEO), there was quite probably no requirement for ECC memory for radiation reasons.

        This does not match my experience (although, admittedly, I've been in the field only a couple decades -- the hardware under discussion predates that). The problem with SEU-induced bit flips is not that errors happen, but that errors with unbounded behavior happen -- consider a bit flip in the program counter, especially in an architecture with variable sized instructions. This drives requirements around error detection, not correction -- but the three main tools here are lockstep processor cores, parity on small memories, and SECDED on large memories. SECDED ECC here is important both because it can catch double errors that happen close together in time, and because memory scrubbing with single error correction allows multiple errors spaced in time to be considered separately. At the system level, the key insight is that detectable failures of a single ECU have to be handled anyway, because of non-transient statistical failures -- connector failures, tin whiskers, etc. The goal, then, is to convert substantially all failures to detectable failures, and then have defined failure behavior (often fail-silent). This leads to dual-dual redundancy architectures and similar, instead of triplex; each channel consists of two units that cross-check each other, and downstream units can assume that commands received from either channel are either correct or absent.
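
        To make the SECDED part concrete, here is a minimal sketch in C (a toy, not any real avionics implementation) of a Hamming(7,4) code plus an overall parity bit protecting a single 4-bit value. Real ECC memories do the same thing in hardware over much wider words (e.g. 64 data bits plus 8 check bits), but the behavior is what matters: a single flipped bit is corrected, two flipped bits are detected as uncorrectable.

          /* Minimal SEC-DED sketch: a Hamming(7,4) code plus an overall parity
             bit, protecting one 4-bit value. One flipped bit is corrected, two
             flipped bits are flagged as an uncorrectable error. */
          #include <stdint.h>
          #include <stdio.h>

          static uint8_t parity8(uint8_t x)
          {
              x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
              return x & 1;
          }

          /* Encode data bits d3..d0 into a codeword laid out (LSB first) as
             [p1 p2 d0 p4 d1 d2 d3 p8], where p8 is the overall parity bit. */
          static uint8_t secded_encode(uint8_t d)
          {
              uint8_t d0 = d & 1, d1 = (d >> 1) & 1;
              uint8_t d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
              uint8_t p1 = d0 ^ d1 ^ d3;            /* covers codeword positions 3,5,7 */
              uint8_t p2 = d0 ^ d2 ^ d3;            /* covers codeword positions 3,6,7 */
              uint8_t p4 = d1 ^ d2 ^ d3;            /* covers codeword positions 5,6,7 */
              uint8_t cw = p1 | (p2 << 1) | (d0 << 2) | (p4 << 3)
                         | (d1 << 4) | (d2 << 5) | (d3 << 6);
              return cw | (parity8(cw) << 7);       /* overall parity enables DED */
          }

          /* Returns 0 = clean, 1 = single error corrected, 2 = double error detected. */
          static int secded_decode(uint8_t cw, uint8_t *out)
          {
              uint8_t b[8];
              for (int i = 0; i < 8; i++) b[i] = (cw >> i) & 1;
              uint8_t s1 = b[0] ^ b[2] ^ b[4] ^ b[6];
              uint8_t s2 = b[1] ^ b[2] ^ b[5] ^ b[6];
              uint8_t s4 = b[3] ^ b[4] ^ b[5] ^ b[6];
              uint8_t syndrome = s1 | (s2 << 1) | (s4 << 2);  /* 1..7 = flipped position */
              uint8_t overall  = parity8(cw);                 /* 0 if parity still holds */
              int status = 0;
              if (syndrome != 0 && overall != 0) {
                  cw ^= 1u << (syndrome - 1);       /* single flip: correct it in place */
                  status = 1;
              } else if (syndrome == 0 && overall != 0) {
                  status = 1;                       /* only the overall parity bit flipped */
              } else if (syndrome != 0 && overall == 0) {
                  status = 2;                       /* two flips: detectable, not correctable */
              }
              *out = ((cw >> 2) & 1) | (((cw >> 4) & 1) << 1)
                   | (((cw >> 5) & 1) << 2) | (((cw >> 6) & 1) << 3);
              return status;
          }

          int main(void)
          {
              uint8_t out, cw = secded_encode(0xB);
              printf("clean:   status %d, data 0x%X\n", secded_decode(cw, &out), out);
              printf("1 flip:  status %d, data 0x%X\n", secded_decode(cw ^ 0x10, &out), out);
              printf("2 flips: status %d\n", secded_decode(cw ^ 0x30, &out));
              return 0;
          }

        Memory scrubbing is then just a background loop that reads every word through the decoder and writes back the corrected value, so isolated single-bit errors don't get the chance to accumulate into double-bit errors.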

        • p_l 3 days ago

          The incident report on the 2008 case specifically mentions SEU for memory - a certain level of EDAC techniques is applied across the whole unit, but one area that was not covered was the possibility of a non-catastrophic (in terms of operation) failure of a memory module with a bit flip.

          An under-appreciated point is also that the devices in question used to be rebooted pretty often, which triggered self-test routines in addition to the run-time tests - something that didn't catch anything in the case of the A330 in 2008, but played a part in risk assessments missing certain things with the 787 some years later (and the newer A380/A350 more recently).

  • Reason077 4 days ago

    The recalled aircraft include the latest A320neo model, some of which are basically brand new. Why would they be using flight computers from before 2002? Why is an old report from 2008, relating to a completely different aircraft type (A330), relevant to the A320 issue today?

    • t0mas88 4 days ago

      > Why would they be using flight computers from before 2002?

      Because getting a new one certified is extremely expensive. And designing an aircraft with a new type certificate is unpopular with the airlines. Since pilots are locked into a single type at a time, a mixed fleet is less efficient.

      Having a pilot switch type is very expensive, in the 50-100k per pilot range. And it comes with operational restrictions: you can't pair a newly trained (on type) captain with a newly trained first officer, so you need to manage all of this.

      • Reason077 4 days ago

        I think you're confusing a type certificate (certifying the airworthiness of the aircraft type) with a type rating, which certifies the pilot is qualified to operate that type.

        Significant internal hardware changes might indeed require re-certification, but it generally wouldn't mean that pilots need to re-qualify or get a new type rating.

      • RealityVoid 4 days ago

        Since the new versions of the same ADIRU have EDAC, have been used on planes since 2002, and have been swapped in whenever an old unit was returned for repairs, I don't think this is the reason. I think the reason is that they had three ADIRUs, and even if one got wonky, the algorithm on the ELAC flight computer still had to make the correct decision. It did not make the correct decision. The ELAC is the one being updated in this case.

    • LiamPowell 4 days ago

      > Why would they be using flight computers from before 2002?

      Why would you assume they're not? I don't know about aircraft specifically, but there's plenty of hardware that uses components older than that. Microchip still makes 8051 clones 45 years after the 8051 was released.

      • K0balt 4 days ago

        That’s just wild to think about. We should all strive to build solutions that plague our descendants with their persistent utility.

        • hylaride 4 days ago

          From a pure safety point of view, it's easier to deal with older, but well-understood products, only updating them if it's an actual safety issue. The alternative is having to deal with many generations of tech, as well as permutations with other components, that could get infinitely complicated. On top of that, it's extremely time consuming and expensive to certify new components.

          There's a reason the airlines and manufacturers hem and haw about new models until the economics overwhelmingly make it worthwhile, and even then it can still be a shitshow. The MCAS issue is a case in point of how introducing new tech can cause unexpected issues (made worse by Boeing's internal culture).

          The 787 Dreamliner is also a good example of how hard it is. By all accounts it is a success, but it had some serious teething problems and there are still concerns about the long-term wear and tear of the composite materials (though a lot of its problems weren't necessarily from the application of new tech, but from Boeing's simultaneous desire to overcomplicate the manufacturing pipeline via outsourcing and spread-out manufacturing).

    • RealityVoid 4 days ago

      The linked report details why the spike happened in the first place on the ADIRU (produced by Northrop Grumman). The recalled controller is the ELAC, which comes from Thales. The problem chain was that, despite the ADIRU spiking, the ELAC should not have reacted the way it did. So they are fixing it in the ELAC.

    • 4ndrewl 4 days ago

      The neo is not brand new - it's an incremental update to the 320. neo refers to New Engine Option

      • rkomorn 4 days ago

        They wrote "some of which are basically brand new", which is technically correct.

        They didn't say the design was brand new.

    • p_l 3 days ago

      For the same reasons that Honeywell is building new devices with AMD 29050 CPUs today[1] - by sticking with the same CPU ISA, they can avoid recertifying a portion of the software stack.

      [1] Honeywell actually bought a full license (and more) from AMD and operates a fabless team that ensures they have stock and, if necessary, updates the chip.

    • Havoc 4 days ago

      > Why would they be using flight computers from before 2002?

      Guessing that using previously certified stuff is an advantage

    • RealityVoid 4 days ago

      Because the problem isn't just this. It's also that the flight computer did not properly decide what to do when the data spiked because of this issue.

  • Liftyee 4 days ago

    What does EDAC mean here? I wasn't able to find a definition. My guess is "error detection and correction"?

    Difference between it and ECC?

    • K0balt 4 days ago

      EDAC is the concept; ECC is a family of algorithmic solutions in service of that concept. Specific implementations of ECC are the engineering solutions that realize a particular form of ECC in specific devices, at the hardware or software level.

      It's confusing because EDAC and ECC seem to mean the same thing, but ECC is a term used primarily for memory integrity, whereas EDAC is a system-level concept.

    • RealityVoid 4 days ago

      That was my initial confusion as well. It means exactly what you guessed, "Error detection and correction". The term is also spelled out in the report. I asked Claude about it (caveat emptor) and it said EDAC is the correct name for the circuitry and implementation itself, whereas ECC is the algorithm. Gemini said that EDAC is the general technique and ECC is one implementation variant. So, at this point, I'm not sure. They are used interchangeably (maybe wrongly so), and in this case we're referring to essentially the same thing, with maybe some small differences in the details. In my professional life, I almost always referred to ECC. In the report, they only used EDAC. I thought I'd maintain consistency with the report, so I tried using EDAC as well.

      • Normal_gaussian 4 days ago

        Large portions of this comment provide zero to negative value. You've quoted two LLMs and couched it in "caveat emptor" and "so I'm not sure". The rest of your comment then muses over this data you do not trust, using generalities ("my profession" - are you a JS S/W eng? A chip design specialist at ARM? A security researcher?).

        All of the value of your comment comes from the first sentence and the last two.

      • dgacmu 4 days ago

        The more correct and general answer is that:

        - EDAC is a term that encompasses anything used to detect and correct errors. While this almost always involves redundancy of some sort, _how_ it is done is unspecified.

        - The term ECC used stand-alone refers specifically to adding redundancy to data in the form of an error-correcting code. But it is not a single algorithm - there are many ECC / FEC codes, from Hamming codes used on small chunks of data such as data stored in RAM, to block codes like Reed-Solomon more commonly used on file storage data.

        - The term ECC memory could really just mean "EDAC" memory, but in practice, error correcting codes are _the_ way you'd do this from a cost perspective, so it works out. I don't think most systems would do triple redundancy on just the RAM -- at that point you'd run an independent microcontroller with the RAM to get higher-level TMR.

  • asdefghyk 3 days ago

    Specifically, APPENDIX G: ELECTROMAGNETIC RADIATION of the above document describes in some detail how radiation (possibly from the sun) can cause flipped bits and other errors in electronic circuits. We are also at the PEAK of the 11-year sunspot cycle...

TehCorwiz 5 days ago
  • mlyle 4 days ago

    Yah, but that's a case of the package not being opaque enough.

  • russdill 4 days ago

    Completely unrelated and due to a design failure by the rpi folks.

    • hughw 4 days ago

      Is it really so unrelated? Isn't it a case where a similar phenomenon -- radiation impacting a computer calculation -- happened and it's one we can all relate to more easily, and reproduce if we cared to, than high altitude avionics? Not necessarily disputing but it just seems like a relatable case that helps me understand the issue better. If it's a radically different case somehow I'm interested to learn.

      • russdill 4 days ago

        No, because it's a completely different kind of radiation.

anonymousiam 4 days ago

Proper SEU mitigation goes far beyond ECC. Satellites fly higher than the A320, and they (at least the ones I know about) use Triple Modular Redundancy: https://en.wikipedia.org/wiki/Triple_modular_redundancy

https://en.wikipedia.org/wiki/Single-event_upset

For manned spaceflight, NASA ups N from 3 to 5.

Other mitigations include completely disabling all CPU caches (with a big performance hit) and continuously refreshing the ECC RAM in the background.

There are also a bunch of hardware mitigations to prevent "latch up" of the digital circuits.
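
As a toy illustration of the TMR idea (real flight and space hardware votes in dedicated logic, not like this), a bitwise 2-out-of-3 majority vote in C:

  /* Toy bitwise 2-out-of-3 majority vote: a bit flipped by an SEU in any one
     lane is outvoted by the other two. Real designs vote in dedicated logic
     (at flip-flop or bus level) and also report the disagreeing lane. */
  #include <stdint.h>
  #include <stdio.h>

  static uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
  {
      /* Each output bit is 1 iff at least two of the three inputs have it set. */
      return (a & b) | (a & c) | (b & c);
  }

  int main(void)
  {
      uint32_t good  = 0xDEADBEEF;
      uint32_t upset = good ^ (1u << 13);       /* one lane takes a bit flip */
      uint32_t voted = tmr_vote(good, upset, good);
      printf("voted value       = 0x%08X\n", (unsigned)voted);           /* 0xDEADBEEF */
      printf("lane B error bits = 0x%08X\n", (unsigned)(voted ^ upset)); /* for scrubbing */
      return 0;
  }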

  • rkagerer 4 days ago

    In redundant systems like these, how do you avoid the voting circuit becoming a single point of failure?

    Eg. I could understand if each subsystem had its own actuators and they were designed so any 3 could aerodynamically override the other 2, but I don't think that's how it works in practice.

    • simne 4 days ago

      > how do you avoid the voting circuit becoming a single point of failure

      They don't. They just make the voting circuit much more reliable than the computing blocks.

      As an example, the computing block could be CMOS, but the voting circuit made from discrete components, which are simply too large to be sensitive to particle strikes.

      Unfortunately, discrete components are sensitive to total exposure (more so than nm-scale transistors), because their larger area gathers more events and suffers from diffusion.

      Another example from the aviation world: many planes still have a mechanical connection from the control column to the control surfaces, because a mechanical connection is considered ideally reliable. Unfortunately, at least one catastrophe happened because one pilot blocked his column and the other could not overcome this block.

      BTW, weird fact: modern planes don't have a rod physically connected to the engine, because the engine has its own computer, which emulates the behavior of an old piston-engine carburetor. On Boeings the emulating thrust lever has an electronic actuator, so it is automatically placed in the position corresponding to the actual engine mode, but Airbus doesn't have such an actuator.

      What I want to say is that big planes especially (and planes overall) are a weird mix of very conservative inherited mechanisms and new technologies.

      • anonymousiam 4 days ago

        Electronics in high-radiation environments benefit from a large feature size with regard to SEU reduction, but you're correct that the larger parts degrade faster in such environments, so they've created "rad-hard" components to mitigate that issue.

        https://en.wikipedia.org/wiki/Radiation_hardening

        It's interesting to me that triple-voting wasn't as necessary on the older (rad-hard) processors. Every foundry in the world is steering toward CPUs with smaller and smaller feature sizes, because they are faster and consume less power, but the (very small) market for space-based processors wants large feature sizes. Because those aren't available anymore, TMR is the work-around.

        https://en.wikipedia.org/wiki/IBM_RAD6000

        https://en.wikipedia.org/wiki/RAD750

        Most modern space processing systems use a combination of rad-hard CPUs and TMR.

    • cpgxiii 4 days ago

      In some cases, it is exactly the case of multiple independent actuators, such that the "voting" is effectively performed by the physical mechanism of the control surface.

      In other cases all of the subsystems implement the comparison logic and "vote themselves out" if their outputs diverge from the others. A lot of aircraft control systems are structured more as primary/secondary/backup where there is a defined order of reversion in case of disagreement, rather than voting between equals.

      But, more generally, it is very hard to eliminate all possible single points of failure in complex control systems, and there are many cases of previously unknown failure points appearing years or decades into service. Any sort of multi-drop shared data bus is very vulnerable to common failures, and this is a big part of the switch to ethernet-derived switched avionics systems (e.g. AFDX) from older multi-drop serial busses.
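
      A rough sketch of the reversion idea, with invented names and structure rather than any real avionics interface: each lane publishes a command plus a self-assessed validity flag, and the consumer takes the highest-priority lane that still claims to be valid.

        /* Sketch of primary/secondary/backup reversion rather than symmetric
           voting: each lane publishes a command plus a self-assessed validity
           flag ("votes itself out"), and the consumer uses the highest-priority
           lane that still claims to be valid. */
        #include <stdbool.h>
        #include <stdio.h>

        struct lane {
            const char *name;
            bool        valid;    /* cleared when the lane fails its own monitoring */
            double      command;  /* e.g. surface deflection, degrees */
        };

        static const struct lane *select_lane(const struct lane *lanes, int n)
        {
            for (int i = 0; i < n; i++)      /* lanes[] is ordered by priority */
                if (lanes[i].valid)
                    return &lanes[i];
            return NULL;                     /* all lanes lost: revert further (e.g. direct law) */
        }

        int main(void)
        {
            struct lane lanes[] = {
                { "primary",   false, 2.1 }, /* primary has voted itself out */
                { "secondary", true,  2.0 },
                { "backup",    true,  1.9 },
            };
            const struct lane *sel = select_lane(lanes, 3);
            if (sel)
                printf("using %s lane, command %.1f\n", sel->name, sel->command);
            return 0;
        }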

    • AlphaSite 4 days ago

      Voting can be coordinated between the N CPUs rather than by an external arbiter (even making that arbiter redundant eventually requires the CPUs to decide what to do if they disagree, so you may as well handle it internally).

    • V__ 4 days ago

      Can't this be solved by having a "high refresh-rate"? Even if the voting circuit gets hit, if it updates 60 times a second it won't really affect any mechanical parts since the next signal will quickly override the error?

    • jasonwatkinspdx 4 days ago

      My understanding is you're roughly right: the actuators will have their own microcontroller. It receives commands from the, say, 3 flight computers, then decides locally how to respond if they mismatch. I.e., with 2 out of 3 matching it may continue as commanded, but with only 1 out of 3 it may shift into a fail-safe strategy for whatever that actuator is doing.
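
      Something along these lines, with made-up tolerances and fail-safe behavior, just to illustrate the shape of the logic:

        /* Sketch of actuator-local reconciliation of commands from three flight
           computers: if at least two agree within a tolerance, act on the agreed
           value; otherwise fall back to a fail-safe for this actuator. Tolerance
           and fail-safe value are invented for illustration. */
        #include <math.h>
        #include <stdbool.h>
        #include <stdio.h>

        #define AGREE_TOL    0.5   /* max spread between commands considered "matching" */
        #define FAILSAFE_CMD 0.0   /* e.g. hold the surface at neutral */

        /* cmd[i] is the command from flight computer i; ok[i] says whether that
           link delivered a fresh, valid frame this cycle. */
        static double actuator_command(const double cmd[3], const bool ok[3])
        {
            for (int i = 0; i < 3; i++)
                for (int j = i + 1; j < 3; j++)
                    if (ok[i] && ok[j] && fabs(cmd[i] - cmd[j]) <= AGREE_TOL)
                        return (cmd[i] + cmd[j]) / 2.0;   /* 2-of-3 agreement: proceed */
            return FAILSAFE_CMD;                          /* no quorum: fail safe */
        }

        int main(void)
        {
            double cmd[3] = { 1.8, 1.9, 12.0 };   /* one computer sends a wild spike */
            bool   ok[3]  = { true, true, true };
            printf("applied command: %.2f\n", actuator_command(cmd, ok));
            return 0;
        }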

    • exe34 4 days ago

      if the issue is radiation bit flipping, you could make that part overly shielded?

      • baq 4 days ago

        Define ‘overly’. You can submerge it in a sphere of water, but that’s going to be expensive to launch.

        • exe34 4 days ago

          I suspect a couple of millimeters of lead in the right place would do it. Cheaper to shield the voting mechanism than the whole thing.

  • aborsy 4 days ago

    TMR and co. are basically repetition codes: the simplest and most performant, but least efficient, form of ECC.

jayanmn 5 days ago

I am worried about a software fix for what looks like a hardware problem.

  • afavour 5 days ago

    It could be as simple as storing multiple copies of the relevant data and adding a checksum, something like that.

    A hardware fix is the ultimate solution, but it might be possible to paper over the problem with software.
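
    Purely as an illustration of that idea (not the actual fix, and with invented names throughout): keep a critical value in several copies with a checksum, and on read use the first copy that still verifies.

      /* Purely illustrative: keep a critical value in three copies, each with a
         trivial checksum, and on read return the first copy whose checksum still
         verifies, rewriting the others from it ("scrubbing"). */
      #include <stdint.h>
      #include <stdio.h>

      struct guarded {
          int32_t  value;
          uint32_t check;             /* here simply the bitwise complement */
      };

      static struct guarded copies[3];

      static void guarded_write(int32_t v)
      {
          for (int i = 0; i < 3; i++) {
              copies[i].value = v;
              copies[i].check = ~(uint32_t)v;
          }
      }

      /* Returns 0 on success; -1 only if every copy failed its check. */
      static int guarded_read(int32_t *out)
      {
          for (int i = 0; i < 3; i++) {
              if (copies[i].check == ~(uint32_t)copies[i].value) {
                  *out = copies[i].value;
                  guarded_write(*out);  /* repair any corrupted copies */
                  return 0;
              }
          }
          return -1;                    /* all copies bad: caller must handle the loss */
      }

      int main(void)
      {
          int32_t v;
          guarded_write(1234);
          copies[0].value ^= 1 << 7;    /* simulate a bit flip in one copy */
          if (guarded_read(&v) == 0)
              printf("recovered value: %d\n", v);
          return 0;
      }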

    • [removed] 4 days ago
      [deleted]
  • themerone 5 days ago

    Gracefully handling hardware faults is a software problem. The Air France Flight 447 crash was the result of bad software and bad hardware.

    • foldr 4 days ago

      Crashes caused by pilots failing to execute proper stall recovery procedures are surprisingly common, and similar accidents have happened before in aircraft with traditional control schemes, so I’m skeptical that there are any hardware changes that would have made much difference. The official report doesn’t identify the hardware or software as significant factors.

      The moment to avoid the accident was probably the very first moment when Bonin entered a steep climb when the plane was already at 35,000 feet, only 2000 feet below the maximum altitude for its configuration. This was already a sufficiently insane thing to do that the other less senior pilot should have taken control, had CRM been functioning effectively. What actually happened is that both of the pilots in the cockpit at the start of the incident failed to identify that the plane was stalled despite the fact that (i) several stall warnings had sounded and (ii) the plane had climbed above its maximum altitude (where it would inevitably either stall or overspeed) and was now descending. It’s never very satisfying to blame pilots, but this was a monumental fuck up.

      If the pilots genuinely disagree about control inputs there is not much that hardware or software can do to help. Even on aircraft with traditional mechanically linked control columns like the 737, the linkage will break if enough pressure is applied in opposite directions by each pilot (a protection against jamming).

      • julik 4 days ago

        True. I would say, however, that every "concept" of airliner flight deck has its own gimmicks that can kill. The Airbus "dual input" is such a gimmick. Even though there was, for example, an AF accident with a 777 where there was a hardware linkage between the yokes and the two pilots were fighting... each other. Physically.

    • f1shy 4 days ago

      And bad pilot training, if I recall correctly.

      • amelius 4 days ago

        I suppose because they were not instructed to work around the software and hardware flaws.

    • exidy 4 days ago

      Although the pitot tubes on AF447 were due to be replaced with a type more resistant to icing, there's no such thing as a 100% reliable pitot tube, and there were procedures to follow in the event of unreliable airspeed indication. Had they been followed, the accident would not have happened. Instead, the co-pilot held back on his stick until the aircraft fell out of the sky.

      I don't believe there was any issue identified with the software of the plane.

    • vel0city 5 days ago

      I'm reminded of the Apollo moon landing, where the computer was rapidly rebooting and was back in an OK-ish state to continue being useful almost immediately.

      • CrossVR 4 days ago

        It wasn't rebooting, it ran out of memory and started aborting lower-priority tasks. It was an excellent example of robust programming in the face of unexpected usage scenarios.

  • kachapopopow 5 days ago

    Software fixes are totally fine, since the chance of two redundant pairs failing within the time it takes to correct these errors has more zeros than there are atoms in the universe. (Each pilot has a redundant computer, and because there are two pilots there are two redundant pairs.)

  • willis936 4 days ago

    It's a system problem. The system is being fixed.

  • [removed] 5 days ago
    [deleted]
asdefghyk 3 days ago

RE "...The environment they’re operating in isn’t that different from everyone else’s...": No, this is incorrect. High-flying aircraft are more likely to suffer increased radiation caused by the peak of the 11-year sunspot cycle. Such aircraft should be using "radiation hardened electronics", somewhat like spacecraft use...