Comment by anonymousiam 4 days ago

Proper SEU mitigation goes far beyond ECC. Satellites fly higher than the A320, and they (at least the ones I know about) use Triple Modular Redundancy (TMR): https://en.wikipedia.org/wiki/Triple_modular_redundancy

https://en.wikipedia.org/wiki/Single-event_upset

For manned spaceflight, NASA ups N from 3 to 5.
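To make the N = 3 versus N = 5 point concrete, here is a minimal sketch of a majority voter over N replica outputs. The function names are made up for illustration and a real voter is hardware, not Python; this only shows the logic.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas.

    Raises ValueError when no single value is held by more than half.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replicas")
    return value

def bitwise_vote3(a, b, c):
    """Bitwise 2-of-3 majority of three integer words."""
    return (a & b) | (a & c) | (b & c)

# N = 3 masks one faulty replica; N = 5 masks two.
print(majority_vote([42, 42, 7]))
print(majority_vote([42, 7, 42, 42, 9]))
print(bin(bitwise_vote3(0b1011, 0b1001, 0b1101)))
```

The bitwise form is closer to how a hardware voter behaves: each output bit is decided independently, so the three words may each carry a flipped bit as long as the flips land in different bit positions.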

Other mitigations include completely disabling all CPU caches (with a big performance hit), and continuously refreshing (scrubbing) the ECC RAM in the background.
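A rough sketch of what background refresh of ECC memory does, using a toy Hamming(7,4) code as a stand-in. Real ECC RAM typically protects much wider words and does this in hardware; the helper names here are hypothetical. The idea is to walk memory, correct any single-bit error, and write the word back before a second flip can accumulate in the same word.

```python
def encode(nibble):
    """Encode 4 data bits as a Hamming(7,4) codeword (list indexed 1..7)."""
    code = [0] * 8  # index 0 is unused so indices match bit positions
    code[3], code[5], code[6], code[7] = ((nibble >> i) & 1 for i in range(4))
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code

def syndrome(code):
    """XOR of the positions of all set bits; 0 means the word looks clean."""
    s = 0
    for pos in range(1, 8):
        if code[pos]:
            s ^= pos
    return s

def decode(code):
    """Extract the 4 data bits from a codeword."""
    return code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3

def scrub(memory):
    """One scrub pass: fix single-bit errors in place, return the fix count."""
    corrected = 0
    for word in memory:
        s = syndrome(word)
        if s:
            word[s] ^= 1  # the syndrome is the position of the flipped bit
            corrected += 1
    return corrected

data = [0x3, 0xA, 0xF]
memory = [encode(n) for n in data]
memory[1][6] ^= 1  # simulate a single-event upset
fixed = scrub(memory)
print(fixed, [decode(w) for w in memory] == data)
```

This toy code can only correct one flipped bit per word, which is exactly why the scrub has to run continuously: two upsets in the same word between passes would be miscorrected.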

There are also a bunch of hardware mitigations to prevent "latch up" of the digital circuits.

rkagerer 4 days ago

In redundant systems like these, how do you avoid the voting circuit becoming a single point of failure?

Eg. I could understand if each subsystem had its own actuators and they were designed so any 3 could aerodynamically override the other 2, but I don't think that's how it works in practice.

  • simne 4 days ago

    > how do you avoid the voting circuit becoming a single point of failure

    They don't. They just make the voting circuit much more reliable than the computing blocks.

    For example, the computing block could be CMOS while the voting circuit is built from discrete components, which are simply too large to be sensitive to individual particles.

    Unfortunately, discrete components are more sensitive to overall exposure than nm-scale transistors, because their larger area collects more events and they suffer from diffusion.

    Another example, from the aviation world: many planes still have a mechanical connection from the steering wheel to the control surfaces, because a mechanical connection is considered ideally reliable. Unfortunately, at least one catastrophe happened because one pilot blocked his wheel and the other could not overcome the block.

    BTW, a weird fact: modern planes don't have a rod physically connected to the engine, because the engine has its own computer, which emulates the behavior of an old piston-engine carburetor. On Boeing the emulated stick has an electronic actuator, so it is automatically moved to the position corresponding to the actual engine mode, but Airbus doesn't have such an actuator.

    What I want to say is that planes, especially big ones, are a weird mix of very conservative inherited mechanisms and new technologies.

    • anonymousiam 4 days ago

      Electronics in high-radiation environments benefit from a large feature size with regard to SEU reduction, but you're correct that the larger parts degrade faster in such environments, so they've created "rad-hard" components to mitigate that issue.

      https://en.wikipedia.org/wiki/Radiation_hardening

      It's interesting to me that triple-voting wasn't as necessary on the older (rad-hard) processors. Every foundry in the world is steering toward CPUs with smaller and smaller feature sizes, because they are faster and consume less power, but the (very small) market for space-based processors wants large feature sizes. Because those aren't available anymore, TMR is the work-around.

      https://en.wikipedia.org/wiki/IBM_RAD6000

      https://en.wikipedia.org/wiki/RAD750

      Most modern space processing systems use a combination of rad-hard CPUs and TMR.

  • cpgxiii 4 days ago

    In some cases there really are multiple independent actuators, such that the "voting" is effectively performed by the physical mechanism of the control surface.

    In other cases all of the subsystems implement the comparison logic and "vote themselves out" if their outputs diverge from the others. A lot of aircraft control systems are structured more as primary/secondary/backup where there is a defined order of reversion in case of disagreement, rather than voting between equals.
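
    The "vote themselves out" idea can be sketched roughly as below. This is an illustration under made-up assumptions (numeric outputs, a fixed tolerance, hypothetical function names), not how any particular aircraft does it. Each channel compares its own output with its peers and drops out if it agrees with fewer than half of them, so there is no central voter.

```python
def stays_active(own, peers, tolerance):
    """A channel stays active if it agrees with at least half of its peers."""
    agreeing = sum(abs(own - p) <= tolerance for p in peers)
    return agreeing * 2 >= len(peers)

def active_channels(outputs, tolerance):
    """Indices of channels that do not vote themselves out."""
    return [
        i
        for i, own in enumerate(outputs)
        if stays_active(own, outputs[:i] + outputs[i + 1:], tolerance)
    ]

# Channel 2 diverges from the other two and removes itself.
print(active_channels([1.00, 1.01, 5.00], tolerance=0.05))
```

    A real system would presumably also require the disagreement to persist for some time before a channel drops out; that is not modeled here.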

    But, more generally, it is very hard to eliminate all possible single points of failure in complex control systems, and there are many cases of previously unknown failure points appearing years or decades into service. Any sort of multi-drop shared data bus is very vulnerable to common failures, and this is a big part of the switch to Ethernet-derived switched avionics systems (e.g. AFDX) from older multi-drop serial buses.

  • AlphaSite 4 days ago

    Voting can be coordinated among the N CPUs rather than by an external arbiter (even if you made the arbiter redundant, the CPUs would eventually have to decide what to do when they disagree, so you may as well handle it internally).

  • V__ 4 days ago

    Can't this be solved by having a "high refresh-rate"? Even if the voting circuit gets hit, if it updates 60 times a second it won't really affect any mechanical parts since the next signal will quickly override the error?

  • jasonwatkinspdx 4 days ago

    My understanding is you're roughly right: the actuators will have their own microcontroller. It receives commands from the, say, 3 flight computers, then decides locally how to respond if they mismatch. I.e. with 2 out of 3 matching it may continue as commanded, but with only 1 out of 3 it may shift into a fail-safe strategy for whatever that actuator is doing.

  • exe34 4 days ago

    if the issue is radiation bit flipping, you could make that part overly shielded?

    • baq 4 days ago

      Define ‘overly’. You can submerge it in a sphere of water, but that’s going to be expensive to launch.

      • exe34 4 days ago

        I suspect a couple millimeters of lead in the right place would do it. cheaper to shield the voting mechanism than the whole thing.

aborsy 4 days ago

TMR and co. are basically repetition codes: the simplest ECC that still performs well, and the least efficient.
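
To illustrate the repetition-code view, a minimal sketch (function names are just for illustration): every bit is stored n times and decoded by majority, so the code rate is 1/n. For comparison, Hamming(7,4) also corrects one error per block but at a rate of 4/7.

```python
def encode(bits, n=3):
    """Repeat every bit n times."""
    return [b for b in bits for _ in range(n)]

def decode(coded, n=3):
    """Majority-decode each group of n repeated bits (n should be odd)."""
    return [
        int(sum(coded[i:i + n]) * 2 > n)
        for i in range(0, len(coded), n)
    ]

message = [1, 0, 1, 1]
coded = encode(message)
coded[4] ^= 1  # one flipped copy inside the second group
print(decode(coded) == message)
print(len(message) / len(coded))  # code rate: 1/3 for n = 3
```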