Comment by anonymousiam 4 days ago

Proper SEU mitigation goes far beyond ECC. Satellites fly higher than the A320, and they (at least the ones I know about) use Triple Modular Redundancy: https://en.wikipedia.org/wiki/Triple_modular_redundancy

https://en.wikipedia.org/wiki/Single-event_upset

For manned spaceflight, NASA ups N from 3 to 5.
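
To make the N = 3 versus N = 5 point concrete, here is a minimal sketch of majority voting over N replica outputs. The function name `majority_vote` and the sample values are made up for illustration; real voters are implemented in hardware or in flight software with far stricter requirements.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas, or None.

    Hypothetical helper: `outputs` is a list with one result per replica.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority is required: 2 of 3, 3 of 5, and so on.
    return value if count > len(outputs) // 2 else None

tmr = majority_vote([42, 42, 17])            # one upset replica out of three
nmr5 = majority_vote([42, 17, 42, 99, 42])   # two upset replicas out of five
split = majority_vote([1, 2, 3])             # no majority, nothing to trust
```

With N = 3 the vote masks one bad replica; with N = 5 it masks two, which is the reason for raising N.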

Other mitigations include completely disabling all CPU caches (with a big performance hit), and continuously refreshing (scrubbing) the ECC RAM in the background.
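
The background refresh of ECC RAM is usually called memory scrubbing: walk through memory, let the ECC correct any single-bit error, and write the corrected word back before a second upset lands in the same word. The sketch below uses a toy Hamming(7,4) code as a stand-in; real ECC RAM uses wider codes in hardware, and `encode`, `scrub`, and `decode` are illustrative names only.

```python
def encode(nibble):
    """Encode 4 data bits as a Hamming(7,4) codeword (index 0 is unused)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8
    word[3], word[5], word[6], word[7] = d       # data bit positions
    word[1] = word[3] ^ word[5] ^ word[7]        # parity over positions 1,3,5,7
    word[2] = word[3] ^ word[6] ^ word[7]        # parity over positions 2,3,6,7
    word[4] = word[5] ^ word[6] ^ word[7]        # parity over positions 4,5,6,7
    return word

def syndrome(word):
    """XOR of the positions of all set bits; 0 means the word looks clean."""
    s = 0
    for pos in range(1, 8):
        if word[pos]:
            s ^= pos
    return s

def scrub(memory):
    """One scrub pass: fix single-bit errors in place, return how many were fixed."""
    fixed = 0
    for word in memory:
        s = syndrome(word)
        if s:                  # a non-zero syndrome is the flipped bit's position
            word[s] ^= 1
            fixed += 1
    return fixed

def decode(word):
    """Extract the 4 data bits from a codeword."""
    return word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)

memory = [encode(n) for n in (0x3, 0xA, 0xF)]
memory[1][6] ^= 1              # simulate a single-event upset in one word
corrected = scrub(memory)      # the background pass repairs it
values = [decode(w) for w in memory]
```

This toy code only corrects one flipped bit per word, which is why the scrub has to run continuously rather than once.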

There are also a bunch of hardware mitigations to prevent "latch up" of the digital circuits.

rkagerer 4 days ago

In redundant systems like these, how do you avoid the voting circuit becoming a single point of failure?

E.g., I could understand if each subsystem had its own actuators and they were designed so any 3 could aerodynamically override the other 2, but I don't think that's how it works in practice.

simne 4 days ago

> how do you avoid the voting circuit becoming a single point of failure
They don't. They just make the voting circuit much more reliable than the computing blocks.
For example, the computing blocks could be CMOS, while the voting circuit is built from discrete components, which are simply too large to be sensitive to individual particles.
Unfortunately, discrete components are more sensitive to total accumulated exposure than nm-scale transistors, because their larger area collects more events and they suffer from diffusion.
Another example from the aviation world: many planes still have a mechanical connection from the control wheel to the control surfaces, because a mechanical connection is considered ideally reliable. Unfortunately, at least one catastrophe happened because one pilot blocked his wheel and the other could not overcome the block.
BTW, a weird fact: modern planes don't have a rod physically connected to the engine, because the engine has its own computer, which emulates the behavior of an old piston-engine carburetor. On Boeing the emulated throttle lever has an electronic actuator, so it is automatically moved to the position corresponding to the actual engine mode, but Airbus doesn't have such an actuator.
What I want to say is that planes, especially big ones, are a weird mix of very conservative inherited mechanisms and new technologies.

- anonymousiam 4 days ago
  
  Electronics in high-radiation environments benefit from a large feature size with regard to SEU reduction, but you're correct that the larger parts degrade faster in such environments, so they've created "rad-hard" components to mitigate that issue.
  https://en.wikipedia.org/wiki/Radiation_hardening
  It's interesting to me that triple-voting wasn't as necessary on the older (rad-hard) processors. Every foundry in the world is steering toward CPUs with smaller and smaller feature sizes, because they are faster and consume less power, but the (very small) market for space-based processors wants large feature sizes. Because those aren't available anymore, TMR is the work-around.
  https://en.wikipedia.org/wiki/IBM_RAD6000
  https://en.wikipedia.org/wiki/RAD750
  Most modern space processing systems use a combination of rad-hard CPUs and TMR.
  
cpgxiii 4 days ago

In some cases, it is exactly the case of multiple independent actuators, such that the "voting" is effectively performed by the physical mechanism of the control surface.
In other cases all of the subsystems implement the comparison logic and "vote themselves out" if their outputs diverge from the others. A lot of aircraft control systems are structured more as primary/secondary/backup where there is a defined order of reversion in case of disagreement, rather than voting between equals.
But, more generally, it is very hard to eliminate all possible single points of failure in complex control systems, and there are many cases of previously unknown failure points appearing years or decades into service. Any sort of multi-drop shared data bus is very vulnerable to common failures, and this is a big part of the switch to Ethernet-derived switched avionics systems (e.g. AFDX) from older multi-drop serial buses.
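
As a rough illustration of channels that "vote themselves out" combined with a defined order of reversion, here is a minimal sketch. It assumes three channels producing a numeric command, a median comparison, and an arbitrary tolerance; the names `channels_to_withdraw`, `select_active`, and `REVERSION_ORDER` are hypothetical and not taken from any real avionics system.

```python
from statistics import median

# Assumed order of reversion: the first healthy channel in this list is in control.
REVERSION_ORDER = ["primary", "secondary", "backup"]

def channels_to_withdraw(outputs, tolerance):
    """Each channel compares itself to the median and withdraws if it diverges."""
    mid = median(outputs.values())
    return {name for name, value in outputs.items() if abs(value - mid) > tolerance}

def select_active(outputs, tolerance):
    """Pick the highest-priority channel that has not withdrawn, or None."""
    withdrawn = channels_to_withdraw(outputs, tolerance)
    for name in REVERSION_ORDER:
        if name in outputs and name not in withdrawn:
            return name
    return None

# The primary channel has diverged, so control reverts to the secondary.
outputs = {"primary": 9.7, "secondary": 2.1, "backup": 2.0}
withdrawn = channels_to_withdraw(outputs, tolerance=0.5)
active = select_active(outputs, tolerance=0.5)
```

Note there is no separate arbiter here: every channel can run the same comparison on the same shared data, which is exactly why a faulty shared bus remains a common failure point.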

AlphaSite 4 days ago

Voting can be coordinated among the N CPUs rather than by an external arbiter (even making the arbiter redundant eventually requires the CPUs to decide what to do if they disagree, so you may as well handle it internally).

V__ 4 days ago

Can't this be solved by having a "high refresh-rate"? Even if the voting circuit gets hit, if it updates 60 times a second it won't really affect any mechanical parts since the next signal will quickly override the error?

jasonwatkinspdx 4 days ago

My understanding is you're roughly right: the actuators will have their own microcontroller. It receives commands from the, say, 3 flight computers, then decides locally how to respond if they mismatch. I.e., for 2 out of 3 matching it may continue as commanded, but with only 1 out of 3 it may shift into a fail-safe strategy for whatever that actuator is doing.
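
A minimal sketch of that local decision, assuming exactly three flight computers and exact-match comparison of commands. `actuator_decide` and the fail-safe value are made up for illustration; a real actuator controller would also handle timeouts, stale data, and tolerance bands.

```python
from collections import Counter

def actuator_decide(commands, fail_safe):
    """Return (command, mode) for the commands received from three flight computers.

    Hypothetical policy: follow any command that at least 2 of 3 computers
    agree on, otherwise fall back to the actuator's fail-safe setting.
    """
    value, count = Counter(commands).most_common(1)[0]
    if count >= 2:
        return value, "commanded"
    return fail_safe, "fail-safe"

agree = actuator_decide([10, 10, 12], fail_safe=0)     # 2 of 3 match
disagree = actuator_decide([10, 11, 12], fail_safe=0)  # nobody matches
```

Here `agree` follows the majority command and `disagree` drops to the fail-safe value.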

exe34 4 days ago

If the issue is radiation bit flipping, you could make that part overly shielded?

- baq 4 days ago
  
  Define ‘overly’. You can submerge it in a sphere of water, but that’s going to be expensive to launch.
  
  exe34 4 days ago
  
  I suspect a couple millimeters of lead in the right place would do it. Cheaper to shield the voting mechanism than the whole thing.
  

aborsy 4 days ago

TMR and co. are basically repetition codes: the simplest, performant, but least efficient kind of ECC.
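
To see the repetition-code view directly, here is a small sketch that stores three copies of a word and decodes with a bitwise majority. The names `encode` and `decode` are illustrative. The code rate is 1/3, which is the "least efficient" part, while the decoder is just three ANDs and two ORs, which is the "simplest" part.

```python
def encode(word):
    """Rate-1/3 repetition code: store three identical copies of the word."""
    return word, word, word

def decode(a, b, c):
    """Bitwise majority: each output bit is whatever at least two copies agree on."""
    return (a & b) | (a & c) | (b & c)

original = 0b10110010
a, b, c = encode(original)
b ^= 0b00001000      # upset in bit 3 of the second copy
c ^= 0b01000000      # upset in bit 6 of the third copy
recovered = decode(a, b, c)
```

Because the vote is per bit, upsets in different copies are still corrected as long as they hit different bit positions; two flips of the same bit in two copies would win the vote and go undetected.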
