Comment by p_l 3 days ago

One thing to remember is that unless you were explicitly building a device to operate in space, and even then only beyond LEO, you quite probably would have seen no requirement for ECC memory for radiation reasons.

ECC memory usage in the past was heavily correlated with, well, much lower hardware quality (from chips to assembly), electromagnetic interference from unexpected sources, or even customer/field technician errors. Remember that an early-1980s single-user workstation might require an extensive check-and-fix cycle just from being moved around.

An aircraft component would eliminate all major parts of that through more thorough self-testing, careful sealed design, selection of high-grade parts, etc.

The possibility of space radiation causing considerable issues came up as fully digital fly-by-wire became more common in civilian usage, and over time it has led to retrofitting with EDAC, but radiation-triggered SEU had been deemed a low enough risk due to the design of the system.

addaon 3 days ago

> One thing to remember is that unless you were explicitly building a device to operate in space, and even then only beyond LEO, you quite probably would have seen no requirement for ECC memory for radiation reasons.

This does not match my experience (although, admittedly, I've been in the field only a couple decades -- the hardware under discussion predates that). The problem with SEU-induced bit flips is not that errors happen, but that errors with unbounded behavior happen -- consider a bit flip in the program counter, especially in an architecture with variable sized instructions. This drives requirements around error detection, not correction -- but the three main tools here are lockstep processor cores, parity on small memories, and SECDED on large memories. SECDED ECC here is important both because it can catch double errors that happen close together in time, and because memory scrubbing with single error correction allows multiple errors spaced in time to be considered separately. At the system level, the key insight is that detectable failures of a single ECU have to be handled anyway, because of non-transient statistical failures -- connector failures, tin whiskers, etc. The goal, then, is to convert substantially all failures to detectable failures, and then have defined failure behavior (often fail-silent). This leads to dual-dual redundancy architectures and similar, instead of triplex; each channel consists of two units that cross-check each other, and downstream units can assume that commands received from either channel are either correct or absent.
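
To make the SECDED and scrubbing point concrete, here is a minimal sketch using a toy Hamming(8,4) code: 4 data bits, 3 Hamming check bits, and 1 overall parity bit. The names `encode`, `decode`, and `scrub` are made up for illustration and are not taken from any avionics codebase; real memories typically do this in hardware on much wider words (for example 64 data bits plus 8 check bits).

```python
def encode(nibble: int) -> int:
    """Encode a 4-bit value into an 8-bit SECDED codeword (toy Hamming(8,4))."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8
    # Data bits live at the non-power-of-two positions 3, 5, 6, 7.
    bits[3], bits[5], bits[6], bits[7] = d
    # Hamming check bits at positions 1, 2, 4.
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    # Position 0 holds overall parity, making the total number of 1s even.
    bits[0] = sum(bits[1:]) & 1
    return sum(b << i for i, b in enumerate(bits))


def decode(word: int):
    """Return (value, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]
    # The syndrome is the XOR of the positions of all set bits (0 if valid).
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    odd_parity = sum(bits) & 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: treat as a single error and fix it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped. Detect only.
        return None, "uncorrectable"
    value = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return value, status


def scrub(memory: list) -> list:
    """Rewrite correctable words in place; return indices that must be flagged."""
    failed = []
    for i, word in enumerate(memory):
        value, status = decode(word)
        if status == "corrected":
            memory[i] = encode(value)
        elif status == "uncorrectable":
            failed.append(i)
    return failed


if __name__ == "__main__":
    memory = [encode(0xA)]
    memory[0] ^= 1 << 5      # first upset
    print(scrub(memory))     # scrub pass repairs it before the next upset
    memory[0] ^= 1 << 2      # second upset, later in time
    print(scrub(memory), decode(memory[0]))
```

With a scrub pass between the two upsets, each one is handled as an independent single-bit error and corrected. If both flips land in the same word before any scrub runs, `decode` reports `uncorrectable` instead of returning wrong data, which is the "convert failures to detectable failures" property described above.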


p_l 3 days ago

The incident report on the 2008 case specifically mentions SEU for memory: a certain level of EDAC is applied across the whole unit, but one area that was not covered was the possibility of a non-catastrophic (in terms of operation) failure of a memory module with a bit flip.
An under-appreciated point is that the devices in question used to be rebooted fairly often, which triggered self-test routines in addition to the run-time tests. That didn't trigger anything in the case of the A330 in 2008, but it did contribute to risk assessments missing certain things with the 787 some years later (and with the newer A380/A350 recently).
