Airbus A320 – intense solar radiation may corrupt data critical for flight

511 points by pyrophoenix 5 days ago

addaon 5 days ago

I’d really, really like to know what microcontroller family this was found on. Assuming that this is a safety processor (lockstep, ECC, etc) it suggests that ECC was insufficient for the level of bit flips they’re seeing — and if the concern is data corruption, not unintended restart, it means it’s enough flips in one word to be undetectable. The environment they’re operating in isn’t that different from everyone else, so unless they ate some margin elsewhere (bad voltage corner or something), this can definitely be relevant to others. Also would be interesting to know if it’s NVM or SRAM that’s effected.

Reply View 89 replies

RealityVoid 4 days ago

See my other comments in the other threads. This does not have EDAC. I was as surprised as you but it doesn't seems to be an MCU but a composition of several distinct chips. That flight computer was designed in the 90's and updated in 2002 with a new hw variant that does have edac. So yes, for this kind of thing, I can buy that a bit flip happened.
You can see much more data in the report:
https://www.atsb.gov.au/sites/default/files/media/3532398/ao...

Reply View | 42 replies
- lxgr 4 days ago
  
  > This does not have EDAC. I was as surprised as you but it doesn't seems to be an MCU but a composition of several distinct chips.
  Wasn't the philosophy back then to run multiple independent (and often even designed and manufactured by different teams) computers and run a quorum algorithm at a very high level?
  Maybe ECC was seen as redundant in that model?
  
  Reply View | 15 replies
  
  RealityVoid 4 days ago
  
  > Wasn't the philosophy back then to run multiple independent (and often even designed and manufactured by different teams) computers and run a quorum algorithm at a very high level?
  It was, and they did (well, same design, but they were independent). I quote from the report:
  "To provide redundancy, the ADIRS included three air data inertial reference units (ADIRU 1, ADIRU 2, and ADIRU 3). Each was of the same design, provided the same information, and operated independently of the other two"
  > Maybe ECC was seen as redundant in that model?
  I personally would not eschew any level of redundancy when it can improve safety, even in remote cases. It seems at the moment of the module's creation, EDAC was not required, and it probably was quite more expensive. The new variant apparently has EDAC. They retrofitted all units with the newer variants whenever one broke down. Overall, ECC is an extra layer of protection. The _presumably_ bit flip would be plausible to blame for data spikes. But even so, the data spikes should not have caused the controls issue. The controls issue is a separate problem, and it's highly likely THAT is what they are going to address, in another compute unit.
  "There was a limitation in the algorithm used by the A330/A340 flight control primary computers for processing angle of attack (AOA) data. This limitation meant that, in a very specific situation, multiple AOA spikes from only one of the three air data inertial reference units could result in a nose-down elevator command. [Significant safety issue]"
  This is most likely what they will address. The other reports confirm that the fix will be in the ELAC produced by Thales and the issue with the spikes detailed in the report was in an ADIRU module produced by Northrop Gruman.
  
  Reply View | 0 replies
  
  JorgeGT 4 days ago
  
  I don't know about the A320 but this was certainly the model for the Eurofighter. One of my university professors was in one of the teams, they were given the specs and not allowed to communicate with the other teams in any way during the hw and sw development.
  
  Reply View | 10 replies
  
  p_l 3 days ago
  
  One thing to remember is that unless you explicitly made a device to operate in space and even there beyond LEO, you quite probably might have seen no requirement for ECC memory for radiation reasons.
  ECC memory usage in the past was heavily correlated with, well, way lower quality of the hardware from chips to assembly, electromagnetic interference from unexpected sources, or even customer/field technician errors. Remember an early 1980s single user workstation might require extensive check & fix cycle just from moving it around.
  An aircraft component would eliminate all major parts of that, including both through more thorough self-testing, careful sealed design, selection of high grade parts, etc.
  The possibility of space radiation causing considerable issues came up as fully digital fly by wire became more common in civilian usage and has led over time to retrofitting with EDAC, but radiation-triggered SEU was deemed low enough risk due to design of the system.
  
  Reply View | 2 replies
- Reason077 4 days ago
  
  The recalled aircraft include the latest A320neo model, some of which are basically brand new. Why would they be using flight computers from before 2002? Why is an old report from 2008, relating to a completely different aircraft type (A330), relevant to the A320 issue today?
  
  Reply View | 16 replies
  
  t0mas88 4 days ago
  
  > Why would they be using flight computers from before 2002?
  Because getting a new one certified is extremely expensive. And designing an aircraft with a new type certificate is unpopular with the airlines. Since pilots are locked into a single type at a time, a mixed fleet is less efficient.
  Having a pilot switch type is very expensive, in the 50-100k per pilot range. And it comes with operational restrictions, you can't pair a newly trained (on type) captain with a newly trained first officer, so you need to manage all of this.
  
  Reply View | 6 replies
  
  LiamPowell 4 days ago
  
  > Why would they be using flight computers from before 2002?
  Why would you assume they're not? I don't know about aircraft specifically, but there's plenty of hardware that uses components older than that. Microchip still makes 8051 clones 45 years after the 8051 was released.
  
  Reply View | 2 replies
  
  RealityVoid 4 days ago
  
  The issue detailed in the linked report details why the spike happened in the first place on the ADIRU (produced by Northrop Gruman). The recalled controller is the ELAC that comes from Thales. The problem chain was that despite the ADIRU spiking up, the ELAC should not have taken the reactions it took. So they are fixing it in the ELAC.
  
  Reply View | 0 replies
  
  4ndrewl 4 days ago
  
  The neo is not brand new - it's an incremental update to the 320. neo refers to New Engine Option
  
  Reply View | 1 reply
  
  rkomorn 4 days ago
  
  They wrote "some of which are basically brand new", which is technically correct.
  They didn't say the design was brand new.
  
  Reply View | 0 replies
  
  p_l 3 days ago
  
  For the same reasons that Honeywell is building new devices with AMD 29050 CPUs today[1] - by sticking with the same CPU ISA, they can avoid recertifying portion of the software stack.
  [1] Honeywell actually bought full license et al from AMD and operates a fabless team that ensures they have stock and if necessary updates the chip.
  
  Reply View | 0 replies
  
  Havoc 4 days ago
  
  > Why would they be using flight computers from before 2002?
  Guessing that using previously certified stuff is an advantage
  
  Reply View | 0 replies
  
  RealityVoid 4 days ago
  
  Because the problem isn't just this. It's that the flight controller did not properly decide what to do when the data spiked because of this issue as well.
  
  Reply View | 0 replies
- Liftyee 4 days ago
  
  What does EDAC mean here? I wasn't able to find a definition. My guess is "error detection and correction"?
  Difference between it and ECC?
  
  Reply View | 7 replies
  
  K0balt 4 days ago
  
  EDAC is the concept, ECC is a family of algorithmic solutions in the service of the concept. Specific implementations of ECC are the engineering solution that implement the specific form of ECC in specific devices at the hardware or software level.
  It’s confusing because EDAC and ECC seem to mean the same thing, but ECC is a term primarily used in memory integrity, where EDAC is a system level concept.
  
  Reply View | 0 replies
  
  RealityVoid 4 days ago
  
  That was my initial confusion as well. It means exactly what you guessed, "Error detection and correction". The term is also spelled out in the report. I asked Claude about it (caveat emptor) and it said EDAC is the correct name for the circuitry and implementation itself whereas ECC is the algorithm. Gemini said that EDAC is the general technique and ECC is one implementation variant. So, at this point, I'm not sure. They are used interchangeably (maybe wrongly so), and in this case, we're referring to, essentially, the same thing, with maybe some small differences in the details. In my professional life, almost always I referred to ECC. In the report, they were only using EDAC. I thought I'd maintain consistency with the report so I tried using EDAC as well.
  
  Reply View | 5 replies
- asdefghyk 3 days ago
  
  Specifically of the above document , APPENDIX G: ELECTROMAGNETIC RADIATION mentions in detail how radiation (possible from the sun ) can cause flipped bits and other errors in electronic circuits in some detail ..... We are also at the PEAK of the 11 year sunspot...
  
  Reply View | 0 replies
TehCorwiz 5 days ago

An early revision of the Raspberry Pi 2 would crash if you hit it with a bright light like a camera flash. Specifically a xenon flash.
https://forums.raspberrypi.com/viewtopic.php?t=99167
https://forums.raspberrypi.com/viewtopic.php?f=28&t=99042
https://www.raspberrypi.com/news/xenon-death-flash-a-free-ph...
https://www.youtube.com/watch?v=wyptwlzRqaI

Reply View | 7 replies
- mlyle 4 days ago
  
  Yah, but that's a case of the package not being opaque enough.
  
  Reply View | 0 replies
- russdill 4 days ago
  
  Completely unrelated and due to a design failure by the rpi folks.
  
  Reply View | 5 replies
  
  hughw 4 days ago
  
  Is it really so unrelated? Isn't it a case where a similar phenomenon -- radiation impacting a computer calculation -- happened and it's one we can all relate to more easily, and reproduce if we cared to, than high altitude avionics? Not necessarily disputing but it just seems like a relatable case that helps me understand the issue better. If it's a radically different case somehow I'm interested to learn.
  
  Reply View | 4 replies
anonymousiam 4 days ago

proper SEU mitigation goes far beyond ECC. Satellites fly higher than the A320, and they (at least the ones I know about) use Triple Modular Redundancy: https://en.wikipedia.org/wiki/Triple_modular_redundancy
https://en.wikipedia.org/wiki/Single-event_upset
For manned spaceflight, NASA ups N from 3 to 5.
Other mitigations include completely disabling all CPU caches (with a big performance hit), and continuously refreshing the ECC RAM in background.
There are also a bunch of hardware mitigations to prevent "latch up" of the digital circuits.

Reply View | 11 replies
- rkagerer 4 days ago
  
  In redundant systems like these, how do you avoid the voting circuit becoming a single point of failure?
  Eg. I could understand if each subsystem had its own actuators and they were designed so any 3 could aerodynamically override the other 2, but I don't think that's how it works in practice.
  
  Reply View | 9 replies
  
  simne 4 days ago
  
  > how do you avoid the voting circuit becoming a single point of failure
  They do not. Just make voting circuit much more reliable than computing blocks.
  As example, computing block could be CMOS, but voting circuit made from discrete components, which are just too large to be sensitive to particles.
  Unfortunately, discrete components are sensitive to overall exposure (more than nm scale transistors), because large square gather more events and suffered by diffusion.
  Other example from aviation world - many planes still have mechanic connection of steering wheel to control surfaces, because mechanic connection considered ideally reliable. Unfortunately, at least one catastrophe happen because one pilot blocked his wheel and other cannot overcome this block.
  BTW weird fact, modern planes don't have rod physically connected to engine, because engine have it's own computer, which emulate behavior of old piston carburetor, and on Boeing emulating stick have electronic actuator, so it automatically placed in position, corresponding to actual engine mode, but Airbus don't have such actuator.
  I want to say - especially big planes (and planes overall), are weird mix of very conservative inherited mechanisms and new technologies.
  
  Reply View | 1 reply
  
  anonymousiam 4 days ago
  
  Electronics in high-radiation environments benefit from a large feature size with regard to SEU reduction, but you're correct that the larger parts degrade faster in such environments, so they've created "rad-hard" components to mitigate that issue.
  https://en.wikipedia.org/wiki/Radiation_hardening
  It's interesting to me that triple-voting wasn't as necessary on the older (rad-hard) processors. Every foundry in the world is steering toward CPUs with smaller and smaller feature sizes, because they are faster and consume less power, but the (very small) market for space-based processors wants large feature sizes. Because those aren't available anymore, TMR is the work-around.
  https://en.wikipedia.org/wiki/IBM_RAD6000
  https://en.wikipedia.org/wiki/RAD750
  Most modern space processing systems use a combination of rad-hard CPUs and TMR.
  
  Reply View | 0 replies
  
  cpgxiii 4 days ago
  
  In some cases, it is exactly the case of multiple independent actuators, such that the "voting" is effectively performed by the physical mechanism of the control surface.
  In other cases all of the subsystems implement the comparison logic and "vote themselves out" if their outputs diverge from the others. A lot of aircraft control systems are structured more as primary/secondary/backup where there is a defined order of reversion in case of disagreement, rather than voting between equals.
  But, more generally, it is very hard to eliminate all possible single points of failure in complex control systems, and there are many cases of previously unknown failure points appearing years or decades into service. Any sort of multi-drop shared data bus is very vulnerable to common failures, and this is a big part of the switch to ethernet-derived switched avionics systems (e.g. Afdx) from older multi-drop serial busses.
  
  Reply View | 0 replies
  
  AlphaSite 4 days ago
  
  Voting can be coordinated between the N cpus rather than an external arbiter (even making that redundant eventually required the CPUs to decide what to do if they disagree so may as well handle it internally).
  
  Reply View | 0 replies
  
  V__ 4 days ago
  
  Can't this be solved by having a "high refresh-rate"? Even if the voting circuit gets hit, if it updates 60 times a second it won't really affect any mechanical parts since the next signal will quickly override the error?
  
  Reply View | 0 replies
  
  jasonwatkinspdx 4 days ago
  
  My understanding is you're roughly right: the actuators will have their own microcontroller. It receives commands from the say 3 flight computers, then decides locally how respond if they mismatch. Ie for 2 out of 3 matching it may continue as commanded, but with only 1 out of 3 it may shift into a fail safe strategy for whatever that actuator is doing.
  
  Reply View | 0 replies
  
  exe34 4 days ago
  
  if the issue is radiation bit flipping, you could make that part overly shielded?
  
  Reply View | 2 replies
- aborsy 4 days ago
  
  TMR and co are basically repetition codes, simplest performant least efficient ECC.
  
  Reply View | 0 replies
jayanmn 5 days ago

I am worried about a software fix for what looks like hardware problem.

Reply View | 24 replies
- afavour 5 days ago
  
  It could be as simple as storing multiple copies of the relevant data and adding a checksum, something like that.
  Hardware fix is the ultimate solution but it might be possible to paper over with software.
  
  Reply View | 1 reply
  
  [removed] 4 days ago
  
  [deleted]
  
  Reply View | 0 replies
- themerone 5 days ago
  
  Gracefully handling hardware faults is a software problem. The Air France Flight 447 crash was the result of bad software and bad hardware.
  
  Reply View | 18 replies
  
  foldr 4 days ago
  
  Crashes caused by pilots failing to execute proper stall recovery procedures are surprisingly common, and similar accidents have happened before in aircraft with traditional control schemes, so I’m skeptical that there are any hardware changes that would have made much difference. The official report doesn’t identify the hardware or software as significant factors.
  The moment to avoid the accident was probably the very first moment when Bonin entered a steep climb when the plane was already at 35,000 feet, only 2000 feet below the maximum altitude for its configuration. This was already a sufficiently insane thing to do that the other less senior pilot should have taken control, had CRM been functioning effectively. What actually happened is that both of the pilots in the cockpit at the start of the incident failed to identify that the plane was stalled despite the fact that (i) several stall warnings had sounded and (ii) the plane had climbed above its maximum altitude (where it would inevitably either stall or overspeed) and was now descending. It’s never very satisfying to blame pilots, but this was a monumental fuck up.
  If the pilots genuinely disagree about control inputs there is not much that hardware or software can do to help. Even on aircraft with traditional mechanically linked control columns like the 737, the linkage will break if enough pressure is applied in opposite directions by each pilot (a protection against jamming).
  
  Reply View | 4 replies
  
  f1shy 4 days ago
  
  And bad pilot training, if I recall correctly.
  
  Reply View | 4 replies
  
  exidy 4 days ago
  
  Although the pitot tubes on AF447 were due to be replaced with a type more resistant to icing, nonetheless there's no such thing as a 100% reliable pitot tube and there were procedures to follow in the event of unreliable airspeed indication. Had they been followed the accident would not have happened. Instead the co-pilot held back on his stick until the aircraft fell out of the sky.
  I don't believe there was any issue identified with the software of the plane.
  
  Reply View | 0 replies
  
  vel0city 4 days ago
  
  I'm reminded of the Apollo moon landing where the computer was rapidly rebooting and being in an OK-ish state to continue to be useful almost immediately
  
  Reply View | 5 replies
  
  idkfasayer 4 days ago
  
  [dead]
  
  Reply View | 0 replies
- kachapopopow 5 days ago
  
  software fixes are totally fine since the chance of two redundant pairs failing within the time it takes to correct these errors is more zero's than there are atoms in the universe. (each pilot has a redundant computer and because there's two pilots there's two redundant pairs)
  
  Reply View | 0 replies
- willis936 4 days ago
  
  It's a system problem. The system is being fixed.
  
  Reply View | 0 replies
- [removed] 5 days ago
  
  [deleted]
  
  Reply View | 0 replies
asdefghyk 3 days ago

RE "...The environment they’re operating in isn’t that different from everyone else...." NO this is incorrect. High flying aircraft more likely to suffer increased radiation caused by 11 year peak sunspot cycle . such aircraft should be using "radiation hardened electronics" , somewhat like spacecraft use...

Reply View | 0 replies

jpollock 4 days ago

The design of the system is very interesting, particularly how it expects to handle errors.

In 90's Telco, you used to have a pair of systems and if they disagreed, they would decide which side was bad and disable it.

In modern cloud, you accept there are errors. There's another request in ~10+ms. You only look when the error rate becomes commercially important.

My understanding of spacecraft is that there would be 3 independent implementations and they would vote.

The plane has a matrix of sensors and systems, allowing faults to be bubbled up and bad elements disabled independently.

The ADIRU does compare values to detect failures (median of 3 sensors), but they could only detect errors that last >1s. The flight computer used the raw data - because the sensors aren't interchangeable (they won't have consistent readings in all flight modes)!

Very nifty.

One thing, they say "memorisation period", I don't think it's a memorisation period? From my reading of the algorithm, it should be more "last value retention period"? Or "sensor spurious fault reading delay"?

Section 2.1 A330/A340 flight control system design "AOA computation logic"

https://www.atsb.gov.au/sites/default/files/media/3532398/ao...

Reply View 6 replies

jpollock 4 days ago

For example....
"Preliminary A330/A340 FCPC algorithm"
"The algorithm did not effectively manage a specific situation where AOA 2 and AOA 3 on one side of the aircraft were temporarily incorrect and AOA 1 on the other side of the aircraft was correct, resulting in ADR 1 being rejected."
So, you've got a system where _two_ of the three sensors are bad, and you need to deal with it.

Reply View | 3 replies
- Loudergood 4 days ago
  
  I'm in awe of the fact that two sensors can be wrong AND agree with each other.
  
  Reply View | 2 replies
  
  Nextgrid 4 days ago
  
  Those being analog sensors measuring analog, physical things, they will never exactly agree with each other; so there's a plausibility window. As long as the fault causes the sensors to remain within said window they will be considered as valid.
  
  Reply View | 0 replies
  
  UltraSane 4 days ago
  
  It is just like having range of values considered to be equal for floating point numbers.
  
  Reply View | 0 replies
rubatuga 4 days ago

Space computers are generally in 3 with a hot spare

Reply View | 1 reply
- sllabres 4 days ago
  
  Space shuttle had five.
  Four of them operating in a redundant set and the fifth performing non critical task, as descripted in [1]. The fifth is also programmed by a different contractor in a different programming language: #1-4 running the Primary Avionics Software System (PASS) programmed by IBM in HAL/S and #5 programmed by a different team of Rockwell International in assembly. [2]
  [1] https://people.cs.rutgers.edu/~uli/cs673/papers/RedundancyMa...
  [2] https://ntrs.nasa.gov/api/citations/20110014946/downloads/20...
  
  Reply View | 0 replies

rene_d 4 days ago

The Aviation Herald has more technical details:

https://avherald.com/h?article=52f1ffc3&opt=0

Reply View 2 replies

loxodrome 4 days ago

Thanks for the link. This line in particular is concerning.
"This identified vulnerability could lead in the worst case scenario to an uncommanded elevator movement that may result in exceeding the aircraft structural capability."

Reply View | 1 reply
- isodev 4 days ago
  
  Well, I think in the grand scheme of things (including on the ground), the range of safety faults that can be triggered by a simple bitflip at the wrong moment range from inconvenient to absolute disaster. So in that sense, I'm very happy that Airbus has managed to identify opportunities to improve their design to be even more resilient.
  
  Reply View | 0 replies

nickdothutton 4 days ago

I’d just like to point out that if you are in the computing industry long enough, you will get to see a few such incidents under different circumstances, not only in industries like aerospace. Mostly things like ECC save your a*, sometimes your software will be able to recognise a temporary spurious reading and disregard it because you had enough alternative checking logic, or in the case of realtime and safety critical maybe even your systems can take a vote between them. Got caught out by (cpu cache line) bit flips in the 90s, months of pain trying to track it down. Some of your will know :-)

Reply View 5 replies

LadyCailin 4 days ago

We noticed this in our logs once! We service a huge amount of traffic, and as part of that, we log what is effectively an enum. We did a summarization of this field once, and noticed that there were a couple of “impossible” values being logged. One of my coworkers realized that the string that actually got logged was exactly one bit off from a valid string, and we came to the conclusion that we were probably seeing cosmic rays in action, either in our service, or in the logging service.

Reply View | 3 replies
- tuetuopay 4 days ago
  
  I had a similar story on my NAS that got one btrfs path corrupt. Plopped in on the btrfs IRC, one of the devs noticed the inconsistency was one bitflip away from the right value. Incredibly they were able to give me the right commands to fix it! Got to give credit where it is due, btrfs took the safe path and refused to touch the affected directory until fixed, and has enough tooling to fix this.
  I won’t blame cosmic rays but more likely dying RAM. The NAS now runs ECC memory.
  
  Reply View | 0 replies
- Philip-J-Fry 4 days ago
  
  I also saw a similar thing. I also naively pointed at "cosmic rays". It wasn't until someone found the actual bug that I realised how unlikely that was.
  The actual bug was unsafe code somewhere else in the application corrupting the memory. The application worked fine, but the log message strings were being slightly corrupted. Just a random letter here and there being something it shouldn't be.
  The question really should have been, if this was truly cosmic interference, why only this service and why was the problem appearing more than once over multiple versions of the application?
  Cosmic rays are a great excuse to problems you don't yet understand. But the reality of them is extremely rare and it's like 99% a memory corruption bug caused by application code.
  
  Reply View | 0 replies
- [removed] 4 days ago
  
  [deleted]
  
  Reply View | 0 replies
Theodores 4 days ago

Is that you, Julian?
I jest, but, once upon a time I worked with an infallible developer. When my projects crashed and burned, I would assume that it was my lack of competence and take that as my starting point. However, my colleague would assume that it was a stray neutrino that had flipped a bit to trigger the failure, even if it was a reproducible error.
He would then work backwards from 93 million miles away to blame the client, blame the linux kernel, blame the device drivers and finally, once all of that and the 'three letter agencies' were eliminated, perhaps consider the problem was between his keyboard and his chair.
In all fairness, he was a genius, and, regarding the A320 situation, he would have been spot on!

Reply View | 0 replies

qaq 5 days ago

Has BoFesc vibes "It's friday, so I get into work early, before lunch even. The phone rings. Shit!

I turn the page on the excuse sheet. "SOLAR FLARES" stares out at me. I'd better read up on that..."

Reply View 2 replies

suprjami 5 days ago

Solar Flares was always my favourite result on the BoFH Excuse Generator.
http://jefflane.org/bofh/bofh.pl

Reply View | 0 replies
ezconnect 4 days ago

Solar flares are the best excuse. We just have to wait it out.

Reply View | 0 replies

pyb 4 days ago

The aerospace industry has had countermeasures in place against bit-flips for a long time, oftentimes thanks to redudancy

Airbus/Thales's fix in this case appears to add more error checking, and to restart the misbehaving component. https://bea.aero/fileadmin/user_upload/BEA2024-0404-BEA2025-...

("une supervision interne du composant à l’origine de la défaillance ; - un mécanisme de redémarrage automatique de ce composant dès lors que la défaillance est détectée)

Reply View 1 reply

nolist_policy 4 days ago

The linked document is not related to this incident.

Reply View | 0 replies

supernova87a 4 days ago

I wonder how the incident was diagnosed? Does the FDR record low level errors that might've contributed to this? I thought that it only recorded certain input parameters and high-level flight metrics but I'm no expert.

If a radiation event caused some bit-flip, how would you realize that's what triggered an error? Or maybe the FDR does record when certain things go wrong? I'm thinking like, voting errors of the main flight computers?

Anyway, would be very interested to know!

Reply View 1 reply

yread 4 days ago

From a comment on avherald:
"Had the same problem with low power CMOS 3 transistor memory cells used in implantable defibrillators in the 1990s. Needed software detection and correction upgrade for implanted devices, and radiation hardening for new devices. Issue was confirmed to be caused by solar radiation by flying devices between Sydney and Buenos Aires over the south pole multiple times, accumulating a statistically significant different error rate to control sample in Sydney."

Reply View | 0 replies

oofbey 4 days ago

This is in response to JetBlue flight 1230 from Cancun to Newark on October 30, 2025, where a cosmic ray of some kind flipped a bit and caused a dangerous situation. At the time there was a minor (G1) geomagnetic storm - meaning more cosmic rays than normal. The Planetary K-index was at 5. These are somewhat elevated numbers - enough to produce a visible Aurora in Canada, but probably not even the northernmost US. But also this level of space weather is also very common. We hit G1 or higher about once a week. That's the really damning part. If it had happened in a G4 or G5 storm, then the engineers might have responded "we can't fix everything", but this level of reliability is clearly unacceptable.

Reply View 0 replies

65a 4 days ago

There's a great postmortem here about what might have been a similar SEU (single event upset--bitflip) here: https://www.atsb.gov.au/sites/default/files/media/3532398/ao...

Reply View 0 replies

raverbashing 4 days ago

Apparently the fix is reverting to a previous version of the SW (see https://avherald.com/h?article=52f1ffc3&opt=0 )

Curious what a sw change might have done in terms of resiliency. Maybe an incorrect memory setting or some code path that is not calculating things redundantly maybe?

Reply View 1 reply

rootusrootus 4 days ago

So it's not just Boeing that can screw up software on an airplane. I guess now I have to be a little afraid of all the airliners.

Reply View | 0 replies

minitoar 4 days ago

We flew too close to the sun

Reply View 0 replies

ChrisArchitect 5 days ago

More discussion: https://news.ycombinator.com/item?id=46082296

Reply View 0 replies

rossjudson 4 days ago

My armchair guess is that they had a new control pathway not properly participating in their integrity hand-off protocols, doing some kind of transformation outside of that protection.

I once saw some HW engineers go nuts trying to find out why a storage device had an error rate several orders of magnitude higher than the extremely low error rate they expected (and triggering data corruption errors). It turns out to be one extremely deep VHDL-based control area for an FPGA that didn't properly do integrity. You'd have to flip a bit at an incredibly precise point in time for error to occur, but that's what was happening. When all the math was said and done, that FPGA control path integrity miss exactly accounted for the the higher error rate.

Reply View 0 replies

joelthelion 4 days ago

Do they really need to ground the entire fleet for that? One incident for ten thousand planes in the air for years. I'd think that giving airlines two months to fix it would be sufficient.

Reply View 16 replies

kijin 4 days ago

I imagine it could help with Airbus marketing.
"We take proactive measures, whereas our competitor only takes action after multiple fatal crashes!"

Reply View | 5 replies
- probably_wrong 4 days ago
  
  I know someone who is stranded in another continent thanks to this. Trust me, all the understanding I could have as a technical user has been offset by the MASSIVE pain in the ass that is rebooking an international flight. And non-technical users have heard "the plane will not travel because it requires a software update", which does not inspire confidence.
  As far as I'm concerned it has not helped with their marketing.
  
  Reply View | 2 replies
  
  lxgr 3 days ago
  
  > "the plane will not travel because it requires a software update", which does not inspire confidence.
  It actually inspires a lot of confidence to people who can at least think economically, if not technically:
  Grounding thousands of planes is very expensive (passengers get cash for that in at least the EU, and sometimes more than the ticket cost!), so doing it both shows that it’s probably a serious issue and it’s being taken seriously.
  
  Reply View | 1 reply
  
  probably_wrong 3 days ago
  
  First, I feel the implication that "if you aren't reassured is only because you're dumb" is unwarranted.
  With that out of the way, being expensive does not preclude shoddy work. At the end of the day, the only difference between "they are so concerned about security that they are willing to lose millions[1]" and "their process must be so bad that they have no other choice but to lose millions before their death trap cost them ten times that" is how good your previous perception of their airplanes is.
  I think that, had this exact same issue happened to Boeing, we would be having a very different conversation. As the current top-comment suggests, it would probably be less "these things happen" and more "they cheapened out on the ECC".
  [1] Disclaimer: I have no idea who loses money in this scenario, if it's also Airbus or if it's exclusively the airlines who bought them.
  
  Reply View | 0 replies
- brabel 4 days ago
  
  Imagine an airplane crashed in these 2 months. I bet you would join the chorus and blame them for gross negligence.
  
  Reply View | 1 reply
  
  kijin 4 days ago
  
  There's a huge difference between "manufacturer recommended updates, but airline waited until the last week to apply them" and "manufacturer didn't even acknowledge the issue" in terms of who the chorus is going to blame.
  
  Reply View | 0 replies
mrpippy 4 days ago

I don’t believe it’s been years, only the latest firmware version for the ELAC is affected. The fix is to downgrade (or replace hardware with a unit running earlier firmware)

Reply View | 0 replies
jfoster 4 days ago

I wonder who eats the cost of this? I presume it's the airlines.
So the immediate cost to Airbus of grounding the fleet is quite low, whilst the downside of not grounding the fleet (risk of incident, lawsuits, reputation, etc.) could be substantial.

Reply View | 1 reply
- Havoc 4 days ago
  
  Yeah should be airlines
  It sounds like the fix is fairly quick so probably not as expensive as the max multi month groundings
  I doubt anyone is going to sue. Repairs etc are a part of life when owning aircraft. So as long as Airbus makes this happen fast and smooth they’re probably ok
  
  Reply View | 0 replies
miyuru 4 days ago

this is Airbus, not Boeing

Reply View | 0 replies
f1shy 4 days ago

I would personally not want to seat in those planes in those 2 months.

Reply View | 1 reply
- upcoming-sesame 4 days ago
  
  nothing worse than rushing a fix in production - only to find out the fix has caused more damage than the original bug
  
  Reply View | 0 replies
refulgentis 4 days ago

Yeah, because the alternative is knowing you might kill people due to a mundane engineering known issue.

Reply View | 0 replies
pyb 4 days ago

I get the feeling that they are doing this partly for marketing purposes.

Reply View | 0 replies
Bud 4 days ago

From their viewpoint, you have to think about what happens if, after they became aware of this vulnerability, there was then a crash because they weren't prompt and aggressive enough in addressing it. That's the kind of thing that ruins your entire company forever.

Reply View | 1 reply
- Esophagus4 4 days ago
  
  Yep - Boeing is still dealing with it years later.
  (As they should - I’m still very mad at them.)
  
  Reply View | 0 replies

owenthejumper 5 days ago

A friend works at Jetblue. They are scrambling hard to do the updates.

Reply View 0 replies

skx001 4 days ago

This video shows the the A320 computer and how the computer cooling system works

https://www.youtube.com/watch?v=HQuc_HhW6VA

Reply View 0 replies

jfoster 5 days ago

I've noticed that some carriers seem to be suggesting that there might be no impact to flights, but isn't this an immediate grounding for each aircraft until the update is made?

How is it possible that this wouldn't impact upon flight schedules?

Reply View 5 replies

icegreentea2 5 days ago

The grounding is for 6000 of 11000 A320 series. I believe it's some combination of software and hardware configuration that is at risk.

Reply View | 2 replies
- jfoster 4 days ago
  
  Thank you; that makes sense. I had the impression it was the entire fleet.
  
  Reply View | 1 reply
  
  julik 4 days ago
  
  It depends on whether the ELAC is an LRU (line-replaceable unit, i.e. a box with ports that can be swapped at an airport) and whether a software update can be uploaded into a unit that is installed (not all aircraft have a "firmware update via cable or floppy", so to speak)
  
  Reply View | 0 replies
arrel 5 days ago

N of 1, but I’m stuck in phoenix overnight because our flight was delayed an hour and a half by airbus maintenance and we missed our connection.

Reply View | 0 replies
simne 4 days ago

If possible for exact this plane, could make software update just as routine procedure.
But as I hear, air transporters could buy planes in different configurations, so for example, Emirates airlines, or Lufthansa always buy planes with all features included, but small Asian airlines could buy limited configuration (even without some safety indicators).
So for Emirates or Lufthansa, will need one empty flight to home airport, but for small airline will need to flight to some large maintenance base (or to factory base) and wait in queue there (you could find in internet images of Boeing factory base with lot of grounded 737-MAXes few years ago).
So for Emirates or Lufthansa will be minimal impact to flights (just like replacement of bus), but for small airlines things could be much worse.

Reply View | 0 replies

kappi 4 days ago

Following the Airbus A320 emergency airworthiness action, everyone will be talking about the ELAC (Elevator Aileron Computer) manufactured by Thales, which caused a sudden pitch-down without pilot input on JetBlue 1230 back in October.

So here’s everything you need to know about ELAC.

The ELAC System in the Airbus A320: The Brains Behind Pitch and Roll Control https://x.com/Turbinetraveler/status/1994498724513345637

Reply View 1 reply

ThePowerOfFuet 4 days ago

https://xcancel.com/Turbinetraveler/status/19944987245133456...

Reply View | 0 replies

1970-01-01 4 days ago

They said the same thing at Toyota when the unintended accel problem was in the news, but never found a real world example. There are a lot more old Toyotas still on the road than Airbuses in the air, so distance to the sun makes all the difference here? I wonder if they only see issues when flying near the north pole?

Reply View 0 replies

albert_e 4 days ago

What if future aircraft had "OTA" updates to software... using this as an example of avoidable downtime.

OTA updates to cars makes me feel uneasy -- not knowing what new bugs it might introduce.

Reply View 1 reply

p_l 3 days ago

Updates are distributed online, but applied by technician with a portable data loader device (a computer with special - partially standardized - interface cable).
The actual update in this case is about 15 minute work, the often seen "2h" claim is the entire cycle of powering the aircraft down for maintenance, update, verification, etc.

Reply View | 0 replies

nubinetwork 4 days ago

Why would a CME disrupt a single brand and model of aircraft, when the entire planet is covered in computers that almost never have bitflip issues when a CME rolls through every few months?

Reply View 2 replies

squarefoot 4 days ago

I would guess barely enough cable shielding paired with long enough paths along the aircraft so that the signals there would be more likely affected by EM induced currents.

Reply View | 0 replies
1970-01-01 4 days ago

I'm guessing EM shielding flaw or something electronic. See my comment on the Toyotas. It doesn't make sense from a raw probability perspective.

Reply View | 0 replies

jakub_g 4 days ago

From newspaper reporting on this, they are rolling back a software update. I wonder what was the original cause or the update? How often are flight computers software updated and why?

Reply View 1 reply

julik 4 days ago

This ELAC version is 100-something, and the A320 first flew around 1988. Why the updates - for example, there are updates to flight control law transitions, like after 1991 where the aircraft would limit flight control inputs during landing, thinking it would be preventing a stall - because it would not go into the flare law appropriately. See https://en.wikipedia.org/wiki/Iberia_Flight_1456
The cause could have also been an extra check introduced in one of the routines - which backfired in this particular failure scenario.

Reply View | 0 replies

op00to 5 days ago

Solar radiation like solar wind, or sunlight? They don’t say.

Reply View 18 replies

mr_toad 5 days ago

“Analysis of a recent event”
I presume they mean a Coronal Mass Ejection.

Reply View | 16 replies
- bparsons 5 days ago
  
  There was a very large CME ten days ago. The NOAA scale had predicted a high likelihood of disruptions, and had specifically suggested that spacecraft and high altitude aircraft could be impacted.
  https://www.swpc.noaa.gov/noaa-scales-explanation
  https://kauai.ccmc.gsfc.nasa.gov/CMEscoreboard/prediction/de...
  
  Reply View | 1 reply
  
  glaucon 5 days ago
  
  FWIW the "industry sources say" line on the incident is that it occurred on 30 October[1], so further back than ten days ago but of course there may have been other CME incidents at that time.
  The European Agency Aviation Safety Agency [2] instruction describes the characteristics of the incident but not the date.
  [1] https://www.theguardian.com/business/2025/nov/28/airbus-issu...
  [2] https://ad.easa.europa.eu/ad/2025-0268-E
  
  Reply View | 0 replies
- fwip 5 days ago
  
  I feel like the event was something that happened to a plane. That said, I wouldn't think sunlight would be penetrating to the chips running the plane.
  
  Reply View | 12 replies
  
  awesome_dude 5 days ago
  
  > The grounding of Airbus A320neo aircraft around the world can be traced back to an incident on a JetBlue flight operating a Cancun to New Jersey service on 30 October.
  > At least 15 passengers were injured and taken to the hospital after a sudden drop in altitude on the flight from Mexico was forced to make an emergency landing in Florida, US aviation officials said at the time.
  > The Thursday flight from Cancun was headed to Newark, New Jersey, when the altitude dropped, leading to the diversion to Tampa International Airport, the US Federal Aviation Administration said in a statement.
  > Pilots reported “a flight control issue” and described injuries including a possible “laceration in the head,” according to air traffic audio recorded by LiveATC.net.
  > Medical personnel met the passengers and crew on the ground at the airport. Between 15 and 20 people were taken to hospitals with non-life-threatening injuries, said Vivian Shedd, a spokesperson for Tampa Fire Rescue.
  > Pablo Rojas, a Miami-based attorney who specialises in aviation law, said a “flight control issue” indicated that the aircraft wasn't responding to the pilots.
  https://www.stuff.co.nz/travel/360903363/what-happened-fligh...
  
  Reply View | 9 replies
  
  dtagames 5 days ago
  
  Gamma rays penetrate everything and have definitely been known to disrupt computer circuits.
  
  Reply View | 1 reply
  
  fwip 4 days ago
  
  Yes, which is why the solar flare scenario makes more sense.
  
  Reply View | 0 replies
- nQQKTz7dm27oZ 4 days ago
  
  [dead]
  
  Reply View | 0 replies
tyingq 4 days ago

Would guess "cosmic rays". Sun Microsystems had a batch of UltraSparc servers that were very sensitive to it, and it was a big issue.
https://docs.oracle.com/cd/E19095-01/sf4810.srvr/816-5053-10...
https://en.wikipedia.org/wiki/Cosmic_ray

Reply View | 0 replies

[removed] 5 days ago

[deleted]

Reply View 0 replies

jMyles 5 days ago

This is one of the rare cases where, IMO, it makes sense to use a modified title as you've done here.

Reply View 1 reply

jb1991 5 days ago

[flagged]

Reply View | 0 replies

rishabhaiover 4 days ago

I hope Airbus only uses Honeywell or Collins in their newer planes.

Reply View 1 reply

p_l 3 days ago

It was Honeywell parts that failed in previous two events (2006 and 2008)
EDIT: properly credited, it was Thales Honeywell I guess :)

Reply View | 0 replies

asdefghyk 3 days ago

Intense solar radiation will be at a peak, since it is NOW the peak of the 11 year sunspot cycle.

Good related reading on this page .... https://en.wikipedia.org/wiki/Radiation_hardening ... includes a range of mitigation effects.

( I would be interested to find out how they actually test these systems. What combinations of hardware hardening and software logic. ALso do they actually subject to system to radiation as part of the testing )

https://en.wikipedia.org/wiki/Radiation_hardening

Reply View 1 reply

p_l 3 days ago

In early 1990s, when possibility of radiation caused upsets came to airliners, tests were conducted with various kinds of radiation including firing heavy ions from particle accelerators

Reply View | 0 replies

rvz 4 days ago

[flagged]

Reply View 0 replies

viiralvx 5 days ago

I was traveling during this entire ordeal. My flight got delayed by 7 hours. Insane day, just now boarding my flight. American Airlines was in shambles today.

Reply View 0 replies