Comment by nickdothutton 4 days ago

5 replies

I’d just like to point out that if you are in the computing industry long enough, you will get to see a few such incidents under different circumstances, and not only in industries like aerospace. Mostly things like ECC save your a*; sometimes your software will be able to recognise a temporary spurious reading and disregard it because you had enough alternative checking logic; or, in the case of real-time and safety-critical systems, maybe your systems can even take a vote between them. I got caught out by (CPU cache line) bit flips in the 90s: months of pain trying to track it down. Some of you will know :-)
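For anyone who hasn't seen the "take a vote" idea, here is a minimal sketch of majority voting across redundant channels. The function name and the readings are made up for illustration; real safety-critical voters are considerably more involved.

```python
from collections import Counter


def majority_vote(readings):
    """Return the value reported by a strict majority of channels, or None."""
    if not readings:
        return None
    value, count = Counter(readings).most_common(1)[0]
    # Require more than half of the channels to agree.
    return value if count * 2 > len(readings) else None


# Hypothetical readings from three redundant channels.
# The second channel has bit 12 flipped.
good = 0x1F40
print(majority_vote([good, good ^ (1 << 12), good]))  # 8000
```

If no value reaches a strict majority, the sketch returns `None`, so the caller can discard the sample instead of acting on it.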

LadyCailin 4 days ago

We noticed this in our logs once! We service a huge amount of traffic, and as part of that, we log what is effectively an enum. We did a summarization of this field once, and noticed that there were a couple of “impossible” values being logged. One of my coworkers realized that the string that actually got logged was exactly one bit off from a valid string, and we came to the conclusion that we were probably seeing cosmic rays in action, either in our service, or in the logging service.
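A rough sketch of how such a "one bit off" check might look, assuming the valid enum strings are known up front. The enum values and function names below are hypothetical, not the ones from our logs.

```python
from typing import List, Optional


def bit_distance(a: bytes, b: bytes) -> Optional[int]:
    """Count differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        return None  # a single bit flip never changes the length
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def single_bit_neighbours(logged: str, valid_values: List[str]) -> List[str]:
    """Return the valid values that differ from `logged` by exactly one bit."""
    raw = logged.encode("utf-8")
    return [v for v in valid_values
            if bit_distance(raw, v.encode("utf-8")) == 1]


# Hypothetical enum values for illustration only.
VALID = ["ACTIVE", "PAUSED", "CLOSED"]
print(single_bit_neighbours("ACTIWE", VALID))  # ['ACTIVE']
```

Here `V` (0x56) and `W` (0x57) differ only in the lowest bit, so the "impossible" value maps back to exactly one valid string. A match like this shows a single-bit corruption happened somewhere, not what caused it.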

  • tuetuopay 4 days ago

    I had a similar story on my NAS, where one btrfs path got corrupted. I dropped into the btrfs IRC channel, and one of the devs noticed the inconsistency was one bitflip away from the right value. Incredibly, they were able to give me the right commands to fix it! Got to give credit where it is due: btrfs took the safe path and refused to touch the affected directory until it was fixed, and it has enough tooling to fix this.

    I won’t blame cosmic rays; dying RAM is more likely. The NAS now runs ECC memory.

  • Philip-J-Fry 4 days ago

    I also saw a similar thing. I also naively pointed at "cosmic rays". It wasn't until someone found the actual bug that I realised how unlikely that was.

    The actual bug was unsafe code somewhere else in the application corrupting the memory. The application worked fine, but the log message strings were being slightly corrupted. Just a random letter here and there being something it shouldn't be.

    The question really should have been, if this was truly cosmic interference, why only this service and why was the problem appearing more than once over multiple versions of the application?

    Cosmic rays are a great excuse for problems you don't yet understand. But in reality they are extremely rare, and like 99% of the time it's a memory corruption bug caused by application code.

Theodores 4 days ago

Is that you, Julian?

I jest, but, once upon a time I worked with an infallible developer. When my projects crashed and burned, I would assume that it was my lack of competence and take that as my starting point. However, my colleague would assume that it was a stray neutrino that had flipped a bit to trigger the failure, even if it was a reproducible error.

He would then work backwards from 93 million miles away to blame the client, blame the Linux kernel, blame the device drivers and finally, once all of that and the 'three letter agencies' were eliminated, perhaps consider that the problem was between his keyboard and his chair.

In all fairness, he was a genius, and, regarding the A320 situation, he would have been spot on!