Comment by LadyCailin

Comment by LadyCailin 4 days ago

3 replies

We noticed this in our logs once! We service a huge amount of traffic, and as part of that, we log what is effectively an enum. We did a summarization of this field once, and noticed that there were a couple of “impossible” values being logged. One of my coworkers realized that the string that actually got logged was exactly one bit off from a valid string, and we came to the conclusion that we were probably seeing cosmic rays in action, either in our service, or in the logging service.

tuetuopay 4 days ago

I had a similar story on my NAS that got one btrfs path corrupt. Plopped in on the btrfs IRC, one of the devs noticed the inconsistency was one bitflip away from the right value. Incredibly they were able to give me the right commands to fix it! Got to give credit where it is due, btrfs took the safe path and refused to touch the affected directory until fixed, and has enough tooling to fix this.

I won’t blame cosmic rays but more likely dying RAM. The NAS now runs ECC memory.

Philip-J-Fry 4 days ago

I also saw a similar thing. I also naively pointed at "cosmic rays". It wasn't until someone found the actual bug that I realised how unlikely that was.

The actual bug was unsafe code somewhere else in the application corrupting the memory. The application worked fine, but the log message strings were being slightly corrupted. Just a random letter here and there being something it shouldn't be.

The question really should have been, if this was truly cosmic interference, why only this service and why was the problem appearing more than once over multiple versions of the application?

Cosmic rays are a great excuse to problems you don't yet understand. But the reality of them is extremely rare and it's like 99% a memory corruption bug caused by application code.