Comment by hedora 4 days ago


Single event upsets are already commonplace at sea level well below data center scale.

The section of the article that talks about them isn’t great. At least for FPGAs, the state of the art is to run 2-3 copies of the logic, and detect output discrepancies before they can create side effects.
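A rough sketch of that kind of majority-vote redundancy in software terms (the function and values here are illustrative, not from the article):

    from collections import Counter

    def vote(outputs):
        """Majority-vote the outputs of redundant copies of the same logic."""
        (value, count), = Counter(outputs).most_common(1)
        if count <= len(outputs) // 2:
            raise RuntimeError("no majority: likely upset, stop before side effects")
        return value

    # Three copies compute the same result; a flipped bit in one copy is
    # out-voted and masked before it can cause a side effect.
    print(vote([42, 42, 42]))  # 42
    print(vote([42, 43, 42]))  # 42 (discrepancy detected and masked)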

I guess you could build a GPU that way, but it’d have 1/3 the parallelism of a normal one for the same die size and power budget. The article says it’d be a 2-3 order of magnitude loss.

It’s still a terrible idea, of course.

sdenton4 3 days ago

It strikes me that neural network inference loads are probably pretty resilient to these kinds of problems (as we see the bits per activation steadily decreasing), and where they aren't, you can inject bit flips as augmentations at training time, where they essentially act as regularization.
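A rough sketch of what that training-time bit-flip augmentation might look like (the flip probability and array shapes are made-up assumptions):

    import numpy as np

    def flip_random_bits(x, p=1e-3, rng=None):
        """Flip each bit of a float32 array with probability p, mimicking
        single event upsets as a data augmentation during training."""
        rng = rng or np.random.default_rng()
        bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
        flipped = bits.copy()
        for b in range(32):
            mask = (rng.random(bits.shape) < p).astype(np.uint32) << np.uint32(b)
            flipped ^= mask
        return flipped.view(np.float32)

    activations = np.random.randn(4, 8).astype(np.float32)
    noisy = flip_random_bits(activations)  # inject upsets; acts like a regularizer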

ACCount37 3 days ago

If you're using GPUs, you're running AI workloads. In which case: do you care?

One of the funniest things about modern AI systems is just how many random bitflips they can tank before their performance begins to really suffer.

jeltz 3 days ago

Sounds like it would remove a lot of the benefits gained from more solar power.