chermi 14 hours ago

Do we actually know how they're degrading? Are there still Pascals out there? If not, is it because they actually broke or because of poor performance? I understand it's tempting to say near 100% workload for multiple years = fast degradation, but what are the actual stats? Are you talking specifically about the actual compute chip or the whole compute system -- I know there's a big difference now with the systems Nvidia is selling. How long do typical Intel/AMD CPU server chips last? My impression is a long time.

If we're talking about the whole compute system like a GB200, is there a particular component that breaks first? How hard are they to refurbish if that particular component breaks? I'm guessing they didn't have repairability in mind, but I also know these "chips" are much more than chips now, so there's probably some modularity if it's not the chip itself failing.

hxorr 12 hours ago

I watch a GPU repair guy and it's interesting to see the different failure modes...

* memory IC failure

* power delivery component failure

* dead core

* cracked BGA solder joints on core

* damaged PCB due to sag

These issues are compounded by:

* huge power consumption and heat output of core and memory, compared to system CPU/memory

* physical size of core leads to more potential for solder joint fracture due to thermal expansion/contraction

* everything needs to fit in PCIe card form factor

* memory and core are not socketed; if either fails (or supporting circuitry on the PCB fails), it's either an expensive repair or the card becomes scrap

* some vendors have cards with design flaws which lead to early failure

* sometimes poor application of thermal paste/pads at the factory (e.g., only half of the core making contact)

* and, in my experience acquiring 4-5 year old GPUs to build gaming PCs with (to sell), almost without fail the thermal paste has dried up and the card is thermal throttling (a quick way to check for this is sketched below)
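
A minimal sketch of how to check for that, assuming the pynvml (nvidia-ml-py) bindings and a recent NVIDIA driver -- run it while the card is under load; constant names can vary between pynvml versions, so treat it as illustrative rather than canonical:

```python
# Sketch: read the core temperature and the driver's throttle-reason bitmask on GPU 0.
# Assumes the nvidia-ml-py (pynvml) package; meaningful only while the card is loaded.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)  # bitmask of active reasons

thermal_bits = (pynvml.nvmlClocksThrottleReasonSwThermalSlowdown
                | pynvml.nvmlClocksThrottleReasonHwThermalSlowdown)

print(f"core temperature: {temp} C")
if reasons & thermal_bits:
    print("thermal throttling active -- repaste/repad before putting it in a build")
else:
    print("no thermal throttling reported under the current load")

pynvml.nvmlShutdown()
```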

  • oskarkk 8 hours ago

    These failures of consumer GPUs may not be applicable to datacenter GPUs: the datacenter ones are used differently, sit in a controlled environment, have completely different PCBs, cooling, and power delivery, and are designed for reliability under constant max load.

    • fennecbutt 5 hours ago

      Yeah, you're right, definitely not applicable at all, especially since Nvidia often supplies them tied into DGX units with cooling etc., i.e. a controlled environment.

      With a consumer GPU you have no idea whether it's been shoved into a hotbox of a case or not.

Workaccount2 12 hours ago

Believe it or not, the GPUs from bitcoin farms are often the most reliable.

Since they were run 24/7, they rarely saw the kind of thermal stress that kills cards (repeated heating and cooling cycles).
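
That intuition matches the usual solder-fatigue rule of thumb (a modified Coffin-Manson relation): cycles to failure fall off as a power of the temperature swing per cycle, so a card held near a constant temperature accumulates far less joint fatigue than one repeatedly cycled between cool and hot. A rough sketch of the relation, where C and n are empirical constants for the alloy and joint geometry (n is often quoted around 2):

```latex
% Modified Coffin-Manson rule of thumb for thermal-cycling fatigue of solder joints.
% N_f: cycles to failure, \Delta T: temperature swing per cycle,
% C, n: empirical constants (alloy- and geometry-dependent; n ~ 2 is a common ballpark).
N_f \approx C \, (\Delta T)^{-n}
```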

  • buu700 8 hours ago

    Could AI providers follow the same strategy? Just throw any spare inference capacity at something to make sure the GPUs are running 24/7, whether that's model training, crypto mining, protein folding, a "spot market" for non-time-sensitive/async inference workloads, or something else entirely.
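
As an illustration of that idea, here is a minimal sketch of an idle-capacity filler, assuming a hypothetical serving loop where latency-sensitive requests always win and background work (training steps, folding work units, batch inference) soaks up the rest; all function names are made-up placeholders, not any provider's real API:

```python
# Sketch: keep the GPU busy 24/7 by filling idle time with low-priority work.
# run_interactive() and run_background_step() are hypothetical stand-ins for real
# inference/training calls; the point is only the priority ordering.
import asyncio

async def run_interactive(req):
    await asyncio.sleep(0.05)            # stand-in for a latency-sensitive inference call
    print(f"served interactive request {req}")

async def run_background_step():
    await asyncio.sleep(0.2)             # stand-in for one chunk of batch/background work
    print("ran one background step")

async def gpu_worker(interactive_q: asyncio.Queue):
    while True:
        if not interactive_q.empty():
            await run_interactive(interactive_q.get_nowait())  # user-facing work first
        else:
            await run_background_step()                        # otherwise stay at full load

async def main():
    interactive_q = asyncio.Queue()
    worker = asyncio.create_task(gpu_worker(interactive_q))
    for i in range(3):                   # simulate a few user requests trickling in
        await asyncio.sleep(0.3)
        interactive_q.put_nowait(i)
    await asyncio.sleep(1.0)
    worker.cancel()

asyncio.run(main())
```

The only design point is the priority check: the card never idles, and an interactive request waits at most for one in-flight background step to finish.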