chermi 14 hours ago

Do we actually know how they're degrading? Are there still Pascals out there? If not, is it because they actually broke or because of poor performance? I understand it's tempting to say near 100% workload for multiple years = fast degradation, but what are the actual stats? Are you talking specifically about the actual compute chip or the whole compute system -- I know there's a big difference now with the systems Nvidia is selling. How long do typical Intel/AMD CPU server chips last? My impression is a long time.

If we're talking about the whole compute system like a GB200, is there a particular component that breaks first? How hard are they to refurbish if that particular component breaks? I'm guessing they didn't have repairability in mind, but I also know these "chips" are much more than chips now, so there's probably some modularity if it's not the chip itself failing.

hxorr 12 hours ago

I watch a GPU repair guy and it's interesting to see the different failure modes...

* memory IC failure

* power delivery component failure

* dead core

* cracked BGA solder joints on core

* damaged PCB due to sag

These issues are compounded by:

* huge power consumption and heat output of core and memory, compared to system CPU/memory

* physical size of core leads to more potential for solder joint fracture due to thermal expansion/contraction

* everything needs to fit in PCIe card form factor

* memory and core are not socketed; if either fails (or supporting circuitry on the PCB fails), it's either an expensive repair or the card becomes scrap

* some vendors have cards with design flaws which lead to early failure

* sometimes poor application of thermal paste/pads at the factory (e.g., only half of the core making contact)

* and, in my experience acquiring 4-5 year old GPUs to build gaming PCs with (to sell), almost without fail the thermal paste has dried up and the card is thermal throttling (a quick way to check for this is sketched below)
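
A minimal sketch of how to check for that, assuming the pynvml (nvidia-ml-py) bindings and a recent NVIDIA driver -- run it while the card is under load; constant names can vary between pynvml versions, so treat it as illustrative rather than canonical:

```python
# Sketch: read the core temperature and the driver's throttle-reason bitmask on GPU 0.
# Assumes the nvidia-ml-py (pynvml) package; meaningful only while the card is loaded.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)  # bitmask of active reasons

thermal_bits = (pynvml.nvmlClocksThrottleReasonSwThermalSlowdown
                | pynvml.nvmlClocksThrottleReasonHwThermalSlowdown)

print(f"core temperature: {temp} C")
if reasons & thermal_bits:
    print("thermal throttling active -- repaste/repad before putting it in a build")
else:
    print("no thermal throttling reported under the current load")

pynvml.nvmlShutdown()
```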

  • oskarkk 8 hours ago

    These failures of consumer GPUs may not be applicable to datacenter GPUs: the datacenter ones are used differently, sit in a controlled environment, have completely different PCBs, cooling, and power delivery, and are designed for reliability under constant max load.

    • fennecbutt 5 hours ago

      Yeah, you're right, definitely not applicable at all, especially since Nvidia often supplies them tied into DGX units with cooling etc., i.e. a controlled environment.

      With a consumer GPU you have no idea whether it's been shoved into a hotbox of a case or not.

Workaccount2 12 hours ago

Believe it or not, the GPUs from bitcoin farms are often the most reliable.

Since they were run 24/7, they rarely saw the kind of thermal stress that kills cards (repeated heating and cooling cycles).
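
That intuition matches the usual solder-fatigue rule of thumb (a modified Coffin-Manson relation): cycles to failure fall off as a power of the temperature swing per cycle, so a card held near a constant temperature accumulates far less joint fatigue than one repeatedly cycled between cool and hot. A rough sketch of the relation, where C and n are empirical constants for the alloy and joint geometry (n is often quoted around 2):

```latex
% Modified Coffin-Manson rule of thumb for thermal-cycling fatigue of solder joints.
% N_f: cycles to failure, \Delta T: temperature swing per cycle,
% C, n: empirical constants (alloy- and geometry-dependent; n ~ 2 is a common ballpark).
N_f \approx C \, (\Delta T)^{-n}
```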

  • buu700 8 hours ago

    Could AI providers follow the same strategy? Just throw any spare inference capacity at something to make sure the GPUs are running 24/7, whether that's model training, crypto mining, protein folding, a "spot market" for non-time-sensitive/async inference workloads, or something else entirely.
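
As an illustration of that idea, here is a minimal sketch of an idle-capacity filler, assuming a hypothetical serving loop where latency-sensitive requests always win and background work (training steps, folding work units, batch inference) soaks up the rest; all function names are made-up placeholders, not any provider's real API:

```python
# Sketch: keep the GPU busy 24/7 by filling idle time with low-priority work.
# run_interactive() and run_background_step() are hypothetical stand-ins for real
# inference/training calls; the point is only the priority ordering.
import asyncio

async def run_interactive(req):
    await asyncio.sleep(0.05)            # stand-in for a latency-sensitive inference call
    print(f"served interactive request {req}")

async def run_background_step():
    await asyncio.sleep(0.2)             # stand-in for one chunk of batch/background work
    print("ran one background step")

async def gpu_worker(interactive_q: asyncio.Queue):
    while True:
        if not interactive_q.empty():
            await run_interactive(interactive_q.get_nowait())  # user-facing work first
        else:
            await run_background_step()                        # otherwise stay at full load

async def main():
    interactive_q = asyncio.Queue()
    worker = asyncio.create_task(gpu_worker(interactive_q))
    for i in range(3):                   # simulate a few user requests trickling in
        await asyncio.sleep(0.3)
        interactive_q.put_nowait(i)
    await asyncio.sleep(1.0)
    worker.cancel()

asyncio.run(main())
```

The only design point is the priority check: the card never idles, and an interactive request waits at most for one in-flight background step to finish.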