Comment by hxorr
I watch a GPU repair guy, and it's interesting to see the different failure modes...
* memory IC failure
* power delivery component failure
* dead core
* cracked BGA solder joints on core
* damaged PCB due to sag
These issues are compounded by
* huge power consumption and heat output of core and memory, compared to system CPU/memory
* the large physical size of the core creates more potential for solder joint fracture from thermal expansion/contraction
* everything needs to fit in PCIe card form factor
* memory and core are not socketed, so if either fails (or supporting circuitry on the PCB fails), you face either an expensive repair or the card becomes scrap
* some vendors have cards with design flaws which lead to early failure
* sometimes poor application of thermal paste/pads at the factory (e.g., only half of the core making contact)
* and, in my experience acquiring 4-5 year old GPUs to build gaming PCs with (to sell), almost without fail the thermal paste has dried out and the card is thermal throttling
These consumer GPU failure modes may not apply to datacenter GPUs, since datacenter cards are used differently: they run in a controlled environment, have completely different PCBs, cooling, and power delivery, and are designed for reliability under constant maximum load.