Comment by hxorr
I watch a GPU repair guy, and it's interesting to see the different failure modes...
* memory IC failure
* power delivery component failure
* dead core
* cracked BGA solder joints on core
* damaged PCB due to sag
These issues are compounded by
* huge power consumption and heat output of core and memory, compared to system CPU/memory
* the large physical size of the core creates more potential for solder joint fracture from thermal expansion/contraction
* everything needs to fit in PCIe card form factor
* memory and core are not socketed, so if either fails (or supporting circuitry on the PCB fails), you face either an expensive repair or the card becomes scrap
* some vendors have cards with design flaws which lead to early failure
* sometimes poor application of thermal paste/pads at the factory (e.g., only half of the core making contact)
* and, in my experience acquiring 4-5 year old GPUs to build gaming PCs with (to sell), almost without fail the thermal paste has dried out and the card is thermal throttling
These consumer GPU failure modes may not apply to datacenter GPUs, since datacenter cards are used differently: they run in a controlled environment, have completely different PCBs, cooling, and power delivery, and are designed for reliability under constant maximum load.