Comment by chermi
Do we actually know how they're degrading? Are there still Pascals out there? If not, is it because they actually broke or because of poor performance? I understand it's tempting to say near-100% workload for multiple years = fast degradation, but what are the actual stats? Are you talking specifically about the actual compute chip or the whole compute system? I know there's a big difference now with the systems Nvidia is selling. How long do typical Intel/AMD CPU server chips last? My impression is a long time.
If we're talking about the whole compute system, like a GB200, is there a particular component that breaks first? How hard are they to refurbish if that component breaks? I'm guessing they didn't have repairability in mind, but I also know these "chips" are much more than chips now, so there's probably some modularity if it's not the chip itself failing.
I watch a GPU repair guy and it's interesting to see the different failure modes...
* memory IC failure
* power delivery component failure
* dead core
* cracked BGA solder joints on core
* damaged PCB due to sag
These issues are compounded by:
* huge power consumption and heat output of core and memory, compared to system CPU/memory
* physical size of core leads to more potential for solder joint fracture due to thermal expansion/contraction (rough numbers in the first sketch after this list)
* everything needs to fit in the PCIe card form factor
* memory and core are not socketed; if one fails (or supporting circuitry on the PCB fails), it's either an expensive repair or the card becomes scrap
* some vendors have cards with design flaws which lead to early failure
* sometimes poor application of thermal paste/pads at the factory (e.g., only half of the core making contact)
* and, in my experience acquiring 4-5 year old GPUs to build gaming PCs with (to sell), almost without fail the thermal paste has dried up and the card is thermal throttling (a quick way to check is sketched below)
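
To put rough numbers on the thermal expansion point: here's a back-of-envelope sketch. The CTE figures are typical textbook values (silicon ~2.6 ppm/K, FR4 around 14 ppm/K), and the die span and idle-to-load temperature swing are illustrative assumptions, not measurements from any particular card.

```python
# Back-of-envelope differential expansion between a silicon die and
# the PCB it's soldered to. All inputs are assumed/typical values,
# not measured from a specific GPU.
die_span_mm = 30    # assumed span of a large GPU die/package
cte_si = 2.6e-6     # silicon CTE, per K (typical)
cte_fr4 = 14e-6     # FR4 PCB CTE, per K (typical)
delta_t = 50        # assumed idle-to-load temperature swing, K

shear_um = die_span_mm * 1000 * (cte_fr4 - cte_si) * delta_t
print(f"~{shear_um:.0f} um of differential movement across the die")  # ~17 um
```

A few tens of microns doesn't sound like much, but the BGA balls have to absorb that shear on every heat-up/cool-down cycle, which is how the joints eventually crack.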
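
And since I mentioned thermal throttling: here's a minimal sketch of the check I run on a used card, assuming nvidia-smi is on the PATH and a single GPU. The query fields are standard nvidia-smi properties; the 85 C "repaste it" threshold is just my rule of thumb, not an NVIDIA spec.

```python
# Minimal throttle check for a used card: poll nvidia-smi while the GPU
# is under load (run a game/benchmark in parallel). Assumes one GPU.
import subprocess

FIELDS = [
    "temperature.gpu",
    "clocks_throttle_reasons.sw_thermal_slowdown",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks.sm",
    "clocks.max.sm",
]

def sample():
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        text=True,
    )
    # One CSV line per GPU; take the first.
    temp, sw, hw, sm, sm_max = [v.strip() for v in out.splitlines()[0].split(",")]
    return int(temp), sw, hw, int(sm), int(sm_max)

if __name__ == "__main__":
    temp, sw, hw, sm, sm_max = sample()
    throttling = "Active" in (sw, hw)  # nvidia-smi reports Active / Not Active
    print(f"{temp} C, SM clock {sm}/{sm_max} MHz, thermal throttle: {throttling}")
    if throttling or temp >= 85:       # 85 C threshold is my own rule of thumb
        print("Likely needs a repaste before resale.")
```

If the SM clock sits well below its max while the thermal slowdown reasons read Active, a repaste/pad swap is usually the first thing to try.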