Comment by layoric 16 hours ago


> Unlike railroads and fibre, all the best compute in 2025 will be lacklustre in 2027.

I definitely don't think compute is anything like railroads and fibre, but I'm not so sure compute will continue its efficiency gains of the past. Power consumption for these chips is climbing fast, a lot of recent gains come from better hardware support for 8-bit/4-bit precision, and I believe yields are getting harder to achieve as feature sizes shrink.

Betting against compute getting better/cheaper/faster is probably a bad idea, but I think fundamental improvements will come a lot more slowly over the next decade as shrinking gets much harder.

palmotea 15 hours ago

>> Unlike railroads and fibre, all the best compute in 2025 will be lacklustre in 2027.

> I definitely don't think compute is anything like railroads and fibre, but I'm not so sure compute will continue its efficiency gains of the past. Power consumption for these chips is climbing fast, a lot of recent gains come from better hardware support for 8-bit/4-bit precision, and I believe yields are getting harder to achieve as feature sizes shrink.

I'm no expert, but my understanding is that as feature sizes shrink, semiconductors become more prone to failure over time. Those GPUs probably aren't going to all fry themselves in two years, but even if GPUs stagnate, chip longevity may limit the medium/long-term value of the (massive) investment.

spiderice 15 hours ago

Unfortunately changing 2027 to 2030 doesn't make the math much better

skywhopper 15 hours ago

Unfortunately the chips themselves probably won’t physically last much longer than that under the workloads they are being put to. So, yes, they won’t be totally obsolete as technology in 2028, but they may still have to be replaced.

  • munk-a 14 hours ago

    Yeah - I think that the extremely fast depreciation just due to wear and use on GPUs is pretty underappreciated right now. So you've spent 300 mil on a brand new data center - congrats - you'll need to pay off that loan and somehow raise another 100 mil to actually maintain that capacity for three years based on chip replacement alone (rough numbers sketched below).

    There is an absolute glut of cheap compute available right now due to VC and other funds pouring into the industry (take advantage of it while it exists!), but I'm pretty sure Wall St. will balk when they realize the ongoing cost of maintaining that compute and look at the revenue that expenditure is generating. People think of chips as a piece of infrastructure - you buy a personal computer and it'll keep chugging for a decade without issue in most cases - but GPUs are essentially consumables: an input to producing the compute a data center sells that needs constant restocking, rather than a one-time investment.
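
    A minimal back-of-envelope sketch of that arithmetic in Python, using the figures from this comment (the three-year replacement cycle and totals are illustrative assumptions, not sourced data):

      initial_build = 300e6      # up-front data center spend (figure from the comment)
      replacement_3yr = 100e6    # assumed extra GPU spend to hold capacity for three years

      annual_replacement = replacement_3yr / 3
      print(f"~${annual_replacement / 1e6:.0f}M per year just to stand still on capacity,")
      print(f"i.e. ~{annual_replacement / initial_build:.0%} of the original build cost, every year,")
      print("before debt service on the loan or any revenue-side considerations.")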

    • davedx an hour ago

      There are some nuances there.

      - Most big tech companies are funding data center buildouts out of operating cash flow, not debt

      - The hyperscalers have in recent years been tweaking the depreciation schedules of regular cloud compute assets (extending them), so there's a push and a pull going on for CPU vs GPU depreciation

      - I don't think anyone who knows how to do fundamental analysis expects any asset to "keep chugging for a decade without issue" unless it's explicitly rated to do so (e.g. a solar panel). All assets have depreciation schedules; GPUs' are just shorter than average, and I don't think this is a big mystery to big money on Wall St (a toy example below)
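
      To make that push-and-pull concrete, here's a toy straight-line depreciation comparison; the 6-year server and 3-year GPU schedules and the dollar figures are assumptions for illustration, not any company's actual policy:

        def straight_line(cost: float, useful_life_years: int) -> float:
            """Annual depreciation expense for an asset under straight-line accounting."""
            return cost / useful_life_years

        cpu_server = straight_line(15_000, 6)    # hypothetical general-purpose server, 6-year schedule
        gpu_node = straight_line(300_000, 3)     # hypothetical GPU node, 3-year schedule

        print(f"CPU server: ${cpu_server:,.0f}/yr, GPU node: ${gpu_node:,.0f}/yr")
        # Extending server schedules lowers the reported annual expense, while short
        # GPU schedules pull expense forward - the "push and pull" mentioned above.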

  • chermi 12 hours ago

    Do we actually know how they're degrading? Are there still Pascals out there? If not, is it because they actually broke or because of poor performance? I understand it's tempting to say near-100% workload for multiple years = fast degradation, but what are the actual stats? Are you talking specifically about the actual compute chip or the whole compute system? I know there's a big difference now with the systems Nvidia is selling. How long do typical Intel/AMD CPU server chips last? My impression is a long time.

    If we're talking about the whole compute system like a gb200, is there a particular component that breaks first? How hard are they to refurbish, if that particular component breaks? I'm guessing they didn't have repairability in mind, but I also know these "chips" are much more than chips now so there's probably some modularity if it's not the chip itself failing.

    • hxorr 10 hours ago

      I watch a GPU repair guy and it's interesting to see the different failure modes...

      * memory IC failure

      * power delivery component failure

      * dead core

      * cracked BGA solder joints on core

      * damaged PCB due to sag

      These issues are compounded by

      * huge power consumption and heat output of core and memory, compared to system CPU/memory

      * physical size of core leads to more potential for solder joint fracture due to thermal expansion/contraction

      * everything needs to fit in PCIe card form factor

      * memory and core not socketed, if one fails (or supporting circuitry on the PCB fails) then either expensive repair or the card becomes scrap

      * some vendors have cards with design flaws which lead to early failure

      * sometimes poor application of thermal paste/pads at the factory (e.g., only half of the core making contact)

      * and, in my experience acquiring 4-5 year old GPUs to build gaming PCs with (to sell), almost without fail the thermal paste has dried up and the card is thermal throttling

      • oskarkk 6 hours ago

        These failures of consumer GPUs may be not applicable to datacenter GPUs, as the datacenter ones are used differently, in a controlled environment, have completely different PCBs, different cooling, different power delivery, and are designed for reliability under constant max load.

        • fennecbutt 3 hours ago

          Yeah, you're right - definitely not applicable at all, especially since Nvidia often supplies them tied into DGX units with cooling etc., i.e. a controlled environment.

          With a consumer GPU you have no idea whether it's been shoved into a hotbox of a case or not.

    • Workaccount2 10 hours ago

      Believe it or not, the GPUs from bitcoin farms are often the most reliable.

      Since they were run 24/7, there was rarely the kind of heat stress that kills cards (heating and cooling cycles).

      • buu700 6 hours ago

        Could AI providers follow the same strategy? Just throw any spare inference capacity at something to make sure the GPUs are running 24/7, whether that's model training, crypto mining, protein folding, a "spot market" for non-time-sensitive/async inference workloads, or something else entirely.
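
        As a rough sketch of that idea (job names and priorities are purely hypothetical), a scheduler could hand whatever capacity interactive traffic isn't using to a queue of preemptible batch work:

          def backfill(total_gpus: int, interactive_demand: int,
                       backlog: list[tuple[int, str]]) -> list[str]:
              """Assign idle GPUs to queued batch jobs, lowest priority number first;
              each job is assumed to occupy one whole GPU."""
              idle = max(total_gpus - interactive_demand, 0)
              return [name for _, name in sorted(backlog)[:idle]]

          # Example: 8 GPUs, 5 serving interactive traffic -> 3 left for the backlog
          print(backfill(8, 5, [(0, "spot-inference"), (1, "training-shard"),
                                (2, "protein-folding"), (3, "crypto-mining")]))
          # -> ['spot-inference', 'training-shard', 'protein-folding']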

  • epolanski 12 hours ago

    I'm not sure.

    The number of cycles the silicon goes through matters, but what really matters most are temperature and electrical shocks.

    If the GPUs are stable and kept at low temperature, they can run at full load for years. There are servers out there that have been up for decades.
