giancarlostoro 2 days ago

I don't even get this trend. Wouldn't OpenAI be buying ECC RAM only anyway? Who in their right mind runs this much infrastructure on non-ECC RAM??? Makes no sense to me. Same with GPUs: they aren't buying your 5090s. People's perception is wild to me.

jsheard 2 days ago

OpenAI bought out Samsung's and SK Hynix's DRAM wafers in advance, so they'll prioritize producing whatever OpenAI wants to deploy, whether that's DDR/LPDDR/GDDR/HBM, with or without ECC. That means far fewer wafers for everything else, so even if you want a different spec you're still shit out of luck.

  • nirui 2 days ago

    You forgot to mention that everyone else also raised their prices because, you know, who doesn't like free money.

    Last year I bought two 8 GB DDR3L RAM sticks made by Gloway for around $8 each; now the same stick is priced around $22, a 175% increase in price.

    SSD makers are also increasing their prices, but that started one or two years ago, and they did it again recently (of course).

    It looks like I won't be buying any first-hand computers/parts until prices return to normal.
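As a quick sanity check on those figures (prices taken from the comment above): going from $8 to $22 is a 175% increase, i.e. 2.75x the original price, two numbers that are easy to mix up.

```python
old_price, new_price = 8, 22                    # USD, from the comment above
increase = (new_price - old_price) / old_price  # increase over the old price
ratio = new_price / old_price                   # multiple of the old price
assert abs(increase - 1.75) < 1e-9              # a 175% increase...
assert abs(ratio - 2.75) < 1e-9                 # ...is 2.75x the old price
```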

    • wqaatwt a day ago

      > you know, who doesn't like free money.

      Yes, but otherwise you'd get huge shortages and would be unlikely to be able to buy it at all. Also, a significant proportion of the surplus currently going to manufacturers etc. would instead go to scalpers and resellers.

crote 2 days ago

ECC memory is a bit like RAID: A consumer-level RAM stick will (traditionally) have 8 8-bit-wide chips operating basically in RAID-0 to provide 64-bit-wide access, whereas enterprise-level RAM sticks will operate with 9 8-bit-wide chips in something closer to RAID-4 or -5.

But they are all exactly the same chips. The ECC magic happens in the memory controller, not the RAM stick. Anyone buying ECC RAM for servers is buying on the same market as you building a new desktop computer.
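The "ECC magic in the memory controller" described above is typically a SECDED (single-error-correct, double-error-detect) code over each 64-bit word, with the 8 check bits stored on the 9th chip. Here is a minimal Python sketch of an extended Hamming(72,64) code, as an illustration of the idea rather than any real controller's implementation:

```python
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]   # powers of two within 1..71

def encode(data):
    """Encode 64 data bits into a 72-bit SECDED word (extended Hamming)."""
    assert len(data) == 64
    word = [0] * 72              # word[0]: overall parity, word[1..71]: Hamming(71,64)
    bits = iter(data)
    for pos in range(1, 72):
        if pos & (pos - 1):      # not a power of two, so it's a data position
            word[pos] = next(bits)
    for p in PARITY_POSITIONS:   # parity bit p covers every position with bit p set
        for pos in range(1, 72):
            if (pos & p) and pos != p:
                word[p] ^= word[pos]
    for pos in range(1, 72):     # overall parity enables double-error *detection*
        word[0] ^= word[pos]
    return word

def correct(word):
    """Repair a single-bit error in place; raise on a detected double-bit error."""
    syndrome = 0
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, 72):
            if pos & p:
                parity ^= word[pos]
        if parity:
            syndrome |= p        # syndrome ends up equal to the flipped position
    overall = 0
    for b in word:
        overall ^= b
    if syndrome and not overall:
        raise ValueError("uncorrectable double-bit error")
    if syndrome:                 # single flipped bit, at position `syndrome`
        word[syndrome] ^= 1
    elif overall:                # the overall parity bit itself flipped
        word[0] ^= 1
    return word

def data_bits(word):
    """Extract the 64 data bits back out of a 72-bit word."""
    return [word[pos] for pos in range(1, 72) if pos & (pos - 1)]
```

Flip any single one of the 72 bits and `correct` restores the word; flip two and it raises, which is what the controller would report as an uncorrectable error.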

  • kvemkon a day ago

    > enterprise-level RAM sticks will operate with 9 8-bit-wide chips

    Since DDR5 has 2 independent subchannels, 2 additional chips are needed.

  • embedding-shape 2 days ago

    > Anyone buying ECC RAM for servers is buying on the same market as you building a new desktop computer.

    Even when the sticks are completely incompatible with each other? I think servers tend to use RDIMMs while desktops use UDIMMs. Personally, I'm not seeing as steep an increase in RDIMMs (B2B) as in UDIMMs (B2C), but I'm also looking at different stores tailored to different types of users.

    • StrLght 2 days ago

      The expensive part is the DRAM chips, and they drive the prices of the sticks.

drum55 2 days ago

At the chip level there’s no difference as far as I’m aware, you just have 9 bits per byte rather than 8 bits per byte physically on the module. More chips but not different chips.

  • cesarb 2 days ago

    > you just have 9 bits per byte rather than 8 bits per byte physically on the module. More chips but not different chips.

    For those who aren't well versed in the construction of memory modules: take a look at your DDR4 memory module. You'll see 8 identical chips per side if it's a non-ECC module, and 9 identical chips per side if it's an ECC module. The address and command buses are connected in parallel to all of them, while each chip drives its own slice of the data bus. On non-ECC modules, the data lines that would carry the parity/ECC bits are simply not connected, while on ECC modules they're connected to the 9th chip.

    (For DDR5, things are a bit different, since each memory module is split in two halves, with each half having 4 or 5 chips per side, but the principle is the same.)
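To make the wiring described above concrete, here's a tiny sketch of the data-bus split (assuming x8 DRAM devices, the common case for modules with 8 or 9 chips per side; x4 and x16 layouts divide the bus differently):

```python
def chip_for_dq(dq_line: int) -> int:
    """Which chip drives a given line of the 72-bit ECC data bus (x8 devices)."""
    return dq_line // 8

# DQ0-63 carry the data word; the check bits (numbered 64-71 here) go to chip 8.
data_chips = {chip_for_dq(dq) for dq in range(64)}
check_chip = {chip_for_dq(dq) for dq in range(64, 72)}
assert data_chips == set(range(8))   # data spread over chips 0-7
assert check_chip == {8}             # the 9th chip holds the ECC bits
```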

raddan 2 days ago

I seriously doubt that single-bit errors at the scale of OpenAI's workloads really matter very much, particularly in a domain that is already noisy.

  • PunchyHamster 2 days ago

    Until they hit your program memory. We just had a really interesting incident where one of our Ceph nodes didn't fail outright but started acting erratically, bringing the whole cluster to a crawl, once a failing RAM module developed uncorrectable errors.

    And that was caught only because we had ECC. If not for that, we'd have been replacing drives, because the metrics made it look like one of the OSDs was slowing to a crawl, and the usual cause of that is a dying drive.

    Of course, the chance of that is pretty damn small, but their scale is also pretty damn big.

  • close04 2 days ago

    Random bit flips is their best path to AGI.

MangoToupe 2 days ago

On the flip side, LLMs are so inconsistent you might argue ECC is a complete waste of money. But OpenAI wasting money is hardly anything new.

MisterTea 2 days ago

ECC modules use the same chips as non-ECC modules, so demand for them eats into the consumer market too.

  • officialchicken 2 days ago

    Good point! But they are slightly more energy hungry. At these scales I wonder if Stargate could go with one less nuclear reactor simply by switching to non-ECC RAM

    • Majromax 2 days ago

      Penny-wise and pound-foolish. Non-ECC RAM might save on the small slice of node power that RAM draws, but if a bit flip causes a failed computation then an entire forward/backward step (possibly involving several nodes) might need to be redone.
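A back-of-envelope way to frame Majromax's trade-off; every number below is an illustrative assumption, not a measured figure:

```python
# Power side: ECC adds one DRAM chip per eight, applied to whatever
# fraction of node power DRAM consumes. Both figures are assumptions.
ecc_dram_overhead   = 1 / 8       # 9 chips instead of 8
dram_power_fraction = 0.10        # assume DRAM is ~10% of node power
power_cost = ecc_dram_overhead * dram_power_fraction   # ~1.25% of node power

# Failure side: without ECC, an undetected flip can force a rollback to the
# last checkpoint, and the rollback stalls every node in the job, not one.
nodes               = 1024        # assumed job size
flips_per_node_hour = 1e-3        # assumed uncorrected-flip rate (the disputed number)
checkpoint_hours    = 1.0         # assumed checkpoint interval; mean loss is half of it
lost_fraction = nodes * flips_per_node_hour * checkpoint_hours / 2   # ~51% of runtime
```

The asymmetry is that the failure cost multiplies by the node count while the power cost doesn't, and the whole comparison hinges on the flip rate, which is exactly the number nobody agrees on.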

      • hylaride 2 days ago

        Linus Torvalds was recently on Linus Tech Tips to build a new computer, and he insisted on ECC RAM. Torvalds is convinced that memory errors are a much bigger stability problem than commonly assumed, and he's spent an inordinate amount of time chasing phantom bugs because of them.

        https://www.youtube.com/watch?v=mfv0V1SxbNA

      • coldtea 2 days ago

        > but if a bit flip causes a failed computation then an entire forward/backward step (possibly involving several nodes) might need to be redone.

        Which for the most part would be an irrelevant cost of doing business compared to the huge savings from non-ECC, and given how inconsequential it is if some ChatGPT computation fails...

KeplerBoy 2 days ago

The 5090 is the same chip as the workstation RTX 6000.

Of course OpenAI isn't buying those either, but B200 DGX systems; still, it's the same process at TSMC.

coldtea 2 days ago

ECC RAM's utility is overblown. Major companies often use off-the-shelf, non-enterprise parts for huge server installations, including regular RAM. The rare bit flip is hardly a major concern at their scale, and for their specific purposes.

  • wtallis 2 days ago

    Most server CPUs require RDIMMs, and while non-ECC RDIMMs exist, they are not a high-volume product and are intended for workstations rather than servers. The used parts market would look very different if there were lots of large-scale server deployments using non-ECC memory modules.

  • Glemkloksdjf 2 days ago

    Do you have a source for this?

    I would not want to rerun a whole run just because of bit flips, and bit flips become a lot more relevant the more servers you have.