100x defect tolerance: How we solved the yield problem

331 points by jwan584 a year ago

ChuckMcM a year ago

I think this is an important step, but it skips over that 'fault tolerant routing architecture' means you're spending die space on routes vs transistors. This is exactly analogous to using bits in your storage for error correcting vs storing data.

That said, I think they do a great job of exploiting this technique to create a "larger"[1] chip. And like storage, it benefits from every core being the same and from not needing to reach every core directly (pin limiting).

In the early 2000's I was looking at a wafer scale startup that had the same idea but they were applying it to an FPGA architecture rather than a set of tensor units for LLMs. Nearly the exact same pitch, "we don't have to have all of our GLUs[2] work because the built in routing only uses the ones that are qualified." Xilinx was still aggressively suing people who put SERDES ports on FPGAs so they were pin limited overall but the idea is sound.

While I continue to believe that many people are going to collectively lose trillions of dollars ultimately pursuing "AI" at this stage, I appreciate that the amount of money people are willing to put at risk here allows folks to try these "out of the box" kinds of ideas.

[1] It is physically more cores on a single die but the overall system is likely smaller, given the integration here.

[2] "Generic Logic Unit" which was kind of an extended LUT with some block RAM and register support.


dogcomplex a year ago

Of course many people are going to collectively lose trillions. AI is a very highly hyped industry, with people racing into it without an intellectual edge, and any temporary achievement by any one company will be quickly replicated and undercut by another using the same tools. Economic success of the individuals swarming on a new technology is not guaranteed whatsoever, nor is it an indicator of the impact of the technology.
Just like the dotcom bubble, AI is gonna hit, make a few companies stinking rich, and make the vast majority (of both AI-chasing and legacy) companies bankrupt. And it's gonna rewire the way everything else operates too.

- idiotsecant a year ago
  
  >it's gonna rewire the way everything else operates too.
  This is the part that I think a lot of very tech literate people don't seem to get. I see people all the time essentially saying 'AI is just autocomplete' or pointing out that some vaporware ai company is a scam so surely everyone is.
  A lot of it is scams and flash in the pan. But a few of them are going to transform our lives in ways we probably don't even anticipate yet, for good and bad.
  
  
  Retric a year ago
  
  I’m not so sure it’s going to even do that much. People are currently happy to use LLMs, but the outputs aren’t accurate and don’t seem to be improving quickly.
  A YouTuber I watch regularly includes questions they asked ChatGPT, and every single time there’s a detailed response in the comments showing how the output is wildly wrong, with multiple mistakes.
  I suspect the backlash from disgruntled users is going to hit the industry hard and these models are still extremely expensive to keep updated.
  
- ithkuil a year ago
  
  Dollars are not lost; they are just very indirectly invested into gpu makers (and energy providers)
  
- Melomomololo a year ago
  
  [dead]
  
girvo a year ago

> Xilinx was still aggressively suing people who put SERDES ports on FPGAs
This so isn't important to your overall point, but where would I begin to look into this? Sounds fascinating!

- ChuckMcM a year ago
  
  Well this was the patent they were threatening with as I recall (https://patents.google.com/patent/US20030023912A1/en) and there was this one too: https://patents.google.com/patent/US5576554A/en
  Basically the "secret sauce" of the startup recruiting me was that they were going to do wafer scale FPGAs that could be tiled together to build arbitrarily complex systems like military phased array radars and such. All very hush hush but apparently they had recruited some key talent from Xilinx which was annoying Xilinx.
  
- nroize a year ago
  
  Not OP but I was curious too. Here's all I could find that seemed related: https://www.businesswire.com/news/home/20200121005582/en/Xil...
  
enragedcacti a year ago

Any thoughts on why they are disabling so many cores in their current product? I did some quick noodling based on the 46/970000 number and the only way I ended up close to 900,000 was by assuming that an entire row or column would be disabled if any core within it was faulty. But doing that gave me a ~6% yield as most trials had active core counts in the high 800,000s
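
The kind of noodling described above can be sketched as a small Monte Carlo run. This is only an illustration under stated assumptions, not a reconstruction of the commenter's script: the 985 x 985 square grid (about 970,000 cores), the Poisson-distributed defect count with mean 46, and the rule that both the row and the column of a faulty core are disabled are all assumptions, and every function name is hypothetical.

```python
import math
import random

SIDE = 985                # assumed square grid: 985 * 985 is about 970,000 cores
MEAN_DEFECTS = 46         # from the 46/970000 figure in the comment
TARGET_ACTIVE = 900_000   # active-core count the comment compares against


def sample_poisson(lam, rng):
    """Draw a Poisson-distributed integer (Knuth's method, fine for lam near 46)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def active_cores(defects, side=SIDE):
    """Cores left after disabling every row and column that contains a defect."""
    rows = {r for r, _ in defects}
    cols = {c for _, c in defects}
    # Cells where a disabled row crosses a disabled column are counted once.
    disabled = side * (len(rows) + len(cols)) - len(rows) * len(cols)
    return side * side - disabled


def estimate_yield(trials=2000, seed=0):
    """Fraction of simulated wafers that keep at least TARGET_ACTIVE cores."""
    rng = random.Random(seed)
    good = 0
    for _ in range(trials):
        n = sample_poisson(MEAN_DEFECTS, rng)
        defects = [(rng.randrange(SIDE), rng.randrange(SIDE)) for _ in range(n)]
        if active_cores(defects) >= TARGET_ACTIVE:
            good += 1
    return good / trials


if __name__ == "__main__":
    print(f"estimated yield: {estimate_yield():.1%}")
```

Under these assumptions a wafer with about 46 defects loses roughly 2 * 46 * 985 cores, which lands in the high 800,000s, so only wafers with an unusually low defect count clear the 900,000 bar.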

- ChuckMcM a year ago
  
  I could guess that it helps with heat dissipation/management. But I don't know. That guess is from looking at the list of patents[1] they have.
  [1] https://patents.justia.com/assignee/cerebras-systems-inc
  
- projektfu a year ago
  
  They did mention that they stash extra cores to enable the re-routing. Those extra cores are presumably unused when not routed in.
  
  
  enragedcacti a year ago
  
  That was my first thought but based on the rerouting graphic it seems like the extra cores would be one or two rows and columns around the border which would only account for ~4000 cores.
  
  
  projektfu a year ago
  
  If the system were broken down into more subdivisions internally, there would be more cores dedicated to replacement. It seems like it could be more difficult to reroute an entire row or column of cores on a wafer than a small block. Perhaps, also, they are building in heavy redundancy for POC and in the future will optimize the number of cores they expect to lose.
  
__Joker a year ago

"While I continue to believe that many people are going to collectively lose trillions of dollars ultimately pursuing "AI" at this stage"
Can you please explain more why you think so?
Thank you.

- mschuster91 a year ago
  
  It's a hype cycle with many of the hypers and deciders having zero idea about what AI actually is and how it works. ChatGPT, while amazing, is at its core a token predictor, it cannot ever get to an AGI level that you'd assume to be competitive to a human, even most animals.
  And just as every other hype cycle, this one will crash down hard. The crypto crashes were bad enough but at least gamers got some very cheap GPUs out of all the failed crypto farms back then, but this time so much more money, particularly institutional money, is flowing around AI that we're looking at a repeat of Lehman's once people wake up and realize they've been scammed.
  
  
  dsign a year ago
  
  Those glorified token predictors are the missing piece in the puzzle of general intelligence. There is a long way to go still in putting all those pieces together, but I don't think any of the steps left are in the same order of "we need a miracle breakthrough".
  That said, I believe that this is going one of two ways: we use AI to make things materially harder for humans, on a scale from "you don't get this job" to "oops, this is Skynet", with many unpleasant stops in the middle. By the amount of money going into AI right now and most of the applications I'm seeing being hyped, I don't think we have any scruples with this direction.
  The other way this can go, and Cerebras is a good example, is that we increase our compute capability and our AI-usefulness to a point where we can fight cancer and stop/revert aging, both being a computational problem at this point. Even if most people don't realize it, or most people have strong moral objections to this outcome and don't even want to talk about it, so it probably won't happen.
  In simpler words, I think we want to use AI to commit species suicide :-)
  
  
  KronisLV a year ago
  
  > And just as every other hype cycle, this one will crash down hard.
  Isn't that an inherent problem with pretty much everything nowadays: crypto, blockchain, AI, even the likes of serverless and Kubernetes, or cloud and microservices in general?
  There's always some hype cycle where the people who are early benefit and a lot of people chasing the hype later lose when the reality of the actual limitations and the real non-inflated utility of each technology hits. And then, a while later, it all settles down.
  I don't think the current "AI" is special in any way, it's just that everyone tries to get rich (or benefit in other ways, as in the microservices example, where you still very much had a hype cycle) quick without caring about the actual details.
  
  
  idiotsecant a year ago
  
  All the big LLMs are no longer just token predictors. They are beginning to incorporate memory, chain of thought, and other architectural tricks that use the token predictor in novel ways to produce some startlingly useful output.
  It's certainly the case that an LLM alone cannot achieve AGI. As a component of a larger system though? That remains to be seen. Maybe all we need to do is duct tape a limbic system and memory onto an LLM and the result is something sort of like an AGI.
  It's a little bit like saying that a ball bearing can't possibly ever be an internal combustion engine. While true, it's sidestepping the point a little bit.
  
  
  Shorel a year ago
  
  While I basically agree with everything you say, I have to add some caveats:
  ChatGPT, while being as far from true AGI as the ELIZA chatbot written in Lisp, is extraordinarily more useful, and is being used for many things that previously required humans to write the bullshit, like lobbying and propaganda.
  And Crypto... right now BTC is at a historical high. It could even go higher. And it will eventually crash again. It's the nature of that beast.
  
  
  immibis a year ago
  
  Why do you think that an AGI can't be a token predictor?
  
  
  CamperBob2 a year ago
  
  > it cannot ever get to an AGI level that you'd assume to be competitive to a human, even most animals.
  Suppose you turn out to be wrong. What would convince you?
  
- ChuckMcM a year ago
  
  I would guess you're not asking a serious question here, but if you were, feel free to contact me; it's why I put my email address in my profile.
  
  
  bigdict a year ago
  
  Why are you assuming bad faith?
  
  
  __Joker a year ago
  
  Really sorry if the question came across as snarky or otherwise; that was not my intent.
  Given all the noise around AI, I really wanted to understand the contrarian view of the monetary aspects.
  Once again, apologies if the question seems frivolous.
  

ajb a year ago

So they massively reduce the area lost to defects per wafer, from 361 to 2.2 square mm. But from the figures in this blog, this is massively outweighed by the fact that they only get 46222 sq mm useable area out of the wafer, as opposed to 56247 that the H100 gets - because they are using a single square die instead of filling the circular wafer with smaller square dies, they lose 10,025 sq mm!

Not sure how that's a win.

Unless the rest of the wafer is useable for some other customer?
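
The comparison above is simple arithmetic on the figures quoted in the comment, and can be checked with a short sketch. The variable names are hypothetical, and the calculation only compares silicon area; it says nothing about testing, packaging, or interconnect cost.

```python
# Figures quoted in the comment above, all in mm^2.
H100_USABLE = 56247       # usable area when the wafer is tiled with H100 dies
WSE_USABLE = 46222        # usable area of the single square wafer-scale die
H100_DEFECT_LOSS = 361    # area lost to defects per wafer, H100 case
WSE_DEFECT_LOSS = 2.2     # area lost to defects per wafer, wafer-scale case

geometry_gap = H100_USABLE - WSE_USABLE              # lost to the square-in-circle layout
defect_saving = H100_DEFECT_LOSS - WSE_DEFECT_LOSS   # gained from defect tolerance
net_h100 = H100_USABLE - H100_DEFECT_LOSS
net_wse = WSE_USABLE - WSE_DEFECT_LOSS

print(f"geometry gap:  {geometry_gap} mm^2")
print(f"defect saving: {defect_saving:.1f} mm^2")
print(f"net good area: H100 {net_h100} mm^2 vs wafer-scale {net_wse:.1f} mm^2")
```

On area alone, the wafer-scale part still ends up roughly 9,700 mm^2 behind, which is the point of the question.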


nine_k a year ago

It's a win because they have to test one chip, and don't have to spend resources on connecting the chiplets. The latter costs a lot (though it has other advantages). I suspect that a chiplet-based device with total 900k cores would just be not viable due to the size constraints.
If their routing around the defects is automated enough (given the highly regular structure), it may be a massive economy of efforts on testing and packaging the chip.

ungreased0675 a year ago

Why does it have to be a square? There’s no need to worry about interchangeable third-party heat sink compatibility. Is it possible to make it an irregular polygon instead of square?

kristjansson a year ago

Additional wafer area would be a marginal increase in performance (+~20% core count, best case) but increases the complexity of their design, and requires they figure out how to package/connect/house/etc. a non-standard shape. A wafer scale chip is already a huge tech risk, why spend more novelty budget on nonessential weirdness?

Scaevolus a year ago

Why does their chip have to be rectangular, anyways? Couldn't they cut out a (blocky) circle too?

- Qwertious a year ago
  
  You need a rectilinear polygon that tessellates, and has the fewest sides possible to minimize the number of cuts necessary. And it would probably help the cutting if the shape is entirely convex, so that cuts can overshoot a bit without damaging anything.
  That suggests a rectangle is the only possible shape.
  
  
  CorrectHorseBat a year ago
  
  If it's just one chip per wafer, why even bother cutting?
  
  
  timerol a year ago
  
  Why does it need to tessellate if there's only one chip per wafer?
  
- nine_k a year ago
  
  Rather I wonder why do they even need to cut the extra space, instead of putting something there. I suppose that the structure of the device is highly rectangular from the logical PoV, so there's nothing useful to put there. I suspect smaller unrelated chips can be produced on these areas along the way.
  
- guyzero a year ago
  
  I've never cut a wafer, but I assume cutting is hard and single straight lines are the easiest.
  
  
  sroussey a year ago
  
  I wonder if you could… just not cut the wafer at all??
  
- yannyu a year ago
  
  The cost driver for fabbing out wafers is the number of layers and the number of usable devices per wafer. Higher layer count increases cost and tends to decrease yield, and more robust designs with higher yields increase usable devices per wafer. If circles or other shapes could help with either of those, they would likely be used. Generally the end goal is to have the most usable devices per wafer, so they'll be packed as tightly as possible on the wafer so as to have the highest potential output.
  
  
  Scaevolus a year ago
  
  Right, but they're making just one usable device per wafer already.
  
olejorgenb a year ago

Is the wafer itself so expensive? I assume they don't pattern the unused area, so the process should be quicker?

- addaon a year ago
  
  > I assume they don't pattern the unused area
  I’m out of date on this stuff, so it’s possible things have changed, but I wouldn’t make that assumption. It is (used to be?) standard to pattern the entire wafer, with partially-off-the-wafer dice around the edges of the circle. The reason for this is that etching behavior depends heavily on the surrounding area: the amount of silicon or copper or whatever etched in your neighborhood affects the speed of etching for you, which affects line width, and (for a single mask used for the whole wafer) thus either means you need to have more margin on your parameters (equivalent to running on an old process) or have a higher defect rate near the edge of the die (which you do anyway, since you can only take “similar neighborhood” so far). This goes as far as, for hyper-optimized things like SRAM arrays, leaving an unused row and column at each border of the array.
  
  
  kurthr a year ago
  
  All the process steps are limited by wafers per hour. Lithography (esp EUV) might be slightly faster, but that's not 30% of total steps, since you generally have deposit and etch/implant for every lithography step.
  It's close to a dead loss in process cost.
  
- yannyu a year ago
  
  > I assume they don't pattern the unused area, so the process should be quicker?
  The primary driver of time and cost in the fabrication process is the number of layers for the wafers, not the surface area, since all wafers going through a given process are the same size. So you generally want to maximize the number of devices per wafer, because a large part of your costs will be calculated at the per-wafer level, not a per-device level.
  
  
  mattashii a year ago
  
  Yes, but isn't a big driver of layer costs the cost of the machines to build those layers?
  For patterning, a single iteration could be (example values, no actual values used, probably only ballpark accuracy) on a 300M$ EUV machine with 5-year write off cycle, patterns on average 180 full wafers /hour. Excluding energy usage and service time, each wafer that needs full patterning would cost ~38$. If each wafer only needed half the area patterned, the lithography machine might only spend half its usual time on such a wafer, and that could double the throughput of the EUV machine, halving the write-off based cost component of such a patterning step.
  Given that each layer generally consists of multiple patterning steps, a 10-20% reduction in those steps could give a meaningful reduction in time spent in the machines whose time spent on the wafer depends on the used wafer area.
  This of course doesn't help reduce time in polishing or etching (and other steps that happen with whole wafers at a time), so it won't be as straightforward as % reduction in wafer area usage == % reduction in cost, but I wouldn't be surprised if it was a meaningful percentage.
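
The per-wafer write-off estimate above can be reproduced with a few lines. As in the comment, the inputs are example values rather than real fab numbers, the function name is hypothetical, and the sketch assumes the machine runs around the clock with no energy or service costs.

```python
def litho_cost_per_wafer(machine_cost_usd, writeoff_years, wafers_per_hour):
    """Write-off cost attributed to one wafer pass, assuming 24/7 operation."""
    hours = writeoff_years * 365 * 24
    return machine_cost_usd / (hours * wafers_per_hour)


# Example values from the comment: $300M machine, 5-year write-off, 180 wafers/hour.
full = litho_cost_per_wafer(300e6, 5, 180)
# If only half the area needs patterning and throughput doubles as a result:
half = litho_cost_per_wafer(300e6, 5, 360)

print(f"full-wafer patterning: ${full:.2f} per pass")
print(f"half-wafer patterning: ${half:.2f} per pass")
```

The saving applies only to the exposure passes; whole-wafer steps such as deposition, etch, and polishing are unaffected in this model.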
  
  
  yannyu a year ago
  
  > Yes, but isn't a big driver of layer costs the cost of the machines to build those layers?
  Let's say the time spent in lithography step is linear the way you're describing. Even with that, the deposition step beforehand is surface area independent and would be applied across the entire wafer, and takes just as long if not longer than the lithography.
  Additionally, if you were going to build a fab ground up for some specific purpose, then you might optimize the fab for those specific devices as you lay out. But most of these companies are not doing that and are simply going through TSMC or a similar subcontractor. So you've got an additional question of how far TSMC will go to accommodate customers who only want to use half a wafer, and whether that's the kind of project they could profitably cater to.
  
  
  olejorgenb a year ago
  
  Yes, but my understanding is that the wafer is exposed in multiple steps, so there would still be fewer exposure steps? Probably insignificant compared to all the rest though. (Etching, moving the wafer, etc.)
  EDIT: to clarify - I mean the exposure of one single pattern/layer is done in multiple steps. (https://en.wikipedia.org/wiki/Photolithography#Projection)
  
- pulvinar a year ago
  
  There's also no reason they couldn't pattern that area with some other suitable commodity chips. Like how sawmills and butchers put all cuts to use.
  
  
  sitkack a year ago
  
  Often those areas are used for test chips and structures for the next version. They are effectively free, so you can use them to test out ideas.
  
- ajb a year ago
  
  Good question. I think the wafer has a cost per area which is fairly significant, but I don't have any figures. There has historically been a push to utilise them more efficiently, eg by building fabs that can process larger wafers. Although mask exposure would be per processed area, I think that there is also some proportion of processing time which is per wafer, so the unprocessed area would have an opportunity cost relating to that.
  
- kristjansson a year ago
  
  AIUI Wafer marginal cost is lower than you'd expect. I had $50k in my head, quick google indicates[1] maybe <$20k at AAPL volumes? Regardless seems like the economics for Cerebras would strongly favor yield over wafer area utilization.
  [1] https://www.tomshardware.com/tech-industry/tsmcs-wafer-prici...
  
- georgeburdell a year ago
  
  They probably pattern at least next nearest neighbors for local uniformity. That’s just litho though. The rest of the process is done all at once on the wafer
  
sroussey a year ago

It’s a win if you can use the wafer as opposed to throwing it away.

- kristjansson a year ago
  
  A win is a manufacturing process that results in a functioning product. Wafers, etc. aren't so scarce as to demand every mm2 be used on every one every time.
  

NickHoff a year ago

Neat. What about power density?

An H100 has a TDP of 700 watts (for the SXM5 version). With a die size of 814 mm^2 that's 0.86 W/mm^2. If the Cerebras chip has the same power density, that means a Cerebras TDP of 39.8 kW.

That's a lot. Let's say you cover the whole die area of the chip with water 1 cm deep. How long would it take to boil the water starting from room temperature (20 degrees C)?

amount of water = (die area of 46225 mm^2) * (1 cm deep) * (density of water) = 462 grams

energy needed = (specific heat of water) * (80 kelvin difference) * (462 grams) = 154 kJ

time = 154 kJ / 39.8 kW = 3.9 seconds

This thing will boil (!) a centimeter of water in 4 seconds. A typical consumer water cooler radiator would reduce the temperature of the coolant water by only 10-15 C relative to ambient, and wouldn't like it (I presume) if you pass in boiling water. To use water cooling you'd need some extreme flow rate and a big rack of radiators, right? I don't really know. I'm not even sure if that would work. How do you cool a chip at this power density?
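
The back-of-the-envelope numbers above can be checked with a short script. It follows the comment's own premise that the wafer runs at H100 power density; the specific heat of water (4.186 J/(g*K)) is a textbook value that is not stated in the comment, and the constant names are hypothetical.

```python
H100_TDP_W = 700.0                # SXM5 TDP from the comment
H100_DIE_MM2 = 814.0              # H100 die area from the comment
WAFER_DIE_MM2 = 46225.0           # wafer-scale die area from the comment
WATER_DEPTH_MM = 10.0             # 1 cm of water over the whole die
WATER_DENSITY_G_PER_MM3 = 1e-3    # 1 g/cm^3
SPECIFIC_HEAT_J_PER_G_K = 4.186   # assumption: textbook value for liquid water
DELTA_T_K = 80.0                  # 20 C up to 100 C

power_density = H100_TDP_W / H100_DIE_MM2        # W/mm^2
wafer_power_w = power_density * WAFER_DIE_MM2    # W, if the density carried over
water_mass_g = WAFER_DIE_MM2 * WATER_DEPTH_MM * WATER_DENSITY_G_PER_MM3
energy_j = SPECIFIC_HEAT_J_PER_G_K * DELTA_T_K * water_mass_g
time_to_boiling_s = energy_j / wafer_power_w

print(f"power density:   {power_density:.2f} W/mm^2")
print(f"wafer power:     {wafer_power_w / 1000:.1f} kW")
print(f"water mass:      {water_mass_g:.0f} g")
print(f"heating energy:  {energy_j / 1000:.0f} kJ")
print(f"time to 100 C:   {time_to_boiling_s:.1f} s")
```

This only covers heating the water to its boiling point, not vaporizing it, and the result scales linearly with whatever power figure is plugged in.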


Paul_Clayton a year ago

The enthalpy of vaporization of water (at standard pressure) is listed by Wikipedia[1] as 2.257 kJ/g, so boiling 462 grams would require an additional 1.04 MJ, adding 26 seconds. Cerebras claims a "peak sustained system power of 23kW" for the CS-3 16 Rack Unit system[2], so clearly the power density is lower than for an H100.
[1] https://en.wikipedia.org/wiki/Enthalpy_of_vaporization#Other... [2] https://cerebras.ai/product-system/
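
A quick sketch of the two figures above: the extra time needed to vaporize the water at the hypothetical 39.8 kW, and the power density implied by the quoted 23 kW. One caveat is an assumption on my part: the 23 kW is a system-level figure, so dividing all of it by the die area gives an upper bound on the chip's own power density. The constant names are hypothetical.

```python
LATENT_HEAT_J_PER_G = 2257.0      # enthalpy of vaporization quoted above
WATER_MASS_G = 462.0              # mass of water from the parent comment
HYPOTHETICAL_POWER_W = 39_800.0   # wafer power at H100 power density
SYSTEM_POWER_W = 23_000.0         # quoted peak sustained system power
WAFER_DIE_MM2 = 46225.0           # die area from the parent comment
H100_DENSITY_W_PER_MM2 = 700.0 / 814.0

vaporization_energy_j = LATENT_HEAT_J_PER_G * WATER_MASS_G
extra_time_s = vaporization_energy_j / HYPOTHETICAL_POWER_W
implied_density = SYSTEM_POWER_W / WAFER_DIE_MM2   # upper bound, see lead-in

print(f"vaporization energy: {vaporization_energy_j / 1e6:.2f} MJ")
print(f"extra time:          {extra_time_s:.0f} s")
print(f"implied density:     {implied_density:.2f} W/mm^2 "
      f"(H100: {H100_DENSITY_W_PER_MM2:.2f} W/mm^2)")
```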

- twic a year ago
  
  On a tangent: has anyone built an active cooling system which operates in a partial vacuum? At half atmospheric pressure, water boils at around 80 C, which i believe is roughly the operating temperature for a hard-working chip. You could pump water onto the chip, have it vapourise, taking away all that heat, then take the vapour away and condense it at the fan end.
  This is how heat pipes work, i believe, but heat pipes aren't pumped, they rely entirely on heat-driven flow. I would have thought there were pumped heat pipes. Are they called something else?
  It's also not a refrigerator, because those use a pump to pressurise the coolant in its gas phase, whereas here you would only be pumping the water.
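
The "around 80 C at half an atmosphere" figure can be sanity-checked with the Antoine equation. This is a rough sketch: the coefficients below are a commonly quoted set for water (pressure in mmHg, temperature in C, roughly valid from 1 to 100 C) and are an assumption here, not something taken from the thread. The function name is hypothetical.

```python
import math

# Assumed Antoine coefficients for water: log10(P_mmHg) = A - B / (C + T_celsius)
A, B, C = 8.07131, 1730.63, 233.426


def boiling_point_c(pressure_mmhg):
    """Temperature at which water's vapor pressure equals the given pressure."""
    return B / (A - math.log10(pressure_mmhg)) - C


print(f"1.0 atm (760 mmHg): {boiling_point_c(760):.1f} C")
print(f"0.5 atm (380 mmHg): {boiling_point_c(380):.1f} C")
```

Under these coefficients the boiling point at half an atmosphere comes out a little above 80 C, consistent with the estimate above.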
  
  
  pants2 a year ago
  
  No need to bother with a partial vacuum when ethanol boils at around 80 C as well and doesn't destroy electronics. I'm not aware of any active cooling systems utilizing this though.
  
  
  TehCorwiz a year ago
  
  I found this review from 2019 of mechanically pumped heat pipe technologies. I skimmed the intro. Looks like it already has a foothold in aerospace.
  https://www.sciencedirect.com/science/article/abs/pii/S13594...
  
  
  Dylan16807 a year ago
  
  > This is how heat pipes work, i believe, but heat pipes aren't pumped, they rely entirely on heat-driven flow. I would have thought there were pumped heat pipes.
  Do you have a particular benefit in mind that a pump would help with?
  
buildbot a year ago

A Very Fancy cooling engine: https://www.eetimes.com/powering-and-cooling-a-wafer-scale-d...

jwan584 a year ago

A good talk on how Cerebras does power & cooling (8min) https://www.youtube.com/watch?v=wSptSOcO6Vw&ab_channel=Appli...

throwup238 a year ago

The machine that actually holds one of their wafers is almost as impressive as the chip itself. Tons of water cooling channels and other interesting hardware for cooling.

flopsamjetsam a year ago

Minor correction, the keynote video says ~20 kW

lostlogin a year ago

If rack mounted, you are ending up with something like a reverse power station.
So why not use it as an energy source? Spin a turbine.

- kristjansson a year ago
  
  If you let the chip actually boil enough water to run a turbine you're going to have a hard time keeping the magic smoke inside. Much better to run at reasonable temps and try to recover energy from the waste heat.
  
  
  ericye16 a year ago
  
  What if you chose a refrigerant with a lower boiling point?
  
  
  kristjansson a year ago
  
  That's basically the principle of binary cycle[1] generators. However for data center waste heat recovery, I'd think you'd want to use a more stable fluid for cooling, and then pump it to a separate closed-loop binary-cycle generator. No reason to make your datacenter cooling system also deal with high pressure fluids, and moving high pressure working fluid from 1000s of chips to a turbine of sufficient size, etc.
  [1]: https://en.wikipedia.org/wiki/Binary_cycle
  
- renhanxue a year ago
  
  There's a bunch of places in Europe that use waste heat from datacenters in district heating systems. Same thing with waste heat from various industrial processes. It's relatively common practice.
  
- sebzim4500 a year ago
  
  If my very stale physics is accurate then even with perfect thermodynamic efficiency you would only recover about a third of the energy that you put into the chips.
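
The limit being referred to is presumably the Carnot bound, 1 - T_cold / T_hot with temperatures in kelvin. A minimal sketch follows; the temperatures are assumptions chosen for illustration (an 80 C or 100 C hot side, 20 C ambient), not figures from the thread, and the function name is hypothetical.

```python
def carnot_efficiency(t_hot_c, t_cold_c):
    """Upper bound on the fraction of heat convertible to work between two temperatures."""
    t_hot_k = t_hot_c + 273.15
    t_cold_k = t_cold_c + 273.15
    return 1.0 - t_cold_k / t_hot_k


# Assumed operating points, for illustration only.
for t_hot in (80.0, 100.0):
    eff = carnot_efficiency(t_hot, 20.0)
    print(f"hot side {t_hot:.0f} C, cold side 20 C: at most {eff:.1%}")
```

The bound depends strongly on the hot-side temperature: with a 20 C cold side, reaching one third would need a hot side of roughly 167 C, so at typical chip coolant temperatures the recoverable fraction would be lower than a third.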
  
  
  dylan604 a year ago
  
  1/3 > 0, so even if you don't get a $0 energy bill I'd venture that any company that could get 1/3 off their energy bill would be happy
  
- bentcorner a year ago
  
  I'm aware of the efficiency losses but I think it would be amusing to use that turbine to help power the machine generating the heat.
  
  
  twic a year ago
  
  Hey, we're building artificial general intelligence, what's a little perpetual motion on the side?
  

highfrequency a year ago

To summarize: localize defect contamination to a very small unit size, by making the cores tiny and redundant.

Analogous to a conglomerate wrapping each business vertical in a limited liability veil so that lawsuits and bankruptcy do not bring down the whole company. The smaller the subsidiaries, the less defect contamination but also the less scope for frictionless resource and information sharing.