Comment by ryao 6 months ago

> Take the Nvidia H100 – a massive GPU weighing in at 814 mm². Traditionally this chip would be very difficult to yield economically. But since its cores (SMs) are fault tolerant, a manufacturing defect does not knock out the entire product. The chip physically has 144 SMs but the commercialized product only has 132 SMs active. This means the chip could suffer numerous defects across 12 SMs and still be sold as a flagship part.
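
To put rough numbers on the quoted 144-versus-132 claim, here is a minimal sketch of the yield arithmetic. It assumes each SM is independently defective with the same probability, which is a simplification, and the 5% defect rate is a made-up illustrative value, not a figure from Nvidia or the article. The function name `yield_with_spares` is likewise hypothetical.

```python
from math import comb


def yield_with_spares(total_units: int, required_units: int, p_defect: float) -> float:
    """Probability that at least `required_units` of `total_units` are defect-free.

    Assumes every unit fails independently with probability `p_defect`
    (a simple binomial model, not a real fab yield model).
    """
    max_bad = total_units - required_units  # number of spare units
    return sum(
        comb(total_units, k) * p_defect**k * (1 - p_defect) ** (total_units - k)
        for k in range(max_bad + 1)
    )


P_DEFECT = 0.05  # hypothetical per-SM defect probability, for illustration only

no_spares = yield_with_spares(144, 144, P_DEFECT)    # all 144 SMs must work
with_spares = yield_with_spares(144, 132, P_DEFECT)  # up to 12 SMs may be bad

print(f"sellable fraction, all 144 SMs required: {no_spares:.4%}")
print(f"sellable fraction, any 132 of 144 SMs:   {with_spares:.4%}")
```

Under this model, the 12 spare SMs only help with defects that are found and fused off before the part ships; nothing in the calculation covers an SM that fails later at runtime.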

"Fault tolerant" seems to be the wrong term to use here. If I had written this, I would have written "redundant".

jjk166 6 months ago

Redundant cores lead to a fault tolerant chip.

ryao 6 months ago

ECC memory is fault tolerant. It repairs issues on the fly without disabling hardware. This, on the other hand, is merely redundancy to handle manufacturing defects. If they make a mistake and ship a bad core that malfunctions at runtime, the chip is not going to tolerate that.

jjk166 6 months ago

Redundancy is a method of providing fault tolerance; the existence of other methods doesn't make it less fault tolerant. Nothing is tolerant to all possible faults. Fault tolerance refers to being able to tolerate specific types of faults under specific conditions. Fault tolerant is the proper term for this.
  
ryao 6 months ago

I think it would have been better to write redundant. It is more specific.
  