Comment by ryao

Comment by ryao 10 months ago

> Take the Nvidia H100 – a massive GPU weighing in at 814 mm². Traditionally this chip would be very difficult to yield economically. But since its cores (SMs) are fault tolerant, a manufacturing defect does not knock out the entire product. The chip physically has 144 SMs but the commercialized product only has 132 SMs active. This means the chip could suffer numerous defects across 12 SMs and still be sold as a flagship part.

Fault tolerance seems to be the wrong term to use here. If I had written this, I would have written "redundant".
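
For a rough sense of the arithmetic in the quoted passage, here is a minimal sketch of how spare SMs change the fraction of sellable dies. It assumes each SM is defective independently with the same probability and that defects land only in SMs; both are simplifications. The helper `sellable_fraction` and the value of `P_DEFECT` are hypothetical and chosen only for illustration, not taken from any Nvidia or foundry data.

```python
from math import comb


def sellable_fraction(total_sms: int, required_sms: int, p_defect: float) -> float:
    """Probability that at least `required_sms` of `total_sms` SMs are defect-free.

    Assumes each SM is defective independently with probability `p_defect`
    (a simple binomial model, not a real yield model).
    """
    max_bad = total_sms - required_sms  # how many defective SMs can be fused off
    return sum(
        comb(total_sms, k) * p_defect**k * (1 - p_defect) ** (total_sms - k)
        for k in range(max_bad + 1)
    )


P_DEFECT = 0.02  # hypothetical per-SM defect probability, not a measured figure

no_spares = sellable_fraction(144, 144, P_DEFECT)    # every SM must work
with_spares = sellable_fraction(144, 132, P_DEFECT)  # up to 12 SMs may be disabled

print(f"all 144 SMs required: {no_spares:.3f}")
print(f"132 of 144 required:  {with_spares:.3f}")
```

Under this toy model, allowing 12 of the 144 SMs to be disabled raises the sellable fraction well above what requiring all 144 would give. Either term used in this thread describes that same mechanism.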

jjk166 10 months ago

Redundant cores lead to a fault tolerant chip.

  • ryao 10 months ago

    ECC memory is fault tolerant. It repairs issues on the fly without disabling hardware. This, on the other hand, is merely redundancy to cover manufacturing defects. If they make a mistake and ship a bad core that malfunctions at runtime, it is not going to tolerate that.

    • jjk166 10 months ago

      Redundancy is a method of providing fault tolerance; the existence of other methods doesn't make it less fault tolerant.

      Nothing is tolerant to all possible faults. Fault tolerance refers to being able to tolerate specific types of faults under specific conditions.

      Fault tolerant is the proper term for this.

      • ryao 10 months ago

        I think it would have been better to write redundant. It is more specific.