Comment by andix 17 hours ago

Failure rates also go up. For AI inference that's probably not too bad in most cases: just take the node offline and reschedule its jobs to other nodes.
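
A rough sketch of what "take the node offline and reschedule" could look like, assuming the inference jobs are stateless and can simply be restarted elsewhere. The `Node` class and `drain_and_reschedule` function are made up for illustration and don't correspond to any particular scheduler's API:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical in-memory model of a cluster node.
    name: str
    healthy: bool = True
    jobs: list[str] = field(default_factory=list)


def drain_and_reschedule(nodes: list[Node], failed_name: str) -> dict[str, str]:
    """Mark one node offline and move its jobs to the least-loaded healthy nodes.

    Returns a mapping of job name -> name of the node it was moved to.
    """
    failed = next((n for n in nodes if n.name == failed_name), None)
    if failed is None:
        raise KeyError(f"unknown node: {failed_name}")

    # Take the node out of rotation before touching its jobs.
    failed.healthy = False
    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        raise RuntimeError("no healthy nodes left to take the jobs")

    moved: dict[str, str] = {}
    for job in failed.jobs:
        # Assumption: job count is a good enough proxy for load.
        target = min(healthy, key=lambda n: len(n.jobs))
        target.jobs.append(job)
        moved[job] = target.name
    failed.jobs = []
    return moved
```

For example, with nodes `a` (two jobs), `b` (one job) and `c` (idle), draining `a` spreads its two jobs across `b` and `c`. A real cluster would also need health checks and capacity limits, which this leaves out.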