Comment by llm_trw
>The estimated training time for the end-to-end model on an 8×H100 machine is 2.6 days.
That's a ~$250,000 machine for the "micro budget". Or, if you don't want to run it locally, roughly $2,000 to train the one model on someone else's hardware.
You could do it on a single GPU, but you'd need gradient accumulation, and training would probably take 1–2 months on a consumer GPU.
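For anyone unfamiliar with the trick: gradient accumulation just means summing (appropriately scaled) gradients over several small micro-batches before doing one optimizer step, so a single GPU can emulate the effective batch size of a multi-GPU machine at the cost of wall-clock time. A minimal pure-Python sketch with a toy 1-D linear model (not the actual training code, just the arithmetic of why it works):

```python
# Toy model y = w * x with squared-error loss 0.5 * (w*x - y)^2.
# Shows that accumulating scaled micro-batch gradients reproduces
# the full-batch gradient, so one update is identical either way.

def grad(w, xs, ys):
    """Mean gradient of the loss w.r.t. w over a batch."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch):
    """Accumulate over micro-batches; scale each by its share of the batch."""
    total, n = 0.0, len(xs)
    for i in range(0, n, micro_batch):
        xb, yb = xs[i:i + micro_batch], ys[i:i + micro_batch]
        total += grad(w, xb, yb) * len(xb) / n
    return total

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5
print(abs(grad(w, xs, ys) - accumulated_grad(w, xs, ys, micro_batch=2)) < 1e-12)
```

The catch is that you pay for the big batch in sequential steps instead of parallel devices, which is where the 1–2 month estimate comes from.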