Comment by godelski
From the abstract
Finally, using only 37M publicly available real and synthetic images, we train a 1.16 billion parameter sparse transformer with only $1,890 economical cost and achieve a 12.7 FID in zero-shot generation on the COCO dataset. Notably, our model achieves competitive FID and high-quality generations while incurring 118× lower cost than stable diffusion models and 14× lower cost than the current state-of-the-art approach that costs $28,400.
Figure 1 Qualitative evaluation of the image generation capabilities of our model (512×512 image resolution). Our model is trained in 2.6 days on a single 8×H100 machine (amounting to only $1,890 in GPU cost) without any proprietary or billion image dataset.
End of intro under the key contributions bullet points - Using a micro-budget of only $1,890, we train a 1.16 billion parameter sparse diffusion transformer on 37M images and a 75% masking ratio that achieves a 12.7 FID in zero-shot generation on the COCO dataset. The wall-clock time of our training is only 2.6 days on a single 8×H100 GPU machine, 14× lower than the current state-of-the-art approach that would take 37.6 training days ($28,400 GPU cost).
I'm just saying, the authors are not trying to hide this point. They are making it abundantly clear.
I should also mention that this is the most straightforward way to discuss pricing. It is going to be much more difficult if they do comparisons including the costs of the machines, as then there needs to be an amortization cost baked in, and that's going to have to include costs of electricity, supporting hardware, networking, how long the hardware is used for, at what utilization the hardware runs, costs of employees to maintain it, and all that fun stuff. Which... you can estimate by... GPU rental costs... since rental prices do in fact bake those numbers in. They explain their numbers in the appendix under Table 5. It is estimated at $3.75/H100/hr.
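As a rough sanity check on those figures, here is a minimal sketch that back-solves the GPU-hours implied by the quoted $1,890 and $3.75/H100/hr, then compares that with the quoted 2.6 days on 8 GPUs. The variable names are mine, and reading the small gap as rounding of the wall-clock time is my assumption, not something the paper states.

```python
# Figures quoted above from the paper; variable names are illustrative.
H100_RATE_USD_PER_HOUR = 3.75
GPUS = 8
REPORTED_COST_USD = 1890
REPORTED_DAYS = 2.6

# Back-solve: how many GPU-hours does the reported cost imply?
gpu_hours = REPORTED_COST_USD / H100_RATE_USD_PER_HOUR
wall_clock_days = gpu_hours / GPUS / 24

# Forward: what does the rounded wall-clock time cost at the same rate?
forward_cost = GPUS * REPORTED_DAYS * 24 * H100_RATE_USD_PER_HOUR

print(f"implied GPU-hours: {gpu_hours:.0f}")
print(f"implied wall-clock days: {wall_clock_days:.3f}")
print(f"cost from 2.6 days: ${forward_cost:,.0f}")
```

The implied wall-clock time comes out slightly above 2.6 days, and the forward calculation lands a little under $1,890, so the quoted numbers look mutually consistent to within rounding.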
Btw, they also state a conversion to A100s
I've been collecting papers on training models on small numbers of GPUs. What I look for is (a) type of GPU, (b) how many, and (c) how long it ran. I can quickly establish a minimum cost from that.
I say minimum because there's pre-processing data, setting up the machine configuration, trial runs on small data to make sure it's working, repeats during the main run if failures happened, and any time to load or offload data (e.g., checkpoints) from the GPU instance. So, the numbers in the papers are a nice minimum rather than the actual cost of a replication, which is highly specific to one's circumstances.
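The (a)/(b)/(c) bookkeeping can be sketched roughly like this. `TrainingRun`, `RENTAL_RATES_USD_PER_GPU_HOUR`, and `minimum_cost_usd` are hypothetical names for illustration; the only rate filled in is the $3.75/H100/hr figure quoted earlier in the thread, and anything else would have to come from whatever rental prices you trust.

```python
from dataclasses import dataclass


@dataclass
class TrainingRun:
    gpu_type: str   # (a) type of GPU
    gpu_count: int  # (b) how many
    days: float     # (c) how long it ran


# Hypothetical rate table. Only the H100 entry comes from the thread.
RENTAL_RATES_USD_PER_GPU_HOUR = {"H100": 3.75}


def minimum_cost_usd(run, rates=RENTAL_RATES_USD_PER_GPU_HOUR):
    """Lower bound: main-run GPU-hours times the hourly rental rate.

    Excludes pre-processing, setup, trial runs, reruns after failures,
    and checkpoint transfer time, so a real replication costs more.
    """
    gpu_hours = run.gpu_count * run.days * 24
    return gpu_hours * rates[run.gpu_type]


run = TrainingRun(gpu_type="H100", gpu_count=8, days=2.6)
print(f"minimum cost: ${minimum_cost_usd(run):,.0f}")
```

A GPU type missing from the table raises a `KeyError`, which seems preferable to silently guessing a price.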