Comment by llm_trw
>The estimated training time for the end-to-end model on an 8×H100 machine is 2.6 days.
That's a ~$250,000 machine for the "micro budget". Or, if you don't want to run it locally, roughly $2,000 to train the one model on someone else's hardware.
You could do it on a single GPU, but you'd need gradient accumulation, and training would probably take 1–2 months on a consumer GPU.
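For anyone unfamiliar with the trick: gradient accumulation just means summing (appropriately scaled) gradients over several small micro-batches before doing one optimizer step, so a single GPU can emulate the effective batch size of a multi-GPU machine at the cost of wall-clock time. A minimal pure-Python sketch with a toy 1-D linear model (not the actual training code, just the arithmetic of why it works):

```python
# Toy model y = w * x with squared-error loss 0.5 * (w*x - y)^2.
# Shows that accumulating scaled micro-batch gradients reproduces
# the full-batch gradient, so one update is identical either way.

def grad(w, xs, ys):
    """Mean gradient of the loss w.r.t. w over a batch."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch):
    """Accumulate over micro-batches; scale each by its share of the batch."""
    total, n = 0.0, len(xs)
    for i in range(0, n, micro_batch):
        xb, yb = xs[i:i + micro_batch], ys[i:i + micro_batch]
        total += grad(w, xb, yb) * len(xb) / n
    return total

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5
print(abs(grad(w, xs, ys) - accumulated_grad(w, xs, ys, micro_batch=2)) < 1e-12)
```

The catch is that you pay for the big batch in sequential steps instead of parallel devices, which is where the 1–2 month estimate comes from.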