Comment by yk 21 hours ago

Tried Flux.dev with the same prompts [0] and it actually seems to be a GPT problem. It could be that GPT's text encoder understands the prompt better and just generates the implied IP, or it could be that a diffusion model is inherently less prone to overfitting than a multimodal transformer model.

[0] https://imgur.com/a/wqrBGRF Image captions are the implied IP; I copied the prompts from the blog post.

jsemrau 21 hours ago

DALL-E 3 already uses a model trained on synthetic data that takes the prompt and augments it. This might lead to the overfitting. It could also be, and this might be the simpler explanation, that it just looks up the right file via RAG.