doctorpangloss 2 days ago

Reference implementations are unmaintained and buggy.

For example, OpenAI's tokenizer for CLIP is buggy (https://github.com/huggingface/transformers/issues/27961): it's a reference implementation, it isn't the one they used for training, and its problems go unsolved and get copied endlessly by other projects.

What about Flux? They don't claim the reference implementation was used for training, and it wasn't. It has bugs that break cudagraphs and the like, which aren't that impactful. On the other hand, it uses the CLIP reference, and the CLIP reference is buggy, so this is buggy too...

liuliu 2 days ago

Congrats on finding a bug!

However, the keyword here is training/inference divergence. Unfortunately, nobody is going to spend multiple millions to retrain a model, so our reimplementation needs to be bug-to-bug compatible to use the trained weights properly. That's why the reference implementations are essential: they come from the original model trainers, so they are your best bet for matching the training code.

To give some concrete examples of bugs we need to maintain:

1. In SDXL, they use OpenClipG for text encoding, but wrongly use 0 as the padding token (corresponding to the symbol "!"), whereas for OpenClipG's own training, the endoftext token was used as the padding token. However, if you switch SDXL to use the endoftext token as the padding token, due to training/inference divergence, you get subpar generated images (see the first sketch after this list).

2. In FLUX, we mainly use T5 as the text encoder. T5 is usually used as an encoder with an attention mask matching the input length exactly, to avoid extended influence from padding tokens. In FLUX, no mask is applied for T5 text encoding, so padding tokens intuitively take more effect than they should. Again, "fixing" this bug without retraining gets you subpar generated images (second sketch below).
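To make the first divergence concrete, here is a minimal sketch of the two padding schemes, using transformers' CLIPTokenizer as a stand-in for the OpenClipG tokenizer (the checkpoint name and 77-token context length are assumptions for illustration, not SDXL's exact pipeline):

```python
# Sketch: SDXL-style padding (token 0, i.e. "!") vs. OpenClipG-training-style
# padding (the endoftext token).
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
max_len = 77  # CLIP's usual context length

ids = tok("a photo of a cat").input_ids  # includes startoftext/endoftext
pad = max_len - len(ids)

sdxl_style  = ids + [0] * pad                  # token 0 decodes to "!"
clipg_style = ids + [tok.eos_token_id] * pad   # pad with endoftext instead

# The trained SDXL weights expect the first layout; feeding the "fixed"
# second layout diverges from training and degrades the generated images.
```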
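And a sketch of the second one: the textbook T5 usage passes an attention mask so padding positions are ignored, while the FLUX-style call omits it. The checkpoint name and max length here are assumptions for a runnable example, not FLUX's actual configuration:

```python
# Sketch: T5 encoding with vs. without an attention mask.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tok = AutoTokenizer.from_pretrained("google/t5-v1_1-small")  # small stand-in
enc = T5EncoderModel.from_pretrained("google/t5-v1_1-small").eval()

batch = tok("a photo of a cat", padding="max_length", max_length=128,
            return_tensors="pt")

with torch.no_grad():
    # Textbook usage: padding tokens are masked out of attention.
    masked = enc(input_ids=batch.input_ids,
                 attention_mask=batch.attention_mask).last_hidden_state
    # FLUX-style usage: no mask, so padding tokens influence every position.
    unmasked = enc(input_ids=batch.input_ids).last_hidden_state

print((masked - unmasked).abs().max())  # nonzero: the encodings differ
```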

There are many examples like this; some are easier to fix than others (HiDream uses an ODE solver different from what we usually use for rectified flow, so you need to negate its prediction to be compatible with existing samplers, but this one is "easier to fix"; see the sketch below).
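A minimal sketch of that sign flip, assuming a hypothetical model whose prediction uses the opposite sign convention from standard rectified-flow samplers (`model` and its call signature here are placeholders, not HiDream's actual API):

```python
import torch

def adapt_prediction(model, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # Negate the model's output so existing rectified-flow samplers,
    # which expect the usual sign convention, can consume it unchanged.
    return -model(x_t, t)
```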

TL;DR: Yes, there are bugs in software, but we are better off maintaining bug-to-bug compatibility than trying to "fix" them. That highlights the importance of a "done" reference implementation, rather than the usual "active" implementation in the software industry.

(I maintain the most complete reimplementation of SoTA media generation models in Swift: https://github.com/drawthingsai/draw-things-community/tree/m.... So I tend to think I know a thing or two about "reimplementation from scratch".)

  • doctorpangloss 2 days ago

    I think if you read the issue carefully, you would understand that the CLIP implementation in transformers, as published by OpenAI, is wrong and does not match their trained model code; and that the fix I suggest improves results, both empirically for me and in theory.

42lux 2 days ago

You can disable CLIP-L on Flux without a loss in quality. You are also making a mountain out of a molehill. CLIP is used everywhere.

  • doctorpangloss 2 days ago

    Consider another interpretation: CLIP L in Flux can be disabled without a loss in quality because the way it is used is buggy!

    • 42lux a day ago

      oh lord.

      • doctorpangloss a day ago

        The truth is that the CLIP conditioning in Flux works well for Dreambooth-style fine-tuning, where tokenization bugs can be acute, but those bugs are not severe enough to explain the low impact of CLIP on their dev model. It is likely more impactful on their pro/max models, but only BFL could say.

electroglyph 2 days ago

It shouldn't take a lot of effort to fix a tokenizer...

  • doctorpangloss 2 days ago

    People are a little too blinded by the insight porn of matching buggy behavior to read and comprehend the issue. They can't engage with the simpler, even more pornographic insight: the reference implementations are buggy and do not match the trained artifacts.