okay well, there are a few things that are known to be true:

(1) clip's tokenizer is buggy in diffusers, in BFL's reference repo, and in openai's repo.
(2) many clip prompts are observed to have low impact in the flux dev and schnell models.

and a few things that are very likely true:

(1) the tokenizer in BFL's reference repo and in openai's repo does not match the tokenizer actually used to train openai's clip, or the one used for text conditioning in any of the flux checkpoints.
(2) guidance and timestep distillation play a role in weakening clip's influence.
(3) it is practical to fine-tune clip on more image-caption pairs.

if you care about fine-tuning, the tokenization bugs matter. everything else is hard to prove.
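if you want a quick way to see whether the tokenizer implementations you're using even agree with each other, here's a minimal sketch comparing huggingface's slow and fast clip tokenizers (assuming the openai/clip-vit-large-patch14 checkpoint; the prompts are arbitrary examples i picked, not a known-failing corpus):

```python
# Minimal sketch: check whether two CLIP tokenizer implementations agree.
# Uses the HF "openai/clip-vit-large-patch14" checkpoint (an assumption);
# the prompts below are arbitrary examples, not a known-failing set.
from transformers import CLIPTokenizer, CLIPTokenizerFast

slow = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-large-patch14")

prompts = [
    "a photo of a cat",
    "weird   spacing\tand\nnewlines",
    "unicode: naïve café, emoji 🐈",
]

for p in prompts:
    a = slow(p).input_ids
    b = fast(p).input_ids
    if a != b:
        # Any mismatch here means the two implementations disagree on
        # this input, which matters if you fine-tune against one of them.
        print(f"mismatch on {p!r}:\n  slow: {a}\n  fast: {b}")
```

the same comparison pattern works against BFL's reference tokenizer or openai's SimpleTokenizer if you wrap them to emit token ids; any divergence means your fine-tuning text pipeline may not match what the text encoder saw during training.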