Comment by washadjeffmad

Didn't mean to wind you up! Totally wasn't my intent. Looking over your site, I feel like my point is pretty strongly reflected by your work, though.

While models have been trained to deliver high-level impressions (with increasing attention to detailed problem domains), one-shot control is still relatively poor, and they lack the fundamental skills of a trained artist. There are chasms between what you think you're prompting, what the text encoder understands, and how the model interprets that input. The result is like a professional musician intentionally playing badly... hands not excepted.

For instance, in "Mermaid Disciplinary Committee" on your site, every hand has a deformity or a finger-count inconsistency. In "Spheron", the hands have no variation and suffer from cross-subject cloning (even 4o - look at the shield-carrier).

That's what I meant about creativity and being specific. Try prompting for three people holding up certain fingers on one or both hands. Start with the index, progress to pinky. Ask it to show you a hand gripping things, rotated in different orientations. Prompt for a hand with 3 fingers, then 6 fingers, then no fingers. Ask for gang signs or shadow puppets, pinching something, with fewer or extra digits. The illusion breaks down quickly.
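
If you want to run that systematically rather than ad hoc, here's a rough sketch of how I'd enumerate the prompts. It only builds the text; the object and orientation lists and all the function names are placeholders I made up for illustration, and it deliberately doesn't call any particular model or API.

```python
from itertools import product

# All lists below are illustrative assumptions, not a fixed benchmark.
FINGERS = ["index", "middle", "ring", "pinky"]        # index first, pinky last
HANDS = ["the left hand", "the right hand", "both hands"]
FINGER_COUNTS = [3, 6, 0]                             # fewer, extra, none
OBJECTS = ["coffee mug", "hammer"]                    # hypothetical grip targets
ORIENTATIONS = [
    "palm toward the viewer",
    "back of the hand toward the viewer",
    "seen from the side",
]


def raised_finger_prompts():
    """Three people holding up one named finger on one or both hands."""
    return [
        f"three people, each holding up only the {finger} finger on {hand}"
        for finger, hand in product(FINGERS, HANDS)
    ]


def finger_count_prompts():
    """Hands with an unusual number of fingers."""
    return [
        "a hand with no fingers" if n == 0 else f"a hand with {n} fingers"
        for n in FINGER_COUNTS
    ]


def grip_prompts():
    """A hand gripping an object, in several orientations."""
    return [
        f"a hand gripping a {obj}, {orientation}"
        for obj, orientation in product(OBJECTS, ORIENTATIONS)
    ]


def build_prompts():
    """Concatenate every test prompt into one list."""
    return raised_finger_prompts() + finger_count_prompts() + grip_prompts()


if __name__ == "__main__":
    for prompt in build_prompts():
        print(prompt)
```

Feed each line to whatever model you're testing and check the results by eye. The point of the grid is that you know exactly what was asked for, so a wrong finger or a cloned hand can't be written off as artistic license.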

This is a space I'm working in, retraining text encoders and diffusion models to understand the same things first-year arts students learn. With how limited and poisoned most models are, it's been a huge effort.