Comment by yegle

Comment by yegle a day ago

Looking at text to video examples (https://starflow-v.github.io/#text-to-video) I'm not impressed. Those gave me the feeling of the early Will Smith noodles videos.

Did I miss anything?

M4v3R a day ago

These are ~2 years behind state of the art from the looks of it. Still cool that they're releasing anything that's open for researchers to play with, but it's nothing groundbreaking.

Reply View 6 replies

tomthe a day ago

No, it is not as good as Veo, but better than Grok, I would say. Definitely better than what was available 2 years ago. And it is only a 7B research model!

Reply View | 0 replies
Mashimo a day ago

But 7b is rather small no? Are other open weight video models also this small? Can this run on a single consumer card?

Reply View | 3 replies
- dragonwriter a day ago
  
  > But 7b is rather small no?
  Sure, its smallish.
  > Are other open weight video models also this small?
  Apples models are weights-available not open weights, and yes, WAN 2.1, as well as the 14B models, also has 1.3B models; WAN 2.2, as well as the 14B models, also has a 5B model (the WAN 2.2 VAE used by Starflow-V is specifically the one used with the 5B model.) and because the WAN models are largely actually open weights models (Apache 2.0 licensed) there are lots of downstream open-licensed derivatives.
  > Can this run on a single consumer card?
  Modern model runtimes like ComfyUI can run models that do not fit in VRAM on a single consumer card by swapping model layers between RAM and VRAM as needed; models bigger than this can run on single consumer cards.
  
  Reply View | 0 replies
- Maxious a day ago
  
  Wan 2.2: "This generation was run on an RTX 3060 (12 GB VRAM) and took 900 seconds to complete at 840 × 420 resolution, producing 81 frames." https://www.nextdiffusion.ai/tutorials/how-to-run-wan22-imag...
  
  Reply View | 0 replies
- jjfoooo4 19 hours ago
  
  My guess is that they will lean towards smaller models, and try to provide the best experience for running inference on device
  
  Reply View | 0 replies
tdesilva 20 hours ago

The interesting part is they chose to go with a normalizing flow approach, rather than the industry standard diffusion model approach. Not sure why they chose this direction as I haven’t read the paper yet.

Reply View | 0 replies

jfoster a day ago

I think you need to go back and rewatch Will Smith eating spaghetti. These examples are far from perfect and probably not the best model right now, but they're far better than you're giving credit for.

As far as I know, this might be the most advanced text-to-video model that has been released? I'm not sure whether the license will qualify as open enough in everyone's eyes, though.

Reply View 0 replies

manmal a day ago

I wanted to write exactly the same thing, this reminded me of the Will Smith noodles. The juice glass keeps filling up after the liquid stopped pouring in.

Reply View 0 replies