devinprater a day ago

Apple has a video understanding model too. I can't wait to find out what accessibility stuff they'll do with the models. As a blind person, AI has changed my life.

  • densh a day ago

    > As a blind person, AI has changed my life.

    Something one doesn't see in news headlines. Happy to see this comment.

    • kkylin a day ago

      Like many others, I too would very much like to hear about this.

      I taught our entry-level calculus course a few years ago and had two blind students in the class. The technology available for supporting them was abysmal then -- the toolchain for typesetting math for screen readers was unreliable (and very slow anyway), the one for braille was non-existent, and translating figures into braille meant sending material out to a vendor and waiting weeks. I would love to hear how we might better support our students in subjects like math, chemistry, physics, etc., that depend so much on visualization.

    • K0balt 2 hours ago

      I must be wrong, but can’t help but harbor a mild suspicion that your use of sight metaphors is not coincidental.

    • tippa123 a day ago

      +1 and I would be curious to read and learn more about it.

      • joedevon a day ago

        If you want to see more on this topic, check out (google) the podcast I co-host called Accessibility and Gen. AI.

      • chrisweekly a day ago

        Same! @devinprater, have you written about your experiences? You have an eager audience...

        • devinprater 4 hours ago

          I suppose I should write about them. A good few will be about issues with the mobile apps and websites for AI, like Claude not even letting me know a response is available to read, let alone sending it to the screen reader to be read. It's a mess, but if we blind people want it, we have to push through inaccessibility to get it.

    • badmonster a day ago

      What other accessibility features do you wish existed in video AI models? Real-time vs post-processing?

      • devinprater 16 hours ago

        Mainly realtime processing. I play video games, and would love to play something like Legend of Zelda and just have the AI going, then ask it to "read the menu options as I move between them," and have it speak each menu option as the cursor moves to it. Or when navigating a 3D environment, ask it to describe the surroundings, then ask it to tell me how to get to a place or object and have it guide me there. That could be useful in real-world scenarios too.

    • Rover222 19 hours ago

      `Something one doesn't see` - no pun intended

    • WarcrimeActual 18 hours ago

      I have to believe you used the word see twice ironically.

    • fguerraz a day ago

      > Something one doesn't see in news headlines.

      I hope this wasn't a terrible pun

      • densh a day ago

        No pun intended but it's indeed an unfortunate choice of words on my part.

        • 47282847 a day ago

          My blind friends have gotten used to it and hear/receive it not as a literal “see” any more. They would not feel offended by your usage.

  • GeekyBear a day ago

    One cool feature they added for deaf parents a few years ago was a notification when it detects a baby crying.

    • SatvikBeri 21 hours ago

      My wife is deaf, and we had one kid in 2023 and twins in 2025. There's been a noticeable improvement in baby cry detection! In 2023, the best we could find was a specialized device that cost over $1,000 and had all sorts of flakiness/issues. Today, the built-in detection on her (Android) phone + watch is better than that device, and a lot more convenient.

    • Damogran6 21 hours ago

      I also got a notification on my Apple Watch, while away from the house, that the HomePod mini heard our fire alarm going off.

      A call home let us know that our son had set it off learning to reverse-sear his steak.

      • kstrauser 19 hours ago

        I live across the street from a fire station. Thank you for your diligence, little HomePod mini, but I'm turning your notifications off now.

      • brandonb 20 hours ago

        If the fire alarm didn't go off, you didn't sear hard enough. :)

    • embedding-shape a day ago

      Is that something you actually need AI for, though? A device with a sound sensor that shines a light or vibrates a remote device when it detects sound above some threshold would be cheaper, faster to detect, more reliable, easier to maintain, and more.

      • evilduck a day ago

        But your solution costs money in addition to the phone they already own for other purposes. And multiple things can make loud noises in your environment besides babies; differentiating between a police siren going by outside and your baby crying is useful, especially if the baby slept through the siren.

        The same arguments were made about blind people and the multitude of one-off devices that smartphones replaced: OCR to TTS, color detection, object detection in photos/camera feeds, detecting what denomination US bills are, analyzing what's on screen semantically vs. what was provided as accessible text (if any was at all), etc. Sure, services for the blind would come by and help arrange outfits for people, and audiobook narrators or braille translation services existed, and standalone devices to detect money denominations were sold, but a phone can just do all of that now for much cheaper.

        All of these accessibility AI/ML features run on-device, so the knee-jerk anti-AI crowd's chief complaints are mostly baseless anyways. And for the blind and the deaf, carrying all the potential extra devices with you everywhere is burdensome. The smartphone is a minimal and common social and physical burden.

      • Aurornis 19 hours ago

        > more reliable

        I've worked on some audio/video alert systems. Basic threshold detectors produce a lot of false positives. It's common for parents to put white noise machines in the room to help the baby sleep. When you have a noise generating machine in the same room, you need more sophisticated detection.

        False positives are the fastest way to frustrate users.
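
        To make the failure mode concrete, here's a rough sketch of a naive RMS threshold detector (not any real product's code, and the numbers are assumptions): constant white noise keeps the level above the threshold, so the detector either fires constantly or has to be set so high that it misses real cries.

            import numpy as np

            SAMPLE_RATE = 16_000            # assumed sample rate
            FRAME_LEN = SAMPLE_RATE // 10   # 100 ms analysis frames
            THRESHOLD = 0.05                # RMS level that counts as "loud"

            def loud_frames(audio: np.ndarray) -> list[int]:
                """Indices of 100 ms frames whose RMS exceeds the threshold."""
                hits = []
                for i in range(0, len(audio) - FRAME_LEN, FRAME_LEN):
                    frame = audio[i:i + FRAME_LEN]
                    if np.sqrt(np.mean(frame ** 2)) > THRESHOLD:
                        hits.append(i // FRAME_LEN)
                return hits

            # Five seconds of constant white noise (no baby at all) already
            # exceeds the threshold in essentially every frame.
            white_noise = 0.08 * np.random.randn(SAMPLE_RATE * 5)
            print(len(loud_frames(white_noise)), "frames flagged as loud")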

      • jfindper a day ago

        >Is that something you actually need AI for though?

        Need? Probably not. I bet it helps though (false positives, etc.)

        >would be cheaper, faster detection, more reliable, easier to maintain, and more.

        Cheaper than the phone I already own? Easier to maintain than the phone that I don't need to do maintenance on?

        From a fun hacking perspective, a different sensor & device is cool. But I don't think it's any of the things you mentioned for the majority of people.

      • doug_durham 20 hours ago

        You are talking about a device of smartphone complexity. You need enough compute power to run a model that can distinguish noises. You need a TCP/IP stack and a wireless radio to communicate the information. At that point you have a smartphone. A simple sound-threshold device would have too many false positives/negatives to be useful.

  • whatsupdog 21 hours ago

    > As a blind person, AI has changed my life.

    I know this is a low quality comment, but I'm genuinely happy for you.

  • phyzix5761 a day ago

    Can you share some ways AI has changed your life?

    • darkwater a day ago

      I guess that auto-generated audio descriptions for (almost?) any video you want is a very, very nice feature for a blind person.

      • tippa123 a day ago

        My two cents, this seems like a case where it’s better to wait for the person’s response instead of guessing.

      • baq a day ago

        Guessing that being able to hear a description of what the camera is seeing (basically a special case of a video) in any circumstances is indeed life changing if you're blind...? Take a picture through the window and ask what the commotion is. A door that's normally open is closed: take a picture and ask if there's a sign on it. Etc.

    • gostsamo a day ago

      Not the GP, but I'm currently reading a web novel with a card game where the author didn't include alt text in the card images. I contacted them about it and they started adding it, but in the meantime AI was a big help. All kinds of other images on the internet as well, when they are significant to understanding the surrounding text. Better search experience when Google, DDG, and the like make finding answers difficult. I might use smart glasses for better outdoor orientation, though a good solution might take some time. Phone camera plus AI is also situationally useful.

      • dzhiurgis a day ago

        As a (web app) developer I'm never quite sure what to put in alt text. Figured you might have some advice here?

    • devinprater 16 hours ago

      Image descriptions. TalkBack on Android has it built in and uses Gemini. VoiceOver still uses some older, less accurate, and far less descriptive ML model, but we can share images to Seeing AI or Be My Eyes and such and get a description.

      Video descriptions, through PiccyBot, have made watching more visual videos, or videos where things happen that don't make sense without visuals, much easier. Of course, it'd be much better if YouTube incorporated audio description through AI the same way they do captions, but that may happen in a good 2 years or so. I'm not holding my breath. It's hard to get more than the bare minimum of accessibility out of Google as a whole.

      Looking up information like restaurant menus. Yes it can make things up, but worst-case, the waiter says they don't have that.

  • javcasas a day ago

    Finally good news about the AI doing something good for the people.

    • Workaccount2 a day ago

      People need to understand that a lot of the angst around AI comes from AI enabling people to do things they formerly needed to go through gatekeepers for. The angst is coming from the gatekeepers.

      AI has been a boon for me and my non-tech job. I can pump out bespoke apps all day without having to get bent on $5000/yr/usr engineering software packages. I have a website for my side business that looks and functions professionally and was done with a $20 monthly AI subscription instead of a $2000 contractor.

      • BeFlatXIII 15 hours ago

        AI is divine retribution for artists being really annoying on Twitter.

      • MyFirstSass 21 hours ago

        I highly doubt "pumping out bespoke apps all day" is possible yet beyond 100% boilerplate, and when it is possible it's no good for any purpose other than enshittifying the web, and at that point it's not profitable because everyone can do it.

        I use AI daily as a senior coder for search and docs, and when used for prototyping you still need to be a senior coder to go from say 60% boilerplate to 100% finished app/site/whatever unless it's incredibly simple.

  • robbomacrae 18 hours ago

    Hi Devin and other folks, I'm looking for software developers who are blind or hard of sight as there is a tool I'm building that I think might be of interest to them (it's free and open source). If you or anyone you know is interested in trying it please get in touch through my email.

  • basilgohar 19 hours ago

    I'm only commenting because I absolutely love this thread. It's an insight into something I think most of us are quite (I'm going to say it...) blind to in our normal experiences with daily life, and I find immense value in removing my ignorance about such things.

  • andy_ppp 21 hours ago

    I wonder if there's anything that can help blind people to navigate the world more easily - I guess in the future AR Glasses won't just be for the sighted but allow people without vision to be helped considerably. It really is both amazing and terrifying the future we're heading towards.

RobotToaster a day ago

The license[0] seems quite restrictive, limiting its use to non-commercial research. It doesn't meet the open source definition, so it's more appropriate to call it weights-available.

[0]https://github.com/apple/ml-starflow/blob/main/LICENSE_MODEL

  • limagnolia 17 hours ago

    They haven't even released the weights yet...

    As for the license, happily, Model Weights are the product of machine output and not creative works, so not copyrightable under US law. Might depend on where you are from, but I would have no problem using Model Weights however I want to and ignoring pointless licenses.

    • pabs3 4 hours ago

      The output of a compiler is copyrightable, why aren't models similarly copyrightable?

    • loufe 16 hours ago

      The weights for the text-->image model are already on Huggingface, FWIW.

yegle a day ago

Looking at text to video examples (https://starflow-v.github.io/#text-to-video) I'm not impressed. Those gave me the feeling of the early Will Smith noodles videos.

Did I miss anything?

  • M4v3R a day ago

    These are ~2 years behind state of the art from the looks of it. Still cool that they're releasing anything that's open for researchers to play with, but it's nothing groundbreaking.

    • tomthe a day ago

      No, it is not as good as Veo, but better than Grok, I would say. Definitely better than what was available 2 years ago. And it is only a 7B research model!

    • Mashimo a day ago

      But 7b is rather small no? Are other open weight video models also this small? Can this run on a single consumer card?

      • dragonwriter a day ago

        > But 7b is rather small no?

        Sure, it's smallish.

        > Are other open weight video models also this small?

        Apple's models are weights-available, not open weights. But yes: WAN 2.1 has a 1.3B model as well as the 14B models, and WAN 2.2 has a 5B model as well as the 14B models (the WAN 2.2 VAE used by Starflow-V is specifically the one used with the 5B model). And because the WAN models are largely actually open-weights models (Apache 2.0 licensed), there are lots of downstream open-licensed derivatives.

        > Can this run on a single consumer card?

        Modern model runtimes like ComfyUI can run models that do not fit in VRAM on a single consumer card by swapping model layers between RAM and VRAM as needed; models bigger than this can run on single consumer cards.
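
        As a rough sketch of the idea (not ComfyUI's actual implementation, which is far smarter about caching and prefetching), sequential offloading looks something like this in PyTorch: only the layer currently executing has its weights in VRAM.

            import torch
            import torch.nn as nn

            # Toy "large" model whose weights live in CPU RAM.
            layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(32)]).eval()

            @torch.no_grad()
            def forward_offloaded(x: torch.Tensor, device: str = "cuda") -> torch.Tensor:
                x = x.to(device)
                for layer in layers:
                    layer.to(device)   # copy this layer's weights into VRAM
                    x = layer(x)
                    layer.to("cpu")    # evict it before loading the next layer
                return x.cpu()

            # Peak VRAM use is roughly one layer plus activations rather than the
            # whole model, at the cost of extra PCIe transfers on every step.
            out = forward_offloaded(torch.randn(1, 4096))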

      • jjfoooo4 19 hours ago

        My guess is that they will lean towards smaller models, and try to provide the best experience for running inference on device

    • tdesilva 19 hours ago

      The interesting part is they chose to go with a normalizing flow approach, rather than the industry standard diffusion model approach. Not sure why they chose this direction as I haven’t read the paper yet.
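
      For anyone unfamiliar, the core building block of many normalizing flows is an exactly invertible transform with a tractable log-determinant; here's a generic RealNVP-style affine coupling layer as a sketch (the general idea only, not STARFlow's actual architecture).

          import torch
          import torch.nn as nn

          class AffineCoupling(nn.Module):
              """Generic RealNVP-style coupling layer: invertible by construction."""
              def __init__(self, dim: int):
                  super().__init__()
                  # Predicts a per-dimension log-scale and shift for the second half.
                  self.net = nn.Sequential(nn.Linear(dim // 2, 128), nn.SiLU(),
                                           nn.Linear(128, dim))

              def forward(self, x):
                  x1, x2 = x.chunk(2, dim=-1)
                  log_s, t = self.net(x1).chunk(2, dim=-1)
                  y2 = x2 * torch.exp(log_s) + t
                  log_det = log_s.sum(dim=-1)   # change-of-variables term for the exact likelihood
                  return torch.cat([x1, y2], dim=-1), log_det

              def inverse(self, y):
                  y1, y2 = y.chunk(2, dim=-1)
                  log_s, t = self.net(y1).chunk(2, dim=-1)
                  x2 = (y2 - t) * torch.exp(-log_s)
                  return torch.cat([y1, x2], dim=-1)

      Stacking layers like this gives a model you can sample from in one invertible pass and evaluate exact log-likelihoods with, which is the usual selling point over iterative diffusion denoising.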

  • jfoster a day ago

    I think you need to go back and rewatch Will Smith eating spaghetti. These examples are far from perfect and probably not the best model right now, but they're far better than you're giving credit for.

    As far as I know, this might be the most advanced text-to-video model that has been released? I'm not sure whether the license will qualify as open enough in everyone's eyes, though.

  • manmal a day ago

    I wanted to write exactly the same thing, this reminded me of the Will Smith noodles. The juice glass keeps filling up after the liquid has stopped pouring in.

gorgoiler 21 hours ago

It’s not really relevant to this release specifically but it irks me that, in general, an “open weights model” is like an “open source machine code” version of Microsoft Windows. Yes, I guess I have open access to view the thing I am about to execute!

This Apple license is click-wrap MIT with, at least, the rights to modify and redistribute the model itself. I suppose I should be grateful for that much openness.

  • advisedwang 19 hours ago

    Great analogy.

    To extend the analogy, "closed source machine code" would be like conventional SaaS. There's an argument that shipping me a binary I can freely use is at least better than only providing SaaS.

  • satvikpendem 15 hours ago

    > Yes, I guess I have open access to view the thing I am about to execute!

    Better to execute locally than to execute remotely where you can't change or modify any part of the model though. Open weights at least mean you can retrain or distill it, which is not analogous to a compiled executable that you can't (generally) modify.

  • limagnolia 17 hours ago

    I think you are looking at the code license, not the model license.

    • Aloisius 15 hours ago

      No, it's the model license. There's a second license for the code.

      Of course, model weights almost certainly are not copyrightable so the license isn't enforceable anyway, at least in the US.

      The EU and the UK are a different matter since they have sui generis database rights which seemingly allows individuals to own /dev/random.

      • pabs3 4 hours ago

        The output of a compiler is copyrightable, why aren't models similarly copyrightable?

vessenes a day ago

From the paper, this is a research model aimed at dealing with the runaway error common in diffusion video models - the latent space is (proposed to be) causal and therefore it should have better coherence.

For a 7B model the results look pretty good! If Apple gets a model out here that is competitive with WAN or even Veo, I believe in my heart it will have been trained on images of the finest taste.

summerlight 17 hours ago

This looks interesting. The project has some novelty as research and actually delivered a promising PoC, but as a product it implies that its training was severely constrained by computing resources, which correlates well with the report that their CFO overruled the CEO's decision on ML infra investment.

JG's recent departure and the follow-up massive reorg to get rid of AI, rumors of Tim's upcoming step-down in early 2026... All of these signals indicate that the non-ML folks have won the corporate politics and reduced the in-house AI efforts.

I suppose this was part of a serious effort to deliver in-house models, but the directional changes on AI strategy made them give up. What a shame... At least the approach itself seems interesting, and I hope others take a look and use it to build something useful.

coolspot a day ago

> STARFlow-V is trained on 96 H100 GPUs using approximately 20 million videos.

They don’t say for how long.

  • moondev 19 hours ago

    Apple Intelligence: trained on Nvidia GPUs running Linux.

    Do the examples in the repo run inference on Mac?

dymk a day ago

Title is wrong, model isn’t released yet. Title also doesn’t appear in the link - why the editorializing?

satvikpendem a day ago

Looks good. I wonder what use case Apple has in mind though, or I suppose this is just what the researchers themselves were interested in, perhaps due to the current zeitgeist. I'm not really sure how research works at big tech companies; are there top-down mandates?

  • ozim a day ago

    I guess Apple is big in video production and animation, with some ties via Pixar and Disney, since Jobs started Pixar and it all got tied up in a myriad of different ways.

  • ivape a day ago

    To add things to videos you create with your phone. TikTok and Insta will probably add this soon, but I suppose Apple is trying to provide this feature on “some level”. That means you don’t have to send your video through a social media platform first to creatively edit it (the platforms being the few tools that let you do generative video).

    They should really buy Snapchat.

LoganDark a day ago

> Model Release Timeline: Pretrained checkpoints will be released soon. Please check back or watch this repository for updates.

> The checkpoint files are not included in this repository due to size constraints.

So it's not actually open weights yet. Maybe eventually once they actually release the weights it will be. "Soon"

nothrowaways a day ago

Where do they get the video training data?

  • postalcoder a day ago

    From the paper:

    > Datasets. We construct a diverse and high-quality collection of video datasets to train STARFlow-V. Specifically, we leverage the high-quality subset of Panda (Chen et al., 2024b) mixed with an in-house stock video dataset, with a total number of 70M text-video pairs.

giancarlostoro a day ago

I was upset the page didn't have videos immediately available, then I realized I have to click on some of the tabs. One red flag on their GitHub is that the license looks to be their own flavor of MIT (though much closer to MS-PL).

andersa a day ago

The number of video models that are worse than Wan 2.2 and can safely be ignored has increased by 1.

  • embedding-shape a day ago

    To be fair, the sizes aren't comparable, and for the variant that is comparable, the results aren't that much worse.

    • dragonwriter 20 hours ago

      The samples (and this may or may not be completely fair; either set could be more cherry-picked than the other, and it would be interesting to see a side-by-side comparison with comparable prompts) seem significantly worse than what I’ve seen from WAN 2.1 1.3B, which is both from the previous WAN version and is proportionally smaller, compared to Apple’s 7B, than that model is compared to the 28B combination of the high- and low-noise 14B WAN 2.2 models that are typically used together.

      But also, Starflow-V is a research model with a substandard text encoder, it doesn't have to be competitive as-is to be an interesting spur for further research on the new architecture it presents. (Though it would be nice if it had some aspect where it offered a clear improvement.)

  • wolttam 21 hours ago

    This doesn’t look like it was intended to compete. The research appears interesting.

cubefox 20 hours ago

Interesting that this is an autoregressive ("causal") model rather than a diffusion model.

camillomiller a day ago

Hopefully this will make it into some useful feature in the ecosystem and not contribute to having just more terrible slop. Apple has saved itself from the destruction of quality and taste that these models enabled; I hope it stays that way.

Invictus0 20 hours ago

Apple's got to stop running their AI group like a university lab. Get some actual products going that we can all use--you know, with a proper fucking web UI and a backend.

  • Jtsummers 18 hours ago

    Personally, I'm happy that Apple is spending the time and money on research. We have products that already do what this model does, the next step is to make it either more efficient or better (closer to the prompt, more realistic, higher quality output). That requires research, not more products.

mdrzn a day ago

"VAE: WAN2.2-VAE" so it's just a Wan2.2 edit, compressed to 7B.

  • kouteiheika a day ago

    This doesn't necessarily mean that it's Wan2.2. People often don't train their own VAEs and just reuse an existing one, because a VAE isn't really what's doing the image generation part.

    A little bit more background for those who don't know what a VAE is (I'm simplifying here, so bear with me): it's essentially a model which turns raw RGB images into something called a "latent space". You can think of it as a fancy "color" space, but on steroids.

    There are two main reasons for this: one is to make the model which does the actual useful work more computationally efficient. VAEs usually downscale the spatial dimensions of the images they ingest, so instead of having to process a 1024x1024 image, your model only needs to work on a 256x256 image. (However, they often increase the number of channels to compensate, but I digress.)

    The other reason is that, unlike raw RGB space, the latent space is actually a higher level representation of the image.

    Training a VAE isn't the most interesting part of image models, and while it is tricky, it's done entirely in an unsupervised manner. You give the VAE an RGB image, have it convert it to latent space, then have it convert that back to RGB; you take a diff between the input RGB image and the output RGB image, and that's the signal you use when training (in reality it's a little more complex, but, again, I'm simplifying to keep the explanation clear). So it makes sense to reuse them and concentrate on the actually interesting parts of an image generation model.
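
    To make that loop concrete, here's a toy sketch in PyTorch: a plain convolutional autoencoder with just a reconstruction loss. Real video VAEs like WAN's add a KL term plus perceptual and adversarial losses, so treat this as the general shape of the training, not the actual recipe.

        import torch
        import torch.nn as nn

        # Encoder: RGB -> 16-channel latents at 1/4 the spatial resolution.
        encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 16, kernel_size=3, stride=2, padding=1),
        )
        # Decoder: mirror image, back up to RGB.
        decoder = nn.Sequential(
            nn.ConvTranspose2d(16, 32, kernel_size=4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),
        )
        opt = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

        for step in range(100):
            rgb = torch.rand(8, 3, 64, 64)              # stand-in for a batch of images
            latents = encoder(rgb)                      # 8 x 16 x 16 x 16
            recon = decoder(latents)                    # back to 8 x 3 x 64 x 64
            loss = nn.functional.mse_loss(recon, rgb)   # the "diff" used as the training signal
            opt.zero_grad()
            loss.backward()
            opt.step()

    The downstream generative model is then trained on those latents, which is why swapping in someone else's pretrained VAE (like WAN's) is so common.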

    • sroussey 19 hours ago

      Since you seem to know way more than I on the subject, can you explain the importance of video generation that is not diffusion based?

  • dragonwriter a day ago

    > "VAE: WAN2.2-VAE" so it's just a Wan2.2 edit

    No, using the WAN 2.2 VAE does not mean it is a WAN 2.2 edit.

    > compressed to 7B.

    No, if it were an edit of the WAN model that uses the 2.2 VAE, it would be expanded to 7B, not compressed (the 14B WAN 2.2 models use the WAN 2.1 VAE; the WAN 2.2 VAE is used by the 5B WAN 2.2 model).

  • BoredPositron a day ago

    They used the VAE of WAN, like many other models do. For image models you see a lot of them using the Flux VAE. Which is perfectly fine: they are released as Apache 2.0 and save you time to focus on your transformer architecture...