Comment by gundmc
Well, their huge GPU clusters have "insane VRAM". Once you can actually fit the model in VRAM without offloading to CPU or disk, inference isn't all that computationally expensive; at typical batch sizes it's bound more by memory bandwidth than by raw compute.
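To make the VRAM point concrete, here's a rough back-of-the-envelope sketch in Python (the parameter counts are illustrative, and it ignores KV cache and activation overhead, so real requirements are somewhat higher):

```python
def weights_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

# fp16/bf16 = 2 bytes per parameter, int4 quantization = 0.5
for name, n_b in [("7B", 7.0), ("70B", 70.0)]:
    print(f"{name}: ~{weights_gb(n_b, 2):.0f} GB at fp16, "
          f"~{weights_gb(n_b, 0.5):.0f} GB at int4")
```

A 70B model at fp16 needs ~140 GB for the weights alone, which is hopeless on a 24 GB consumer card without offloading but trivial to shard across a few datacenter GPUs with 80 GB each.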