Comment by gundmc
Well, their huge GPU clusters have "insane VRAM". Once you can actually fit the model in VRAM without offloading to CPU or disk, inference isn't all that computationally expensive; at typical batch sizes it's bound more by memory bandwidth than by raw compute.
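To make the VRAM point concrete, here's a rough back-of-the-envelope sketch in Python (the parameter counts are illustrative, and it ignores KV cache and activation overhead, so real requirements are somewhat higher):

```python
def weights_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

# fp16/bf16 = 2 bytes per parameter, int4 quantization = 0.5
for name, n_b in [("7B", 7.0), ("70B", 70.0)]:
    print(f"{name}: ~{weights_gb(n_b, 2):.0f} GB at fp16, "
          f"~{weights_gb(n_b, 0.5):.0f} GB at int4")
```

A 70B model at fp16 needs ~140 GB for the weights alone, which is hopeless on a 24 GB consumer card without offloading but trivial to shard across a few datacenter GPUs with 80 GB each.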