Comment by ActivePattern 4 days ago
I don't think you've understood the paper.
- There are no experts. The K outputs approximate random samples from the target distribution.
- There is no latent diffusion going on. It uses convolutions, similar to a GAN.
- At inference time, you select the sample index ahead of time, so no computation is discarded.
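To illustrate the last point, here is a minimal PyTorch-style sketch (my own hypothetical code, assuming a model with a shared `stem` and K cheap `heads`; not the paper's implementation): the sample index is drawn before the forward pass, so only the chosen head ever runs and nothing is computed and then thrown away.

```python
import torch

@torch.no_grad()
def sample(model, x: torch.Tensor, k: int) -> torch.Tensor:
    # Hypothetical model with a shared `stem` and K lightweight `heads`.
    idx = torch.randint(k, (1,)).item()  # index chosen ahead of time
    feat = model.stem(x)                 # shared stem runs once
    return model.heads[idx](feat)        # only the selected head runs
```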
I agree with @ActivePattern; thank you for helping to answer this.
Supplement for @f_devd:
During training, the K outputs share the stem features produced by the NN blocks, so generating all K outputs adds only a small amount of extra computation. After picking the nearest output by L2 distance, discarding the other K-1 outputs therefore incurs a negligible cost; it is not comparable to discarding K-1 MoE experts, which would be very expensive.
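As a rough sketch of that cost argument (hypothetical PyTorch code, assuming a convolutional stem and K lightweight 1x1 heads; not the paper's exact architecture): the stem is computed once, the K heads add little on top, and training backpropagates only through the output nearest to the target in L2 distance.

```python
import torch
import torch.nn as nn

class MultiOutputBlock(nn.Module):
    """Shared convolutional stem feeding K cheap output heads."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 8):
        super().__init__()
        self.stem = nn.Sequential(           # the bulk of the compute
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.heads = nn.ModuleList(          # K cheap 1x1 convolutions
            [nn.Conv2d(64, out_ch, 1) for _ in range(k)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.stem(x)  # computed once, shared by all K heads
        return torch.stack([h(feat) for h in self.heads], dim=1)  # (B, K, C, H, W)

def training_step(block: MultiOutputBlock, x, target):
    outs = block(x)                                   # (B, K, C, H, W)
    with torch.no_grad():                             # selection itself is not differentiated
        dists = ((outs - target.unsqueeze(1)) ** 2).flatten(2).sum(-1)  # (B, K)
        idx = dists.argmin(dim=1)                     # nearest output per sample
    chosen = outs[torch.arange(x.size(0)), idx]       # keep only the winner
    return ((chosen - target) ** 2).mean()            # L2 loss; the K-1 losers were nearly free
```

By contrast, in an MoE each expert is a full subnetwork, so computing K experts and keeping one would multiply the cost roughly K-fold; here the discarded work is just K-1 cheap heads.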