Comment by ActivePattern 4 days ago
I don't think you've understood the paper.
- There are no experts. The K outputs approximate random samples from the target distribution.
- There is no latent diffusion going on. It uses convolutions, similar to a GAN.
- At inference time, you select the sample index ahead of time, so no computation is discarded.
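To illustrate the last point, here is a minimal PyTorch-style sketch (my own hypothetical code, assuming a model with a shared `stem` and K cheap `heads`; not the paper's implementation): the sample index is drawn before the forward pass, so only the chosen head ever runs and nothing is computed and then thrown away.

```python
import torch

@torch.no_grad()
def sample(model, x: torch.Tensor, k: int) -> torch.Tensor:
    # Hypothetical model with a shared `stem` and K lightweight `heads`.
    idx = torch.randint(k, (1,)).item()  # index chosen ahead of time
    feat = model.stem(x)                 # shared stem runs once
    return model.heads[idx](feat)        # only the selected head runs
```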
I agree with @ActivePattern; thank you for helping to answer this.
Supplement for @f_devd:
During training, the K outputs share the stem features produced by the NN blocks, so generating all K outputs adds only a small amount of extra computation. After picking the nearest output by L2 distance, discarding the other K-1 outputs therefore incurs a negligible cost; it is not comparable to discarding K-1 MoE experts, which would be very expensive.
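As a rough sketch of that cost argument (hypothetical PyTorch code, assuming a convolutional stem and K lightweight 1x1 heads; not the paper's exact architecture): the stem is computed once, the K heads add little on top, and training backpropagates only through the output nearest to the target in L2 distance.

```python
import torch
import torch.nn as nn

class MultiOutputBlock(nn.Module):
    """Shared convolutional stem feeding K cheap output heads."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 8):
        super().__init__()
        self.stem = nn.Sequential(           # the bulk of the compute
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.heads = nn.ModuleList(          # K cheap 1x1 convolutions
            [nn.Conv2d(64, out_ch, 1) for _ in range(k)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.stem(x)  # computed once, shared by all K heads
        return torch.stack([h(feat) for h in self.heads], dim=1)  # (B, K, C, H, W)

def training_step(block: MultiOutputBlock, x, target):
    outs = block(x)                                   # (B, K, C, H, W)
    with torch.no_grad():                             # selection itself is not differentiated
        dists = ((outs - target.unsqueeze(1)) ** 2).flatten(2).sum(-1)  # (B, K)
        idx = dists.argmin(dim=1)                     # nearest output per sample
    chosen = outs[torch.arange(x.size(0)), idx]       # keep only the winner
    return ((chosen - target) ** 2).mean()            # L2 loss; the K-1 losers were nearly free
```

By contrast, in an MoE each expert is a full subnetwork, so computing K experts and keeping one would multiply the cost roughly K-fold; here the discarded work is just K-1 cheap heads.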