Comment by f_devd
Pretty interesting architecture, seems very easy to debug, but as a downside you effectively discard K-1 of the K computations at each layer, since it uses a sampler rather than a MoE-style router.
The best way I can summarize it is a Mixture-of-Experts combined with an 'x0-target' latent diffusion model. The main innovations are the guided sampler (rather than a router) and the split-and-prune optimizer, which make it easier to train.
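Roughly what I mean by "guided sampler", as a minimal PyTorch-style sketch (the names `heads` and `guided_sample_layer` are mine, not from the paper): at training time all K candidate outputs are computed, the one nearest the x0-style target is kept, and the other K-1 are thrown away.

```python
import torch

def guided_sample_layer(h, heads, target):
    """Hypothetical single DDN-style layer step during training.

    h:      current latent, shape (B, D)
    heads:  list of K candidate output heads (e.g. small MLPs)
    target: ground-truth sample the layer is guided toward, shape (B, D)
    """
    # Compute all K candidate outputs: shape (B, K, D).
    candidates = torch.stack([head(h) for head in heads], dim=1)
    # Guided "sampling": pick the candidate closest to the target.
    dists = ((candidates - target.unsqueeze(1)) ** 2).sum(-1)  # (B, K)
    idx = dists.argmin(dim=1)
    # Only the chosen candidate is kept; the other K-1 are discarded.
    chosen = candidates[torch.arange(h.size(0)), idx]
    return chosen, idx
```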
Since the sampling probability is 1/K independent of the input, you don't need to compute K different intermediate outputs at each layer during inference; you can instead decide ahead of time which output you want to use and compute only that one.
(This is mentioned in Q1 in the "Common Questions About DDN" section at the bottom.)
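To make that concrete, a sketch of the inference shortcut under the same made-up names as above: because each index is uniform over K and independent of the input, it can be drawn before the layer runs, so only one of the K heads is ever evaluated.

```python
import random

def generate(h, layers):
    """Hypothetical unconditional DDN inference.

    layers: list of layers, each a list of K candidate output heads.
    """
    for heads in layers:
        idx = random.randrange(len(heads))  # decide ahead of time, p = 1/K
        h = heads[idx](h)                   # compute only the chosen branch
    return h
```

So the K-1 wasted computations are a training-time cost only; at inference each layer does one head's worth of work.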