Comment by f_devd
Pretty interesting architecture, seems very easy to debug, but as a downside you effectively discard K-1 of the K computations at each layer, since it uses a sampler rather than a MoE-style router.
The best way I can summarize it is a Mixture-of-Experts combined with an 'x0-target' latent diffusion model. The main innovations are the guided sampler (rather than a router) and the split-and-prune optimizer, which make it easier to train.
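Roughly what I mean by "guided sampler", as a minimal PyTorch-style sketch (the names `heads` and `guided_sample_layer` are mine, not from the paper): at training time all K candidate outputs are computed, the one nearest the x0-style target is kept, and the other K-1 are thrown away.

```python
import torch

def guided_sample_layer(h, heads, target):
    """Hypothetical single DDN-style layer step during training.

    h:      current latent, shape (B, D)
    heads:  list of K candidate output heads (e.g. small MLPs)
    target: ground-truth sample the layer is guided toward, shape (B, D)
    """
    # Compute all K candidate outputs: shape (B, K, D).
    candidates = torch.stack([head(h) for head in heads], dim=1)
    # Guided "sampling": pick the candidate closest to the target.
    dists = ((candidates - target.unsqueeze(1)) ** 2).sum(-1)  # (B, K)
    idx = dists.argmin(dim=1)
    # Only the chosen candidate is kept; the other K-1 are discarded.
    chosen = candidates[torch.arange(h.size(0)), idx]
    return chosen, idx
```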
Since the sampling probability is 1/K independent of the input, you don't need to compute K different intermediate outputs at each layer during inference; you can instead decide ahead of time which output you want to use and compute only that one.
(This is mentioned in Q1 in the "Common Questions About DDN" section at the bottom.)
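To make that concrete, a sketch of the inference shortcut under the same made-up names as above: because each index is uniform over K and independent of the input, it can be drawn before the layer runs, so only one of the K heads is ever evaluated.

```python
import random

def generate(h, layers):
    """Hypothetical unconditional DDN inference.

    layers: list of layers, each a list of K candidate output heads.
    """
    for heads in layers:
        idx = random.randrange(len(heads))  # decide ahead of time, p = 1/K
        h = heads[idx](h)                   # compute only the chosen branch
    return h
```

So the K-1 wasted computations are a training-time cost only; at inference each layer does one head's worth of work.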