Comment by thethirdone 5 hours ago

In this paper, both the diffusion and the autoregressive models are transformers, so both pay the O(n^2) attention cost on long sequences. They share the "Exact KV Cache" for committed tokens.
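To make the caching point concrete, here is a toy numpy sketch (my own illustration, not the paper's code): each committed token's K/V rows are computed once and appended to a cache, and every subsequent step attends over the whole cache, which is where the O(n^2) total cost comes from.

```python
import numpy as np

d = 8  # toy head dimension
rng = np.random.default_rng(0)

k_cache = np.empty((0, d))  # keys of committed tokens
v_cache = np.empty((0, d))  # values of committed tokens

def decode_step(q, k_new, v_new):
    """Append the new token's K/V to the cache, then attend over it."""
    global k_cache, v_cache
    k_cache = np.vstack([k_cache, k_new])
    v_cache = np.vstack([v_cache, v_new])
    # Attention over ALL cached tokens: each step touches the full cache,
    # so n steps cost O(n^2) in total.
    scores = q @ k_cache.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache

for step in range(16):
    q, k_new, v_new = rng.standard_normal((3, d))
    out = decode_step(q, k_new, v_new)

print(k_cache.shape)  # one cached K row per committed token
```

Whether tokens get committed one at a time (AR) or several per pass (diffusion), the cache they land in is the same.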

Diffusion just lets you spend more compute per forward pass, so you don't redundantly stream the same weights from memory over and over. It can only push throughput past the memory-bandwidth limit by committing multiple tokens each pass.
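Rough back-of-the-envelope version of that argument (my own illustrative numbers, not from the paper): in the memory-bound regime each pass costs about (weight bytes / bandwidth) regardless of how many tokens it commits, so committing k tokens per pass multiplies throughput by roughly k.

```python
# Toy memory-bandwidth model (illustrative, assumed numbers).
weight_bytes = 14e9    # e.g. a 7B-parameter model at fp16
bandwidth = 1.0e12     # 1 TB/s of HBM bandwidth, assumed

# Memory-bound: every pass must stream all weights once,
# no matter how many tokens it commits.
pass_time = weight_bytes / bandwidth  # seconds per forward pass

ar_tokens_per_s = 1 / pass_time         # AR commits 1 token per pass
diff_tokens_per_s = 4 / pass_time       # diffusion committing 4 per pass

print(round(ar_tokens_per_s))    # ~71 tokens/s
print(round(diff_tokens_per_s))  # ~286 tokens/s
```

The 4x here is only realized if the model actually commits 4 tokens per pass without quality loss, which is the hard part.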

Other linear-time models like Mamba avoid the O(n^2) cost entirely, but the type of neural architecture is orthogonal to the method of generation.