Comment by gdiamos
Diffusion is favored by current GPUs.
Over time we seem to have a tendency to build models that are well matched to our machines.
Not really. The problem is that transformer LLMs are autoregressive, are O(n^2) for self-attention, and also require insane amounts of bandwidth to "page in" the weights into the relevant compute units. TPUs do this faster than a CPU, like any accelerator, but fundamentally this is a challenge. There are attempts to build hardware where the weights are burned into the silicon, but that carries other meaningful downsides.
But the OP is referring to the fact that diffusion is friendlier on bandwidth and doesn't need large n^2 compute blocks in the critical path.
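A rough way to see the bandwidth argument: with batch size 1, every autoregressively generated token has to stream the full set of weights through the compute units, so single-stream throughput is capped at memory bandwidth divided by model size. A minimal back-of-envelope sketch; the parameter count and bandwidth figures are made-up illustrative assumptions, not measurements:

```python
# Bandwidth ceiling on batch-1 autoregressive decoding.
# All numbers are illustrative assumptions.

PARAMS = 7e9            # assumed parameter count
BYTES_PER_PARAM = 2     # fp16/bf16 weights
HBM_BANDWIDTH = 1.0e12  # assumed ~1 TB/s of usable memory bandwidth

weight_bytes = PARAMS * BYTES_PER_PARAM

# Each new token re-reads every weight, so the best case is
# bandwidth / model size tokens per second.
tokens_per_sec_ar = HBM_BANDWIDTH / weight_bytes
print(f"bandwidth-bound ceiling: ~{tokens_per_sec_ar:.0f} tokens/s")  # ~71 tokens/s
```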
In this paper, both the diffusion and the autoregressive models are transformers with O(n^2) cost for long sequences. They share the "Exact KV Cache" for committed tokens.
Diffusion just lets you spend more compute in a single pass so you don't redundantly access the same memory. It can only improve speed beyond the memory-bandwidth limit by committing multiple tokens each pass.
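To make the "commit multiple tokens per pass" point concrete, here is the same back-of-envelope arithmetic extended to a pass that commits k tokens; the values of k and the hardware figures are assumptions for illustration, not numbers from the paper:

```python
# How committing k tokens per forward pass raises the bandwidth-limited ceiling.
# Same illustrative assumptions as above.

PARAMS = 7e9
BYTES_PER_PARAM = 2
HBM_BANDWIDTH = 1.0e12

weight_bytes = PARAMS * BYTES_PER_PARAM
passes_per_sec = HBM_BANDWIDTH / weight_bytes  # each pass still reads all weights once

for k in (1, 4, 8, 16):  # tokens committed per pass (k=1 is plain autoregressive)
    print(f"k={k:>2}: ceiling ~{k * passes_per_sec:.0f} tokens/s")

# The gain is only real if the model actually commits k tokens per pass;
# extra refinement passes that commit nothing eat back into the advantage.
```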
Other linear models like Mamba get away from O(n^2) effects, but the type of neural architecture is orthogonal to the method of generation.
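The "orthogonal" point shows up in per-step memory traffic: an attention step also has to read a KV cache that grows with context length, while a state-space step reads a fixed-size state, and neither fact depends on whether tokens are committed one at a time or in diffusion blocks. A rough sketch with illustrative shapes (the head and state sizes are assumptions):

```python
# Per-step memory read for one layer, ignoring the weights themselves.
# Shapes are illustrative assumptions.

N_HEADS, HEAD_DIM, BYTES = 32, 128, 2
STATE_BYTES = 4096 * BYTES  # assumed fixed-size recurrent state per layer

def attention_step_bytes(context_len: int) -> int:
    # Attention reads K and V for every cached token: grows linearly with context.
    return 2 * context_len * N_HEADS * HEAD_DIM * BYTES

def ssm_step_bytes(context_len: int) -> int:
    # A state-space / recurrent step reads a fixed-size state, independent of context.
    return STATE_BYTES

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: attention {attention_step_bytes(n):>12} B, ssm {ssm_step_bytes(n)} B")
```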
Are TPUs different?