Comment by ACCount37

Comment by ACCount37 14 hours ago

It's not a very promising direction because autoregressive LLMs still deliver better output quality per model weight, as a rule.

Now, is it possible that a model can combine advantages of both? Combine fast generation and multidirectional causality of diffusion with precision, capabilities and generalization of autoregression?

Maybe. This paper is research in that direction. So far, it's not a clear upgrade over autoregressive LLMs.

euleriancon 11 hours ago

Diffusion LMs do seem to be able to get more out of the same data. In a world where we are already training transformer-based LLMs on all the text available, diffusion LMs' ability to keep learning from a fixed set of data may let them outperform autoregressive transformers.

https://arxiv.org/abs/2511.03276


nbardy 10 hours ago

There's another paper that shows you can get the same effect by training autoregression on fill-in-the-middle data.
So it's more about the masked-modeling objective than diffusion.

- albertzeyer 4 hours ago
  
  Which paper is that?
  

ricochet11 3 hours ago

Perhaps the issue is that text often has directionality.

https://arxiv.org/abs/2401.17505


ilaksh 10 hours ago

4-5 times faster with minimal change in quality seems like a clear upgrade in efficiency.


zaptrem 10 hours ago

Latency may be better, but throughput (the thing companies care about) may be the same or worse, since at every step the entire diffusion window has to be passed through the model. With AR models only the most recent token goes through, which is much more compute-efficient and leaves decoding memory-bound rather than compute-bound. The trade-off with these models is that you get more than one token per forward pass, but idk at what point that becomes worth it (probably depends on the model and the diffusion window size).
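
A rough way to see the latency/throughput split described above is to count forward passes (a proxy for latency) separately from token positions processed (a proxy for compute). The sketch below is only a back-of-the-envelope model under stated assumptions: the AR model uses a KV cache and processes one new position per pass, the diffusion model reprocesses its whole window on every pass and commits a fixed number of tokens per pass, and attention over the earlier context is ignored. The function names and the `window` and `tokens_per_step` parameters are hypothetical, not taken from the paper.

```python
import math


def ar_cost(num_tokens):
    """Assumed AR decoding with a KV cache: one new position per forward pass."""
    return {"forward_passes": num_tokens, "positions_processed": num_tokens}


def diffusion_cost(num_tokens, window, tokens_per_step):
    """Assumed diffusion decoding: every pass processes the full window
    and commits `tokens_per_step` tokens (both values are illustrative)."""
    passes = math.ceil(num_tokens / tokens_per_step)
    return {"forward_passes": passes, "positions_processed": passes * window}


ar = ar_cost(128)
diff = diffusion_cost(128, window=32, tokens_per_step=4)

# Fewer sequential passes (lower latency) but more positions processed
# in total (more compute per generated token).
print(ar)
print(diff)
```

Under these assumptions the diffusion decoder spends `window / tokens_per_step` times the compute per generated token, so whether it wins on throughput depends on how far an AR deployment is from being compute-bound.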


fragmede 9 hours ago

> still deliver better output quality per model weight, as a rule.

Is it possible to quantify that and just have a linked slider for quality and speed? If I can get an answer that's 80% right in 1/10th the time, and then iterate on that, who comes out ahead?


jrk 7 hours ago

Yes, but you can also do the same thing with autoregressive models just by making them smaller. This tradeoff always exists; the question is whether the Pareto curve for diffusion models ever crosses or dominates the best autoregressive option at the same throughput (or quality).
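
The Pareto comparison mentioned above can be made concrete with a small sketch. It assumes each model is summarized as a `(throughput, quality)` pair where higher is better on both axes; the numbers are placeholders chosen only to exercise the code, not measurements of any real model, and the helper names are hypothetical.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Placeholder (throughput, quality) pairs, purely illustrative.
candidates = [(1.0, 0.9), (2.0, 0.8), (2.0, 0.7), (4.0, 0.5)]
front = pareto_front(candidates)
print(front)
```

A diffusion model would be a clear upgrade in this framing only if its points knock autoregressive points off the front, rather than merely landing somewhere on or below it.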
