Comment by claiir

It’s not just you. The speedup is an artefact of the CFG (classifier-free guidance) the model uses. The other problem is that the speedup isn’t constant: the speaking rate keeps increasing as the generation progresses. The Parakeet paper [1] (from which OP lifted their model architecture almost directly [2]) treats the matter fairly thoroughly:

> When we apply CFG to Parakeet sampling, quality is significantly improved. However, on inspecting generations, there tends to be a dramatic speed-up over the duration of the sample (i.e. the rate of speaking increases significantly over time). Our intuition for this problem is as follows: Say that our model is (at some level) predicting phonemes and the ground truth distribution for the next phoneme occurring is 25% at a given timestep. Our conditional model may predict 20%, but because our unconditional model cannot see the text transcription, its prediction for the correct next phoneme will be much lower, say 5%. With a reasonable level of CFG, because [the logit delta] will be large for the correct next phoneme, we’ll obtain a much higher final probability, say 50%, which biases our generation towards faster speech. [emphasis mine]
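
To put numbers on that intuition, here is a minimal sketch of CFG applied to logits. It assumes the common formulation `uncond + scale * (cond - uncond)`, which may not be the exact form Parakeet or Dia use. The 4-token vocabulary, the guidance scale of 3.0, and every probability other than the quoted 20% and 5% are made up for illustration.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cfg_logits(cond, uncond, scale):
    """Assumed CFG form: extrapolate from the unconditional logits
    towards (and past) the conditional logits."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

# Hypothetical vocabulary; index 0 stands for "the correct next phoneme",
# the rest for alternatives such as pausing or holding the current sound.
cond_probs = [0.20, 0.50, 0.20, 0.10]    # conditional model (sees the transcript)
uncond_probs = [0.05, 0.55, 0.25, 0.15]  # unconditional model (does not)

cond = [math.log(p) for p in cond_probs]
uncond = [math.log(p) for p in uncond_probs]

guided_probs = softmax(cfg_logits(cond, uncond, scale=3.0))
print([round(p, 3) for p in guided_probs])
```

With these made-up numbers the "next phoneme" token goes from 20% under the conditional model to roughly 85% after guidance, so the sampler moves on to the next phoneme far more eagerly than the conditional model alone would.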

Parakeet details a solution to this, though Dia has not adopted it (yet?):

> To address this, we introduce CFG-filter, a modification to CFG that mitigates the speed drift. The idea is to first apply the CFG calculation to obtain a new set of logits as before, but rather than use these logits to sample, we use these logits to obtain a top-k mask to apply to our original conditional logits. Intuitively, this serves to constrict the space of possible “phonemes” to text-aligned phonemes without heavily biasing the relative probabilities of these phonemes (or for example, start next word vs pause more). [emphasis mine]
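
A rough sketch of how I read that description, not Parakeet’s actual code: the guided logits are only used to choose which tokens survive, and sampling then happens from the conditional distribution restricted to those tokens. The function name `cfg_filter`, the CFG form `uncond + scale * (cond - uncond)`, the toy 4-token vocabulary, and the values of `scale` and `k` are all assumptions for illustration.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities (-inf maps to 0)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cfg_filter(cond, uncond, scale, k):
    """Use the guided logits only to build a top-k mask, then
    return probabilities from the masked *conditional* logits."""
    guided = [u + scale * (c - u) for c, u in zip(cond, uncond)]
    # Indices of the k largest guided logits.
    order = sorted(range(len(guided)), key=lambda i: guided[i], reverse=True)
    keep = set(order[:k])
    # Tokens outside the mask are dropped; kept tokens retain their conditional logits.
    masked = [c if i in keep else -math.inf for i, c in enumerate(cond)]
    return softmax(masked)

# Hypothetical vocabulary; index 0 stands for "the correct next phoneme".
cond_probs = [0.20, 0.50, 0.20, 0.10]
uncond_probs = [0.05, 0.55, 0.25, 0.15]
cond = [math.log(p) for p in cond_probs]
uncond = [math.log(p) for p in uncond_probs]

filtered_probs = cfg_filter(cond, uncond, scale=3.0, k=2)
print([round(p, 3) for p in filtered_probs])
```

In this toy case the two tokens that survive the mask keep their conditional ratio of 0.20 to 0.50, so "next phoneme" ends up near 29% rather than being pushed towards certainty, which is the "without heavily biasing the relative probabilities" part of the quote.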

The paper contains audio samples with ablations you can listen to.

[1]: https://jordandarefsky.com/blog/2024/parakeet/#classifier-fr...

[2]: https://news.ycombinator.com/item?id=43758686