Comment by sailingparrot a day ago
> the verifier is checking if the tokens are good-enough, not if they're the exact same tokens it would have selected
That's up to you: it depends on how you implement it and how much speed you want to trade for quality. It is not an intrinsic attribute of speculative decoding. The verifier checks whether each token predicted by the draft model is among the top-k tokens predicted by the full-size model at that step. Set k to 1 and you will only accept exact matches. Set k > 1 and you will indeed start accepting "good enough" tokens, but you will get faster inference.
But no matter what value you choose for k, the technique described in the article still applies and gives faster inference with no loss in quality compared to a setup without it, at the same value of k.
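
To make the top-k check concrete, here is a rough sketch of the acceptance rule as I described it above. The function names (`top_k_ids`, `count_accepted`) and the toy logits are made up for illustration and don't come from any particular library. Real implementations may use a different acceptance rule, e.g. a probabilistic accept/reject step.

```python
from typing import Sequence, Set


def top_k_ids(logits: Sequence[float], k: int) -> Set[int]:
    """Return the token ids of the k largest logits (hypothetical helper)."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(ranked[:k])


def count_accepted(
    draft_tokens: Sequence[int],
    verifier_logits: Sequence[Sequence[float]],
    k: int,
) -> int:
    """Count how many leading draft tokens the verifier accepts.

    A draft token is accepted if it is among the verifier's top-k tokens
    at that position. Verification stops at the first rejected token.
    """
    accepted = 0
    for token, logits in zip(draft_tokens, verifier_logits):
        if token not in top_k_ids(logits, k):
            break
        accepted += 1
    return accepted


# Toy example: vocabulary of 4 tokens, 3 drafted positions.
verifier_logits = [
    [0.1, 2.0, 1.5, -1.0],  # verifier's best token: 1, second best: 2
    [3.0, 0.2, 2.9, 0.0],   # verifier's best token: 0, second best: 2
    [0.0, 0.5, 0.1, 4.0],   # verifier's best token: 3, second best: 1
]
draft_tokens = [1, 2, 0]

for k in (1, 2, 4):
    print(k, count_accepted(draft_tokens, verifier_logits, k))
```

With k = 1 only the first draft token survives, because the second one is not the verifier's top choice. Raising k lets more of the draft through per verification pass, which is where the extra speed (and the quality trade-off) comes from.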