furyofantares a day ago

> the specific technique discussed here (speculative decoding) has no impact on the quality of generations

I don't see why that would be true. As I understand it, the verifier is checking whether the tokens are good enough, not whether they're the exact same tokens it would have selected. The predicted tokens could be consistently slightly worse, which could cascade and make the overall output a lot worse.

buildbot a day ago

It can be exact or not! Depends on the kind of sampling you are doing.

You can do exact verification, where as soon as a token mismatches you reject that token and everything after it from your draft. Relaxed acceptance techniques instead measure how wrong the mispredicted token is via some metric and accept it if it's close enough, so you get longer draft lengths with higher acceptance rates.
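
Roughly, in toy code (the names, the dict-of-probabilities format, and the acceptance threshold are placeholders for illustration, not any particular library's API):

    # Sketch: accepting draft tokens against verifier outputs.
    # draft_tokens: what the small model proposed.
    # verifier_top: the token the big model would pick at each position
    # (all positions scored in one parallel forward pass).

    def accept_exact(draft_tokens, verifier_top):
        """Exact verification: keep the prefix up to the first mismatch."""
        accepted = []
        for drafted, expected in zip(draft_tokens, verifier_top):
            if drafted != expected:
                break  # reject this token and everything after it
            accepted.append(drafted)
        return accepted

    def accept_relaxed(draft_tokens, verifier_probs, threshold=0.1):
        """Relaxed acceptance: keep a drafted token as long as the verifier
        still gives it at least `threshold` probability (one possible
        'close enough' metric)."""
        accepted = []
        for drafted, probs in zip(draft_tokens, verifier_probs):
            if probs.get(drafted, 0.0) < threshold:
                break
            accepted.append(drafted)
        return accepted

    # Toy example: the draft disagrees with the verifier at position 2.
    draft = ["the", "cat", "sat"]
    verifier_pick = ["the", "cat", "slept"]
    verifier_probs = [{"the": 0.9}, {"cat": 0.8}, {"slept": 0.6, "sat": 0.3}]

    print(accept_exact(draft, verifier_pick))     # ['the', 'cat']
    print(accept_relaxed(draft, verifier_probs))  # ['the', 'cat', 'sat']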

sailingparrot a day ago

> the verifier is checking if the tokens are good-enough, not if they're the exact same tokens it would have selected

That's up to you; it depends on how you implement it and how much you want to prioritize speed at the expense of quality. This is not an intrinsic attribute of speculative decoding. The verifier checks whether the tokens predicted by the draft model are among the top-k tokens predicted by the full-size model at each step. Set k to 1 and you will only accept perfect matches. Set k > 1 and you will indeed start accepting "good enough" tokens, but will get faster inference.

But no matter what value you choose for k, the technique described in the article still applies and gives faster inference with no additional quality loss compared to a setup without this technique at the same value of k.
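
For illustration, that k-based acceptance rule could look roughly like this (toy code with made-up names, and logits as plain dicts rather than any real framework's tensors):

    def accept_top_k(draft_tokens, verifier_logits, k=1):
        """Accept a drafted token only if it is among the verifier's top-k
        choices at that position; stop at the first rejection."""
        accepted = []
        for drafted, logits in zip(draft_tokens, verifier_logits):
            top_k = sorted(logits, key=logits.get, reverse=True)[:k]
            if drafted not in top_k:
                break
            accepted.append(drafted)
        return accepted

    # Verifier scores for each position (computed in one parallel pass).
    logits = [
        {"the": 5.0, "a": 3.0},
        {"cat": 4.0, "dog": 3.9},
        {"slept": 2.0, "sat": 1.8},
    ]
    draft = ["the", "dog", "sat"]

    print(accept_top_k(draft, logits, k=1))  # ['the'] -- exact (argmax) matches only
    print(accept_top_k(draft, logits, k=2))  # ['the', 'dog', 'sat'] -- "good enough" accepted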

gkapur a day ago

Adding to the prior comments, since my intuition matched yours: there's a nice Reddit thread that gives some context on how it can be faster even if you require exact matches: https://www.reddit.com/r/LocalLLaMA/s/ARxHLqRjdM

The TLDR/key (from my understanding) is that verifying N tokens can be faster than generating N tokens.

  • sailingparrot a day ago

    > The TLDR/key (from my understanding) is that verifying N tokens can be faster than generating N tokens.

    Yes. This is because to generate token n+1 you need token n, and so on, so generating from scratch is a sequential (and thus slow) process. When we verify tokens, we can, for each position, use all preceding tokens as input and check that the output token matches the expectation. And since the full sequence we want to verify already exists, we can do this for every position in parallel rather than sequentially (see the sketch after this comment).

    This is also why training transformer models is much faster than training RNNs: we do the same thing during training, it's just that the sequence we compare against is the ground truth rather than the output of another model.
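
    To make the parallel-vs-sequential point concrete, here is a toy sketch (the "model" is a dummy rule rather than a real network, and the function names are made up):

      def forward(prefix):
          """Stand-in for one model call: 'predict' the next token via a dummy rule."""
          return f"tok{len(prefix)}"

      def generate(prompt, n):
          """Sequential decoding: token n+1 needs token n, so n separate model calls."""
          out = list(prompt)
          for _ in range(n):
              out.append(forward(out))
          return out[len(prompt):]

      def verify(prompt, draft):
          """Count how many draft tokens match what the model itself would produce.

          This toy loop calls forward() once per position, but in a transformer
          all positions of prompt + draft can be scored in a single forward pass,
          because each position only needs the (already known) tokens before it."""
          n_accepted = 0
          for i, drafted in enumerate(draft):
              if forward(list(prompt) + draft[:i]) != drafted:
                  break
              n_accepted += 1
          return n_accepted

      prompt = ["<s>"]
      print(generate(prompt, 3))                       # ['tok1', 'tok2', 'tok3'] -- 3 sequential calls
      print(verify(prompt, ["tok1", "tok2", "tok3"]))  # 3 -- checkable with one parallel pass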