Comment by furyofantares a day ago
> the specific technique discussed here (speculative decoding) has no impact on the quality of generations
I don't see why that would be true. As I understand it, the verifier is checking whether the tokens are good enough, not whether they're the exact tokens it would have selected itself. The predicted tokens could be consistently slightly worse, and that could cascade and make the overall output a lot worse.
It can be exact or not! Depends on the kind of sampling you are doing.
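You can do exact verification: as soon as a draft token mismatches what the verifier would have picked, you reject that token and everything after it in the draft. Relaxed acceptance techniques instead measure how wrong the mispredicted token is via some metric, and accept it if it’s close enough. That gets you higher acceptance rates and longer accepted runs per draft, but the output is no longer guaranteed to match what the verifier alone would have produced.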
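To make the difference concrete, here's a toy sketch of the two verification modes under greedy decoding. Everything in it is made up for illustration: the function names, the dict-based distributions, and especially the "close enough" rule (accept if the verifier's probability for the draft token is at least `ratio` times its top probability), which is just one possible metric and not the rule any particular system uses.

```python
from typing import Dict, List

Dist = Dict[str, float]  # toy next-token distribution: token -> probability


def verify_exact(draft: List[str], target_dists: List[Dist]) -> List[str]:
    """Greedy exact verification: keep draft tokens only while they equal the verifier's argmax."""
    out: List[str] = []
    for tok, dist in zip(draft, target_dists):
        best = max(dist, key=dist.get)
        if tok != best:
            out.append(best)  # take the verifier's token and drop the rest of the draft
            break
        out.append(tok)
    return out


def verify_relaxed(draft: List[str], target_dists: List[Dist], ratio: float = 0.5) -> List[str]:
    """Relaxed acceptance (hypothetical metric): accept a draft token if the verifier
    gives it at least `ratio` times the probability of its own top choice."""
    out: List[str] = []
    for tok, dist in zip(draft, target_dists):
        best = max(dist, key=dist.get)
        if dist.get(tok, 0.0) >= ratio * dist[best]:
            out.append(tok)  # close enough, keep going
        else:
            out.append(best)  # too far off: fall back to the verifier's token and stop
            break
    return out


# Verifier distributions at each draft position (toy numbers).
draft = ["the", "cat", "sat", "down"]
target_dists = [
    {"the": 0.9, "a": 0.1},
    {"dog": 0.5, "cat": 0.4, "fox": 0.1},
    {"sat": 0.8, "ran": 0.2},
    {"up": 0.7, "still": 0.2, "down": 0.1},
]

print(verify_exact(draft, target_dists))    # ['the', 'dog']
print(verify_relaxed(draft, target_dists))  # ['the', 'cat', 'sat', 'up']
```

The exact version only ever emits tokens the verifier would have picked, so the output is identical to running the verifier alone. The relaxed version lets "cat" through even though the verifier preferred "dog", which is where the quality concern in the parent comment comes from.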