Comment by gkapur
Adding to the prior comments, as my intuition matched yours. There’s a nice Reddit thread that gives some context on how it can be faster even if you require exact matches: https://www.reddit.com/r/LocalLLaMA/s/ARxHLqRjdM
The TLDR/key (from my understanding) is that verifying N tokens can be faster than generating N tokens.
> The TLDR/key (from my understanding) is that verifying N tokens can be faster than generating N tokens.
Yes. This is because to generate token n+1 you need token n, and so on, so generating from scratch is a sequential (and thus slow) process. When we verify tokens, we can, for each token, use all preceding tokens as input and check that the output token matches the expectation. But since the full sequence we want to verify already exists, we can do this check for all tokens in parallel, in a single forward pass, rather than sequentially.
This is why training transformer models is much faster than training RNNs: we do the same thing during training, except that the sequence we compare against is the ground truth rather than the output of another model.
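To make the sequential vs. parallel point concrete, here's a toy sketch in Python. Everything in it is made up for illustration: `next_token` is a hypothetical deterministic stand-in for a greedy model, and `forward` only mimics what a causally masked transformer gives you (a prediction at every position from one pass). It just counts forward passes; it isn't how any real library implements this.

```python
VOCAB = 50
calls = {"forward": 0}  # counts how many "forward passes" we run


def next_token(prefix):
    # Hypothetical stand-in for a greedy model's next-token choice.
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def forward(seq):
    # One forward pass: a next-token prediction for EVERY position at once,
    # which is what a causally masked transformer produces.
    calls["forward"] += 1
    return [next_token(seq[: i + 1]) for i in range(len(seq))]


def generate(prompt, n):
    # Sequential: token n+1 needs token n, so n tokens cost n passes.
    seq = list(prompt)
    for _ in range(n):
        seq.append(forward(seq)[-1])
    return seq[len(prompt):]


def verify(prompt, draft):
    # Parallel: the draft already exists, so one pass checks every position.
    preds = forward(list(prompt) + list(draft))
    accepted = 0
    for i, tok in enumerate(draft):
        # preds[len(prompt) - 1 + i] is the prediction given prompt + draft[:i]
        if preds[len(prompt) - 1 + i] != tok:
            break
        accepted += 1
    return accepted  # length of the longest exactly matching prefix


prompt = [3, 7, 11]
draft = generate(prompt, 4)            # 4 forward passes
calls["forward"] = 0
print(verify(prompt, draft), calls)    # all 4 accepted, using 1 forward pass
```

In this toy version generating 4 tokens takes 4 passes while verifying 4 draft tokens takes 1. In a real model that single pass is over a longer sequence, so it's not free, but the per-token work runs in parallel on the hardware instead of one step at a time.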