Comment by ggamecrazy 12 hours ago
They literally can! The exact speculative method is supported on vLLM using `speculative_model="[ngram]"`[1]
1: https://docs.vllm.ai/en/latest/features/spec_decode.html#spe...
Not quite. The paper uses its own N-gram rules with positive/negative/invariant weights as a rudimentary form of attention, and those rules are distilled from the model itself.
As I found out from the repo [0] linked in the Twitter thread in the documentation (which for some reason they didn't link to directly), this appears to be a plain Markov-chain-style lookup over the context, if it even builds a stochastic matrix. See the algorithm below.
[0] https://github.com/apoorvumang/prompt-lookup-decoding