Comment by pona-a
I wonder if these N-gram reduced models, augmented with confidence measures, can act as very fast speculative decoders. Or maybe the sheer number of explicit rules unfolded from the compressed latent representation will make it impractical.
They literally can! Exactly this speculative method is supported in vLLM via `speculative_model="[ngram]"`.[1]
1: https://docs.vllm.ai/en/latest/features/spec_decode.html#spe...
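The idea behind vLLM's `[ngram]` speculator is prompt lookup decoding: instead of a separate draft model, the trailing n-gram of the generated context is matched against earlier text, and the tokens that followed the match are proposed as the draft. A minimal sketch of the drafting step (function and parameter names are illustrative, not vLLM's internals):

```python
def ngram_draft(tokens, ngram_size=3, num_draft=4):
    """Propose draft tokens by matching the trailing n-gram
    against an earlier occurrence in the context."""
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan backwards for the most recent earlier occurrence of the tail.
    for i in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[i:i + ngram_size] == tail:
            start = i + ngram_size
            # Draft the tokens that followed the match last time.
            return tokens[start:start + num_draft]
    return []  # no match: fall back to normal decoding

# The context "1 2 3 4 5 ... 1 2 3" suggests "4 5 ..." will repeat.
print(ngram_draft([1, 2, 3, 4, 5, 1, 2, 3]))  # → [4, 5, 1, 2]
```

The target model then verifies the whole draft in one forward pass and keeps the longest accepted prefix, so the scheme only pays off on repetitive text (code, summarization, RAG), which matches the concern above about when it becomes impractical.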