Comment by wishawa
I didn't know this! I've always thought speculative decoding was "if p(draft_token) > threshold, use it". You made me go read how it actually works and it's pretty neat!
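For anyone else who had the same misconception, here is roughly the rule as I understand it from the paper (a minimal numpy sketch, not any provider's actual implementation): you accept the drafted token x with probability min(1, p(x)/q(x)), and on rejection you resample from the normalized residual max(0, p - q), which keeps the output distributed exactly like the target model.

```python
import numpy as np

def speculative_accept(draft_token, target_probs, draft_probs, rng):
    """One accept/reject step of speculative sampling (arXiv:2211.17192).

    target_probs: p(.) from the target model over the vocab
    draft_probs:  q(.) from the draft model over the vocab
    draft_token:  the token the draft model actually sampled from q (so q > 0 there)
    """
    p_x = target_probs[draft_token]
    q_x = draft_probs[draft_token]
    # Accept with probability min(1, p(x)/q(x)) -- not a fixed threshold.
    if rng.random() < min(1.0, p_x / q_x):
        return draft_token, True
    # On rejection, resample from the normalized residual max(0, p - q),
    # which makes the combined procedure sample exactly from p.
    residual = np.maximum(target_probs - draft_probs, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual), False

# Toy usage with made-up distributions:
rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])   # target
q = np.array([0.4, 0.5, 0.1])   # draft
token, accepted = speculative_accept(1, p, q, rng)
```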
That said, I still think some providers are cheating. Please correct me if the test below is flawed.
I generated texts at temperature = 0 vs temperature = 2. At high temperature the distributions effectively become flatter, so the divergence between the real and draft effective distributions (the D_LK used in Theorem 3.5 of arXiv:2211.17192) becomes smaller. At T=2 the model speaks complete gibberish, so the effective distribution must be pretty flat. This should mean fewer rejections --> much faster speculative decoding. Yet I see no increase in throughput at all...
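To sanity-check that intuition with made-up logits (a toy sketch, not the actual models), the expected per-step acceptance probability is beta = sum_x min(p(x), q(x)) = 1 - D_LK(p, q), and it should climb as temperature flattens both distributions:

```python
import numpy as np

def softmax(logits, temperature):
    z = logits / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
vocab = 32000
target_logits = rng.normal(size=vocab)
# Hypothetical draft model: correlated with the target but noisier.
draft_logits = target_logits + rng.normal(scale=2.0, size=vocab)

for T in (0.5, 1.0, 2.0):
    p = softmax(target_logits, T)
    q = softmax(draft_logits, T)
    # Per-step acceptance probability: beta = sum_x min(p(x), q(x)) = 1 - D_LK(p, q)
    beta = np.minimum(p, q).sum()
    print(f"T={T}: expected acceptance ~ {beta:.3f}")
```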
Not sure exactly what setup you are running, but in theory yes, higher temperature for both models means more overlap between the distributions and thus fewer rejections -> faster sampling (but worse quality overall).
However, if you raise the temperature but are still operating under top-k sampling with a small k, I'm not sure it will translate into any noticeable difference, since the truncation keeps your effective distributions very much non-uniform.
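A rough illustration with synthetic logits, assuming top-k truncation and renormalization is applied to both the target and draft distributions before the acceptance test (real serving stacks may differ): once you truncate to a small k, the acceptance probability is capped by how much mass the two top-k sets share, no matter how flat the temperature makes things inside those sets.

```python
import numpy as np

def softmax(logits, temperature):
    z = logits / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def top_k_renorm(probs, k):
    # Zero everything outside the top-k entries, then renormalize.
    out = np.zeros_like(probs)
    idx = np.argpartition(probs, -k)[-k:]
    out[idx] = probs[idx]
    return out / out.sum()

rng = np.random.default_rng(0)
vocab = 32000
target_logits = rng.normal(size=vocab)
draft_logits = target_logits + rng.normal(scale=2.0, size=vocab)

for T in (1.0, 2.0):
    p = top_k_renorm(softmax(target_logits, T), k=10)
    q = top_k_renorm(softmax(draft_logits, T), k=10)
    # With both distributions truncated, acceptance is capped by how much
    # mass each puts on the overlap of the two top-k index sets.
    print(f"T={T}, top-10: expected acceptance ~ {np.minimum(p, q).sum():.3f}")
```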