Comment by wishawa a day ago
Inference is impressively fast. But what about quality? In the Kimi vendor verifier (https://github.com/MoonshotAI/K2-Vendor-Verifier/), Together has one of the highest tool call failure rates (>300 failures over the benchmark, compared to 0-2 for the official API, groq, SiliconFlow, and Infinigence).
I don't know anything about Together's quality in general, but the specific technique discussed here (speculative decoding) has no impact on generation quality: every token the draft model proposes is verified against the base model, so the output distribution is provably identical to sampling from the base model alone. You should be able to apply it to whichever model you want and see the advertised speedup while retaining the quality of your base model.
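To see why the speedup is "free", here is a toy sketch of the standard speculative-sampling accept/reject rule (the distributions and vocabulary are made up for illustration; real systems use a small draft LLM and the large base LLM). A drafted token is accepted with probability min(1, p_target/p_draft); on rejection, a correction token is resampled from the normalized residual max(0, p_target − p_draft). Either way, the emitted token is distributed exactly according to the target model:

```python
import random

VOCAB = ["a", "b", "c"]

def draft_probs(ctx):
    # Toy draft model: cheap, slightly different distribution (hypothetical numbers).
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def target_probs(ctx):
    # Toy target (base) model: the distribution we want the output to follow.
    return {"a": 0.5, "b": 0.4, "c": 0.1}

def speculative_step(ctx, k=4):
    """Draft up to k tokens, verifying each against the target model.

    Accepted tokens (and the one correction token after a rejection) are
    distributed exactly per target_probs, so quality matches the base model;
    the speedup comes from verifying k drafted tokens in one target pass.
    """
    out = []
    for _ in range(k):
        q = draft_probs(ctx + out)
        tok = random.choices(VOCAB, weights=[q[t] for t in VOCAB])[0]
        p = target_probs(ctx + out)
        if random.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)  # accepted: token is target-distributed
        else:
            # rejected: resample from the normalized residual max(0, p - q)
            resid = {t: max(0.0, p[t] - q[t]) for t in VOCAB}
            z = sum(resid.values())
            out.append(random.choices(VOCAB, weights=[resid[t] / z for t in VOCAB])[0])
            return out  # stop after emitting the correction token
    return out

random.seed(0)
print(speculative_step([]))  # a short run of tokens drawn from the target distribution
```

Sampling many first tokens from this loop reproduces the target model's marginals (≈0.5/0.4/0.1 here), which is the formal sense in which speculative decoding cannot degrade quality.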