Comment by altcognito 3 days ago

Explain this though. The code is deterministic, even if it relies on pseudo random number generation. It doesn't just happen, someone has to make a conscious decision to force a different code path (or model) if the system is loaded.

minimaltom 3 days ago

It's not deterministic. Any individual floating point mul/add is deterministic, but on a GPU these all happen in parallel, and the accumulation happens in whatever order they complete.

When you add A, then B, then C, you can get a different answer than C, then A, then B, because of floating point rounding, approximation error, subnormals, etc.
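
A quick illustration of the order dependence (plain CPU floats in Python here, but the same effect drives the GPU accumulation differences):

    import random

    xs = [random.uniform(-1e6, 1e6) for _ in range(100_000)]

    a = sum(xs)                           # one summation order
    b = sum(random.sample(xs, len(xs)))   # same numbers, shuffled order

    # Usually prints False with a small nonzero difference.
    print(a == b, abs(a - b))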

  • bonoboTP 3 days ago

    It can be made deterministic. It's not trivial and can slow things down a bit (not much), but there are environment variables and framework settings that make your GPU computations bitwise reproducible. I have done this when training models with PyTorch.
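
    Roughly the kind of setup involved (a sketch, assuming PyTorch on CUDA; exact flags and kernel coverage vary by version):

        import os
        import torch

        # cuBLAS needs this set before the CUDA context exists to give reproducible results.
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

        torch.manual_seed(0)
        torch.use_deterministic_algorithms(True)   # error on kernels with no deterministic variant
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False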

    • minimaltom 3 days ago

      There are settings to make it reproducible, but they incur a non-negligible drop in performance.

      Unsurprising, given that they amount to explicit synchronization to force a deterministic order of operations.

jmalicki 3 days ago

For all practical purposes, any code that relies on the output of a PRNG is non-deterministic in all but the most pedantic senses... and if the LLM temperature isn't set to 0, LLMs are sampling from a distribution.

If you're going to call a PRNG deterministic, then the outcome of a complicated concurrent system with no guaranteed ordering is deterministic too!

  • gmueckl 3 days ago

    No, this isn't right. There are totally legitimate use cases for PRNGs as sources of random number sequences following a given probability distribution, where freezing the seed and getting reproducibility is actually required.
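
    For example, with a frozen seed the sequence is exactly repeatable run after run:

        import random

        random.seed(1234)
        a = [random.random() for _ in range(3)]

        random.seed(1234)
        b = [random.random() for _ in range(3)]

        assert a == b  # same seed, same sequence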

    • jmalicki 3 days ago

      And for a complicated concurrent system you can replay the exact timings and orderings as well!

      • gmueckl 2 days ago

        That's completely different from PRNGs. I don't understand why you think those things belong together.

  • bonoboTP 3 days ago

    How is this related to overloading? The nondeterminism should not be a function of load. Under load it should just time out or reply more slowly. It will only be dumber if it gets rerouted to a dumber, faster model, e.g. a quantized one.

  • joquarky 2 days ago

    Temperature can't be literally zero, or it creates a divide by zero error.

    When people say zero, it is shorthand for “as deterministic as this system allows”, but it's still not completely deterministic.

    • forgotTheLast 2 days ago

      Zero temp just means argmax, which is what softmax sampling approaches in the limit T → 0 anyway. So it could very well be deterministic.
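
      A sketch of how a sampler can special-case it (illustrative only, not any particular vendor's implementation):

          import numpy as np

          def next_token(logits, temperature):
              if temperature == 0:
                  # Greedy decoding: a pure argmax, no division by T anywhere.
                  return int(np.argmax(logits))
              z = np.asarray(logits, dtype=np.float64) / temperature
              z -= z.max()                           # numerical stability
              probs = np.exp(z) / np.exp(z).sum()
              return int(np.random.choice(len(probs), p=probs))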

pertymcpert 3 days ago

Floating point addition and multiplication aren't associative, even though the corresponding operations in exact arithmetic are.
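
The classic one-liner in a Python REPL:

    >>> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
    False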

  • measurablefunc 3 days ago

    That would just add up to statistical noise instead of 10% degradation over a week.

    • kevin_thibedeau 3 days ago

      Catastrophic error accumulation can produce more profound effects than noise.

      • measurablefunc 3 days ago

        Just to make sure I got this right. They serve millions of requests a day & somehow catastrophic error accumulation is what is causing the 10% degradation & no one at Anthropic is noticing it. Is that the theory?

make3 3 days ago

There are a million techniques for making LLM inference more efficient at the cost of output quality: using a smaller model, using quantized models, using speculative decoding with a more permissive rejection threshold, etc. etc.
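
For instance, the speculative-decoding knob could look roughly like this (a sketch; the `leniency` factor is a hypothetical parameter, not any specific system's API):

    import numpy as np

    def accept_draft_token(p_target, p_draft, leniency=1.0):
        # Standard speculative sampling keeps a drafted token with
        # probability min(1, p_target / p_draft). A leniency > 1 relaxes
        # the test: more draft tokens survive, decoding gets faster, and
        # the output drifts away from the target model's distribution.
        accept_prob = min(1.0, leniency * p_target / p_draft)
        return np.random.random() < accept_prob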

FL33TW00D 3 days ago

It takes a different code path for efficiency.

e.g.

    if batch_size > 1024: kernel_x()
    else: kernel_y()