Comment by andai

Comment by andai a day ago

>just set temp to 0 to make LLMs deterministic

Does that really work? And is it affected by the almost continuous silent model updates? And gpt-5 has a "hidden" system prompt, even thru the API, which seemed to undergo several changes since launch...

simonw a day ago

It famously does not: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

Reply View 3 replies

andai a day ago

>But why aren’t LLM inference engines deterministic? One common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first. We will call this the “concurrency + floating point” hypothesis for LLM inference nondeterminism.
Dang, so we don't even know why it's not deterministic, or how to make it so? That's quite surprising! So if I'm reading this right, it doesn't just have to do with LLM providers cutting costs or making changes or whatever. You can't even get determinism locally. That's wild.
But I did read something just the other day about LLMs being invertible. It goes over my head but it sounds like they got a pretty reliable mapping from inputs to outputs, at least?
https://news.ycombinator.com/item?id=45758093
> Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions.
The distinction here appears to be between the output tokens versus some sort of internal state?

Reply View | 0 replies
Scipio_Afri a day ago

Hey Simon, do you have any posts diving into how one might be able to deal with evaluating LLMs or Machine Learning models in general when reproducibility is so difficult given non-determinism? Pytorch has an article on it https://docs.pytorch.org/docs/stable/notes/randomness.html but then doesn't really go into how one would then take this deterministic result, and evaluate a model that is in production (which would very likely need for performance reasons the non-determinism features enabled).
While this affects all models it seems, I think the case gets worse for in particular LLMs because I would imagine all backends, including proprietary ones, are batching users prompts. Other concurrent requests seem to change the output of your request, and then if there is even a one token change to the input or output token, especially on large inputs or outputs, the divergence can compound. Also vLLM's documentation mentions this: https://docs.vllm.ai/en/latest/usage/faq.html
So how does one do benchmarking of AI/ML models and LLMs reliably (lets ignore arguing over the flaws of the metrics themselves, and just the fact that the output for any particular input can diverge given the above). You'd also want to redo evals as soon as any hardware or software stack changes are made to the production environment.
Seems like one needs to setup a highly deterministic backend, by forcing non-deterministic behavior in pytorch and using a backend which doesn't do batching for an initial eval that would allow for troubleshooting and non-variation in output to get a better sense of how consistent the model without the noise of batching and non-deterministic GPU calculations/kernels etc.
However then, for production, when determinism isn't guaranteed because you'd need batching and non-determism for performance, I would think that one would want to do multiple runs in various real-world situations (such as multiple users doing all sorts of different queries at the same time) and do some sort of averaging of the results. But I'm not entirely sure, because I would imagine the types of queries other users are making would then change the results fairly significantly. I'm not sure how much the batching that vLLM does would change the results of the output; but vLLM does say that batching does influence changes in the outputs.

Reply View | 1 reply
- simonw a day ago
  
  This is so hard! I don't yet have a great solution for this myself, but I've been collecting notes about this on my "evals" tag for a while: https://simonwillison.net/tags/evals/
  The best writing I've seen about this is from Hamel Husain - https://hamel.dev/blog/posts/llm-judge/ and https://hamel.dev/blog/posts/evals-faq/ are both excellent.
  
  Reply View | 0 replies

aeve890 a day ago

Strictly speaking, it should work. We don't have a _real_ RNG yet and with the same seed any random function becomes deterministic. But behind the blackbox of LLM providers who know what's tunned processing your request.

But my point stands. The non-deterministic nature of LLMs are implementation details, not even close to physical constraints as the parent comment suggest.

Reply View 1 reply

Scipio_Afri a day ago

There is inherent non-determinism in all machine learning models unless you explicitly configure pytorch or other frameworks to do determinism (https://docs.pytorch.org/docs/stable/notes/randomness.html). However, this is very unlikely to be done in models that are being run in production due to performance and other issues.

Reply View | 0 replies