Comment by Scipio_Afri 2 days ago
Hey Simon, do you have any posts diving into how to evaluate LLMs, or machine learning models in general, when reproducibility is so difficult given non-determinism? PyTorch has an article on it (https://docs.pytorch.org/docs/stable/notes/randomness.html) but it doesn't really cover how you'd take that deterministic result and evaluate a model that is in production, which for performance reasons would very likely need the non-deterministic features enabled.
While this seems to affect all models, I think the problem is worse for LLMs in particular, because I would imagine all backends, including proprietary ones, are batching users' prompts. Other concurrent requests seem to change the output of your request, and if there is even a one-token change to the input or output, especially on large inputs or outputs, the divergence can compound. vLLM's documentation mentions this too: https://docs.vllm.ai/en/latest/usage/faq.html
So how does one reliably benchmark AI/ML models and LLMs? (Let's set aside arguments over the flaws of the metrics themselves and focus on the fact that the output for any particular input can diverge for the reasons above.) You'd also want to redo evals as soon as any hardware or software stack changes are made to the production environment.
It seems like one needs to set up a highly deterministic backend for an initial eval, by forcing deterministic behavior in PyTorch and using a backend that doesn't do batching. That would allow for troubleshooting with no variation in output, giving a better sense of how consistent the model is without the noise of batching and non-deterministic GPU kernels. A rough sketch of what I mean is below.
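Just as a sketch of what "forcing deterministic behavior in PyTorch" might look like, following the knobs described on that PyTorch randomness page (the exact settings needed will depend on which ops and CUDA libraries your model actually uses):

```python
import os
import random

import numpy as np
import torch

# Must be set before any CUDA work for deterministic cuBLAS behaviour
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"


def make_deterministic(seed: int = 0) -> None:
    # Seed every RNG the stack might touch
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Error out if an op has no deterministic implementation
    torch.use_deterministic_algorithms(True)
    # Force cuDNN onto deterministic kernels and disable autotuning
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Even then, PyTorch only promises reproducibility on the same hardware and software versions, which is part of why this only helps for the controlled baseline eval and not for production.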
For production, though, where determinism isn't guaranteed because you need batching and non-determinism for performance, I would think you'd want to do multiple runs in various real-world situations (such as multiple users making all sorts of different queries at the same time) and do some sort of averaging of the results. But I'm not entirely sure, because I would imagine the kinds of queries other users are making would then change the results fairly significantly. I'm not sure how much the batching that vLLM does changes the outputs, but vLLM does say that batching influences them.
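By "averaging of the results" I have something like the following in mind, where run_eval is a placeholder for whatever scores a full pass of the eval suite against the production (batched, non-deterministic) endpoint:

```python
import statistics
from typing import Callable


def repeated_eval(run_eval: Callable[[], float], n_runs: int = 10) -> dict:
    """Run the same eval suite several times against the production
    endpoint and report the mean score plus its spread."""
    scores = [run_eval() for _ in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }
```

The spread between runs would at least tell you how much of a score change after a hardware or software update is real versus just batching noise, though it wouldn't capture how different mixes of concurrent traffic shift the results.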
This is so hard! I don't yet have a great solution for this myself, but I've been collecting notes about this on my "evals" tag for a while: https://simonwillison.net/tags/evals/
The best writing I've seen about this is from Hamel Husain - https://hamel.dev/blog/posts/llm-judge/ and https://hamel.dev/blog/posts/evals-faq/ are both excellent.