Comment by kossTKR
Really? So the system recognises that someone asked the same question and serves the same answer? And who on earth shares the exact same context?
I mean, I get the idea, but it sounds so incredibly rare it would mean absolutely nothing optimisation-wise.
Yes. It's not incredibly rare; it's incredibly common. A huge percentage of queries to retail LLMs are things like "hello" and "what can you do", with static system prompts that make the total context identical.
It's worth maybe a 3% reduction in GPU usage. So call it a half billion dollars a year or so, for a medium to large service.
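A rough sketch of what that kind of exact-match cache could look like, assuming you key on a hash of the full context (system prompt plus messages); the names and storage choice here are purely illustrative, not any particular vendor's implementation:

```python
import hashlib
import json

# Toy exact-match response cache: the key is a hash of the *entire*
# context (system prompt + all messages), so only byte-identical
# conversations ever hit the cache.
_cache: dict[str, str] = {}

def _cache_key(system_prompt: str, messages: list[dict]) -> str:
    payload = json.dumps(
        {"system": system_prompt, "messages": messages},
        sort_keys=True, ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_completion(system_prompt, messages, generate_fn):
    """Return a stored response if this exact context was seen before,
    otherwise run the (expensive) model call and remember the result."""
    key = _cache_key(system_prompt, messages)
    if key in _cache:
        return _cache[key]  # cache hit: no GPU time spent
    response = generate_fn(system_prompt, messages)  # GPU inference
    _cache[key] = response
    return response
```

In practice you'd put this in a shared store with a TTL rather than an in-process dict, but the point stands: "hello" with the same static system prompt hashes to the same key every time.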