Comment by nnurmanov
I agree. Here is my thinking: what if LLM providers made short answers the default (for example, up to 200 tokens, unless the user explicitly enables "verbose mode")? Add prompt caching and route simple queries to smaller models, and the result is a 70%+ reduction in energy consumption without loss of quality. Current cost: roughly 3–5 Wh per request. At ChatGPT scale, that is $50–100 million per year in electricity (at U.S. rates).
In short mode: 0.3–0.5 Wh per request, or $5–10 million per year. That is a saving of up to 90%, or roughly 10–15 TWh globally with mass adoption, equivalent to the electricity supply of an entire country, without the risk of blackouts.
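A rough sanity check of those figures. The per-request energy numbers are the ones quoted above; the daily request volume and electricity price are my own assumptions, purely for illustration:

```python
# Back-of-envelope check of the numbers above. Per-request energy figures are
# taken from the comment; request volume and electricity price are assumed.

WH_PER_REQUEST_DEFAULT = 4.0    # midpoint of the quoted 3-5 Wh
WH_PER_REQUEST_SHORT = 0.4      # midpoint of the quoted 0.3-0.5 Wh
REQUESTS_PER_DAY = 500_000_000  # assumed ChatGPT-scale volume (hypothetical)
USD_PER_KWH = 0.10              # assumed average U.S. electricity rate

def annual_cost_usd(wh_per_request: float) -> float:
    """Annual electricity cost at the assumed request volume and rate."""
    kwh_per_year = wh_per_request * REQUESTS_PER_DAY * 365 / 1000
    return kwh_per_year * USD_PER_KWH

default_cost = annual_cost_usd(WH_PER_REQUEST_DEFAULT)  # ~ $73M/year
short_cost = annual_cost_usd(WH_PER_REQUEST_SHORT)      # ~ $7M/year
print(f"default mode: ${default_cost / 1e6:.0f}M/year")
print(f"short mode:   ${short_cost / 1e6:.0f}M/year")
print(f"savings:      {100 * (1 - short_cost / default_cost):.0f}%")
```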
This is not rocket science: just a toggle in the interface and, I believe, minor changes to the system prompt. It increases margins, reduces emissions, and frees up grid capacity for real innovation.
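For what it's worth, a minimal sketch of what that toggle-plus-routing logic could look like. The model names, the 200-token cap, and the "simple query" heuristic are hypothetical placeholders, not anyone's actual implementation:

```python
# Sketch of the "toggle plus routing" idea: pick a smaller model and a tight
# token cap for simple queries unless the user opted into verbose mode.

from dataclasses import dataclass

@dataclass
class RoutingDecision:
    model: str
    max_tokens: int

def route(query: str, verbose_mode: bool) -> RoutingDecision:
    """Choose a model and response-length cap for a single request."""
    if verbose_mode:
        # User explicitly asked for long answers: keep current behaviour.
        return RoutingDecision(model="large-model", max_tokens=2048)
    # Crude "simple query" heuristic: short input with no code blocks.
    is_simple = len(query.split()) < 40 and "```" not in query
    if is_simple:
        return RoutingDecision(model="small-model", max_tokens=200)
    return RoutingDecision(model="large-model", max_tokens=400)

print(route("What's the capital of France?", verbose_mode=False))
# RoutingDecision(model='small-model', max_tokens=200)
```

The point is that the change is a few lines of dispatch logic, not a new model.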
And what if the EU or California were to mandate such a mode? That would have a major impact on data-center economics.
Can you explain why such a low-hanging optimization, one that would cut costs by 90% without reducing perceived value, hasn't been implemented?