Comment by nnurmanov
I agree. Here is my thinking: what if LLM providers made short answers the default (for example, up to 200 tokens, unless the user explicitly enables "verbose mode")? Add prompt caching and route simple queries to smaller models, and the result is a 70%+ reduction in energy consumption without loss of quality. Current cost: roughly 3–5 Wh per request. At ChatGPT scale, that is $50–100 million per year in electricity (at U.S. rates).
In short mode: 0.3–0.5 Wh per request, or $5–10 million per year. That is a saving of up to 90%, or roughly 10–15 TWh globally with mass adoption, equivalent to the electricity supply of an entire country, without the risk of blackouts.
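A rough sanity check of those figures. The per-request energy numbers are the ones quoted above; the daily request volume and electricity price are my own assumptions, purely for illustration:

```python
# Back-of-envelope check of the numbers above. Per-request energy figures are
# taken from the comment; request volume and electricity price are assumed.

WH_PER_REQUEST_DEFAULT = 4.0    # midpoint of the quoted 3-5 Wh
WH_PER_REQUEST_SHORT = 0.4      # midpoint of the quoted 0.3-0.5 Wh
REQUESTS_PER_DAY = 500_000_000  # assumed ChatGPT-scale volume (hypothetical)
USD_PER_KWH = 0.10              # assumed average U.S. electricity rate

def annual_cost_usd(wh_per_request: float) -> float:
    """Annual electricity cost at the assumed request volume and rate."""
    kwh_per_year = wh_per_request * REQUESTS_PER_DAY * 365 / 1000
    return kwh_per_year * USD_PER_KWH

default_cost = annual_cost_usd(WH_PER_REQUEST_DEFAULT)  # ~ $73M/year
short_cost = annual_cost_usd(WH_PER_REQUEST_SHORT)      # ~ $7M/year
print(f"default mode: ${default_cost / 1e6:.0f}M/year")
print(f"short mode:   ${short_cost / 1e6:.0f}M/year")
print(f"savings:      {100 * (1 - short_cost / default_cost):.0f}%")
```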
This is not rocket science: just a toggle in the interface and, I believe, minor changes to the system prompt. It increases margins, reduces emissions, and frees up grid capacity for real innovation.
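For what it's worth, a minimal sketch of what that toggle-plus-routing logic could look like. The model names, the 200-token cap, and the "simple query" heuristic are hypothetical placeholders, not anyone's actual implementation:

```python
# Sketch of the "toggle plus routing" idea: pick a smaller model and a tight
# token cap for simple queries unless the user opted into verbose mode.

from dataclasses import dataclass

@dataclass
class RoutingDecision:
    model: str
    max_tokens: int

def route(query: str, verbose_mode: bool) -> RoutingDecision:
    """Choose a model and response-length cap for a single request."""
    if verbose_mode:
        # User explicitly asked for long answers: keep current behaviour.
        return RoutingDecision(model="large-model", max_tokens=2048)
    # Crude "simple query" heuristic: short input with no code blocks.
    is_simple = len(query.split()) < 40 and "```" not in query
    if is_simple:
        return RoutingDecision(model="small-model", max_tokens=200)
    return RoutingDecision(model="large-model", max_tokens=400)

print(route("What's the capital of France?", verbose_mode=False))
# RoutingDecision(model='small-model', max_tokens=200)
```

The point is that the change is a few lines of dispatch logic, not a new model.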
And what if the EU or California were to mandate such a mode? That would have a major impact on data-center economics.
Can you explain why such a low-hanging optimization, one that would cut costs by 90% without reducing perceived value, hasn't been implemented?