Comment by pixelesque 18 hours ago
Sort of off-topic, but it does make one think about usage of compute (and the backing energy / resources required for that)...
i.e. it doesn't seem too much of an exaggeration to say that we're getting closer to a situation where LLMs (or any other ML inference) are being run so much, for so many different reasons/requests, that the usage becomes genuinely significant.
Similarly, it's worth looking at what that compute is actually being used for: no doubt there are situations right now where Person A uses an LLM to expand a prompt like "make a long detailed report about our sales figures" into a 20-page report and delivers it to Person B. Person B then says "I haven't time to read all this, LLM please summarise it for me".
So you'd basically have LLM inference compute being used as a very inefficient method of data/request transfer: the sender expands a small amount of information before delivering it, and the recipient then uses an LLM on the other side to reduce it back down to something more manageable.
That sounds like the opposite of data compression (inflation?): the data size is increased before sending, then compressed back down to a smaller form on receipt.
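
To put the round trip in code terms, here's a minimal sketch (Python); llm() is a purely hypothetical stand-in for whatever inference call each side makes, not any real API:

    # Purely hypothetical sketch of the round trip described above.
    # llm() stands in for whatever inference API each side uses;
    # nothing here is a real library call.
    def llm(prompt: str) -> str:
        # A real implementation would hit an inference endpoint here.
        return f"<model output for: {prompt[:40]}...>"

    # Sender's side: a one-line request is inflated into a long document.
    short_request = "make a long detailed report about our sales figures"
    long_report = llm(f"Expand into a 20-page report: {short_request}")

    # ...the long report is delivered to the recipient...

    # Recipient's side: the long document is deflated straight back down.
    summary = llm(f"Summarise this for me: {long_report}")

    # Net effect: two expensive inference passes spent moving roughly
    # the information content of short_request from Person A to Person B.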