Comment by visarga 14 hours ago

This works great for software, math, and games, where you can have cheap validation. But what about messy real-world tasks? I think hindsight learning from chat logs could fit the bill. What do I mean?

Imagine a long conversation. It is hard to judge immediately whether an AI response was useful, but if you know the following 20 messages, it can be easy to infer. Not only can you see how it went, but sometimes you get real-world validation.

For example, a user comes to an LLM with a task, takes an idea, and tries it in reality. Later they return, maybe in a new chat session, and continue iterating. You get real-world testing of LLM responses through people.

These hindsight judgments can be used to generate "preference scores" and train a preference model, with which you can do RLHF, so user privacy is protected.
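The pipeline above can be sketched in a few lines: walk a logged conversation and score each assistant turn by what the user says in the following messages. This is a toy sketch, not a real implementation; the keyword-based judge stands in for an LLM judge reading the follow-up, and the cue lists and the 20-message window are my own assumptions.

```python
import re

# Hypothetical cue words standing in for a real hindsight judge.
POSITIVE_CUES = {"worked", "thanks", "solved", "great"}
NEGATIVE_CUES = {"failed", "error", "broken", "wrong"}

def hindsight_score(followup_messages, window=20):
    """Score an assistant reply by the user's next `window` messages."""
    score = 0
    for msg in followup_messages[:window]:
        words = set(re.findall(r"[a-z']+", msg.lower()))
        score += len(words & POSITIVE_CUES)
        score -= len(words & NEGATIVE_CUES)
    return score

def label_conversation(turns, window=20):
    """turns: list of (role, text) pairs.

    Returns (assistant_text, score) pairs that could feed a
    preference model, without shipping the raw chats anywhere.
    """
    labeled = []
    for i, (role, text) in enumerate(turns):
        if role != "assistant":
            continue
        followup = [t for r, t in turns[i + 1:] if r == "user"]
        labeled.append((text, hindsight_score(followup, window)))
    return labeled

chat = [
    ("user", "How do I fix this build?"),
    ("assistant", "Try clearing the cache."),
    ("user", "Still broken, same error."),
    ("assistant", "Then pin the dependency version."),
    ("user", "That worked, thanks!"),
]
print(label_conversation(chat))
```

Here the first reply nets 0 (the immediate failure is offset by the eventual success) while the second nets +2, which is exactly the "judge with the benefit of the next N messages" asymmetry that makes hindsight labels cheaper than judging each reply in isolation.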

I call this the human-AI experience flywheel. Of course, the larger the user base, the more experience the model collects. At the moment OpenAI has 500M users; they probably generate 0.5T interactive tokens/day. Those tokens go both into human brains and into LLM logs.

It's not about environment engineering anymore; it's about consequence harvesting. Meaningful validation emerges from systems actually being used by humans for real purposes.