Comment by Night_Thastus
Comment by Night_Thastus 4 days ago
It's already happened accidentally many times - a popular site (like reddit) posts something intended as a joke - and it ends up scooped up into the LLM training and shows up years later in results.
It's very annoying. It's part of the problem with LLMs in general, there's no quality control. Their input is the internet, and the internet is full of garbage. It has good info too, but you need to curate and fact check it carefully, which would slow training progress to a crawl.
Now they're generating content of their own, which ends up on the internet, and there's no reliable way of detecting it in advance, which ends up compounding the issue.
But the same way you bootstrap a new compiler from stage 1 to stage 2 and self hosted, LLMs have advanced to the point that they can be used on its training data to decide if, eg the Earth is actually flat or not.