Comment by boznz
Wake me back up when LLMs have a way to fact-check and correct their training data in real time.
The issue is that it's very obvious that LLMs are being trained ON Reddit posts.
Doesn't really matter. All of the gains made before any funding collapse will exist.
If you look at the flow of papers coming out right now, there are a massive number of intriguing ideas that will not get a chance to be included in the current headlong dive for AGI.
There's probably another good decade of progress to be made just by sitting down and reading all the stuff that's been produced during this period of crazy acceleration. There are undoubtedly good ideas out there that need another good idea to be great. That other good idea might already exist, but the two have yet to lock eyes across a crowded dance floor.
How is it possible that we have not figured out how to do this ourselves?
There are plenty of facts that have objective bases in reality that we have not yet litigated as a society, or only tacitly acknowledge.
There are an order of magnitude more subjective details about reality that we do not agree on.
They could have done that years ago; it's just that nobody seems to do it. Just hook it up to curated semantic knowledge bases.
Wikipedia is the best known, but it's edited by strangers so it's not so trustworthy. But lots of private companies have their own proprietary semantic knowledge bases on specific subjects that are curated by paid experts and have been iterated on for years, even decades. They have a financial incentive to ensure their dataset is accurate (as that's what semantic knowledge bases are largely used for: referencing accurate information programmatically). So they are a lot more trustworthy than "I found a Reddit post that says..."
I'm sure all the books they've scanned for their models contain factual information too, but books aren't updated in real time, whereas semantic knowledge bases are.
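As a minimal sketch of what "hooking it up" could look like: fetch a fact from a knowledge base programmatically and put it in front of the model, so the answer comes from the curated source rather than from memorized training data. This uses the public Wikidata SPARQL endpoint purely as a stand-in (the whole point above is that a paid, expert-curated base would be more trustworthy); the function names are illustrative, not any particular library's API.

    import requests

    WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

    def lookup_fact(sparql_query):
        """Query a semantic knowledge base (here: public Wikidata) and
        return the result bindings. A proprietary curated base would
        expose a similar programmatic interface."""
        resp = requests.get(
            WIKIDATA_SPARQL,
            params={"query": sparql_query, "format": "json"},
            headers={"User-Agent": "kb-grounding-sketch/0.1"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["results"]["bindings"]

    def grounded_prompt(question, facts):
        """Prepend the retrieved facts so the model answers from the
        curated source instead of whatever it absorbed from Reddit."""
        context = "\n".join("- " + f for f in facts)
        return ("Answer using ONLY the facts below; say 'unknown' if "
                "they don't cover it.\nFacts:\n" + context +
                "\n\nQuestion: " + question)

    # Example: Japan's population (Wikidata property P1082), pulled
    # live from the KB, which is updated continuously unlike a book.
    rows = lookup_fact("SELECT ?pop WHERE { wd:Q17 wdt:P1082 ?pop . }")
    facts = ["Population of Japan: " + r["pop"]["value"] for r in rows]
    print(grounded_prompt("What is the population of Japan?", facts))

The resulting prompt then goes to whatever LLM API you're using; the model does synthesis rather than recall, which sidesteps the "trained on Reddit" problem for anything the knowledge base covers.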