Comment by boznz
Wake me back up when LLMs have a way to fact-check and correct their training data in real time.
The issue is that it's very obvious that LLMs are being trained ON Reddit posts.
Doesn't really matter. All of the gains made before any funding collapse will exist.
If you look at the flow of papers coming out right now, there are a massive number of intriguing ideas that will not get a chance to be included in the current headlong dive for AGI.
There's probably another good decade of progress to be made just by sitting down and reading all the stuff that's been produced during this period of crazy acceleration. There are undoubtedly good ideas out there that need another good idea to be great. That other good idea might already exist, but the two have yet to lock eyes across a crowded dance floor.
How is it possible that we have not figured out how to do this ourselves?
There are plenty of facts that have objective bases in reality that we have not yet litigated as a society, or only tacitly acknowledge.
There are an order of magnitude more subjective details about reality that we do not agree on.
They could have done that years ago; it's just that nobody seems to do it. Just hook it up to curated semantic knowledge bases.
Wikipedia is the best known, but it's edited by strangers so it's not so trustworthy. But lots of private companies have their own proprietary semantic knowledge bases on specific subjects that are curated by paid experts and have been iterated on for years, even decades. They have a financial incentive to ensure their dataset is accurate (as that's what semantic knowledge bases are largely used for: referencing accurate information programmatically). So they are a lot more trustworthy than "I found a Reddit post that says..."
I'm sure all the books they've scanned for their models contain factual information too, but books aren't updated in real time, whereas semantic knowledge bases are.
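As a minimal sketch of what "hooking it up" could look like: fetch a fact from a knowledge base programmatically and put it in front of the model, so the answer comes from the curated source rather than from memorized training data. This uses the public Wikidata SPARQL endpoint purely as a stand-in (the whole point above is that a paid, expert-curated base would be more trustworthy); the function names are illustrative, not any particular library's API.

    import requests

    WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

    def lookup_fact(sparql_query):
        """Query a semantic knowledge base (here: public Wikidata) and
        return the result bindings. A proprietary curated base would
        expose a similar programmatic interface."""
        resp = requests.get(
            WIKIDATA_SPARQL,
            params={"query": sparql_query, "format": "json"},
            headers={"User-Agent": "kb-grounding-sketch/0.1"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["results"]["bindings"]

    def grounded_prompt(question, facts):
        """Prepend the retrieved facts so the model answers from the
        curated source instead of whatever it absorbed from Reddit."""
        context = "\n".join("- " + f for f in facts)
        return ("Answer using ONLY the facts below; say 'unknown' if "
                "they don't cover it.\nFacts:\n" + context +
                "\n\nQuestion: " + question)

    # Example: Japan's population (Wikidata property P1082), pulled
    # live from the KB, which is updated continuously unlike a book.
    rows = lookup_fact("SELECT ?pop WHERE { wd:Q17 wdt:P1082 ?pop . }")
    facts = ["Population of Japan: " + r["pop"]["value"] for r in rows]
    print(grounded_prompt("What is the population of Japan?", facts))

The resulting prompt then goes to whatever LLM API you're using; the model does synthesis rather than recall, which sidesteps the "trained on Reddit" problem for anything the knowledge base covers.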