Comment by LoganDark
According to the article, nearly half of the dataset is synthetic (8T out of 17T tokens, roughly 47%). I don't know what constitutes "a breadth of state-of-the-art rephrasing approaches", but I have limited confidence in models trained on LLM output, so I hope it wasn't that.
> but I have limited confidence in models trained on LLM output, so I hope it wasn't that.
That's misguided. Models have been trained on synthetic data for two-plus years already. The "model collapse" myth comes from a very weak paper that got way more attention than it deserved (because negativity sells, I guess). In practice, every lab out there is doing this, because it works.