Comment by LoganDark 3 days ago

When ChatGPT first released and jailbreaks were pretty easy, I was able to get some extremely good, detailed output from it, with very few errors or weirdness. Now even when I can get jailbreaks to work with their newer models, it's just not the same, and no open-source model, or even commercial model, seems to have come close to the quality of that very first release. They're all just weird, dumb, random, or incoherent. I keep trying even the very large open-source or open-weights models, plus new versions of OpenAI's models, Claude, Gemini, and so on, but it all just sucks. It all feels like slop!

I'm convinced it's because that first ChatGPT release was probably trained on data almost entirely untainted by other LLMs, and it may never again be possible to obtain such a dataset. Every model since feels artificial and synthetic. I don't know for sure why this is, but I bet it has something to do with people thinking it's acceptable to programmatically generate almost half the dataset?! I feel like OpenAI's moat could have been the quality and authenticity of their dataset, since they scraped practically the entire internet before LLM output became widespread, but even they've probably lost that by now.

I haven't really internalized anything about "model collapse", other than that if you train an LLM on the outputs of other LLMs, you're training it to emulate an imprecise version of an imprecise version of human writing, which will be measurably and perceptibly worse than a single layer of imprecision over actual writing.
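
A toy way to see that compounding imprecision (my own sketch, not from any paper: a Gaussian stands in for the "model", and "training" is just fitting sample statistics, nothing like real LLM training):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(0.0, 1.0, 50)  # generation 0: "real" data, mean 0, std 1

    for gen in range(1, 201):
        # "train" a model on the current dataset: estimate its mean and std
        mu, sigma = data.mean(), data.std()
        # the next generation is trained purely on the previous model's samples
        data = rng.normal(mu, sigma, 50)
        if gen % 25 == 0:
            print(f"gen {gen:3d}: mu={mu:+.3f} sigma={sigma:.3f}")

On most seeds sigma wanders and decays toward zero while the mean drifts: each refit only captures the samples it happened to see, and the tail detail it misses never comes back. Real pipelines are obviously far messier, but that's the basic flavor of the degradation.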

wuschel 3 days ago

> I'm convinced it's because that first ChatGPT release was probably trained on data almost entirely untainted by other LLMs, and it may no longer ever be possible to obtain such a dataset again.

Interesting statement. But wouldn’t that mean that Google is in an even better position in regard to primary, or at least pristine data?