msp26 9 hours ago

> because there's already concern that AI models are getting worse. The models are being fed on their own AI slop and synthetic data in an error-magnifying doom-loop known as "model collapse."

Model collapse is a meme that assumes zero agency on the part of the researchers.

I'm unsure how you can reach this conclusion after trying any of the new models. In the frontier size bracket we have models like Opus 4.5 that are significantly better at writing code and using tools independently. In the mid tier, Gemini 3.0 Flash is absurdly good and is crushing the previous baseline on some of my (visual) data extraction projects. And small models are much better overall than they used to be.
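
(For readers who haven't seen the term: the "doom loop" in the quote is usually demonstrated with a toy experiment like the one below, where a model is refit generation after generation on nothing but its own samples. This is only an illustration, not anyone's actual training pipeline.)

```python
import numpy as np

# Toy caricature of the "model collapse" loop: each generation is a Gaussian
# fitted to samples drawn from the previous generation's Gaussian, i.e. a
# model trained purely on its own output. Estimation error compounds and the
# learned variance drifts toward zero over generations.
rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0           # generation 0: the "real data" distribution
n_samples = 50                 # deliberately small so the drift is visible

for gen in range(1, 201):
    samples = rng.normal(mu, sigma, n_samples)   # "synthetic data" from the last model
    mu, sigma = samples.mean(), samples.std()    # refit the next model on it
    if gen % 40 == 0:
        print(f"generation {gen:3d}: mu={mu:+.3f}, sigma={sigma:.3f}")
```

The point of the parent comment is precisely that real pipelines don't look like this: labs keep mixing in fresh human data and curated corpora, which is the agency the naive loop ignores.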

Ifkaluva 8 hours ago

The big labs spend a ton of effort on dataset curation.

It goes further than just preventing poison: they do a lot of testing on the dataset to find the incremental data that produces the best improvements in model performance, and even train proxy models that predict whether a given piece of data will improve performance or not. “Data Quality” is usually a huge division with a big budget.
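
A minimal sketch of the proxy-model idea, with made-up documents and labels rather than any lab's actual pipeline: fit a cheap classifier on data whose contribution you have already measured, then use it to score new candidates before they enter the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled history: documents plus whether including them
# improved a held-out eval in past ablations (1 = helped, 0 = didn't).
docs = [
    "clean, well-edited reference text ...",
    "keyword-stuffed spam page ...",
    "thorough code tutorial with tests ...",
    "auto-generated boilerplate listicle ...",
]
labels = [1, 0, 1, 0]

# Cheap proxy model: predicts from surface features whether a new
# document is likely to help if added to the training corpus.
quality_proxy = make_pipeline(TfidfVectorizer(), LogisticRegression())
quality_proxy.fit(docs, labels)

candidates = ["a new scraped page about distributed systems ..."]
scores = quality_proxy.predict_proba(candidates)[:, 1]

# Keep only candidates the proxy scores above a chosen threshold.
keep = [doc for doc, score in zip(candidates, scores) if score > 0.5]
```

The labels come from the kind of ablation testing described above; the proxy pays for itself because scoring a document is vastly cheaper than rerunning a training ablation.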

soulofmischief 8 hours ago

Even if it's a meme for the general public, actual ML researchers do have to document, understand and discuss the concept of model collapse in order to avoid it.

biophysboy 7 hours ago

Yes, this particular threat seems silly to me. Isn't it a standard thing to roll back databases? If the database gets worse, roll it back and change your data ingestion approach.
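
A rough sketch of what that could mean in practice, assuming the training corpus is kept as immutable, versioned snapshots (the names and layout here are invented for illustration):

```python
import json
import shutil
from pathlib import Path

SNAPSHOT_ROOT = Path("corpus_snapshots")   # illustrative layout, not a real tool

def publish_snapshot(version: str, ingest_dir: Path) -> None:
    """Freeze an ingestion run as an immutable, versioned snapshot."""
    shutil.copytree(ingest_dir, SNAPSHOT_ROOT / version)
    (SNAPSHOT_ROOT / "CURRENT.json").write_text(json.dumps({"version": version}))

def rollback(version: str) -> None:
    """Point training jobs back at an earlier, known-good snapshot."""
    if not (SNAPSHOT_ROOT / version).exists():
        raise FileNotFoundError(f"no snapshot named {version}")
    (SNAPSHOT_ROOT / "CURRENT.json").write_text(json.dumps({"version": version}))
```

Rolling the data back is the easy half; detecting that a new ingestion run made things worse is the eval and data-quality problem described upthread.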

stonogo 5 hours ago

The common thread from all the frontier orgs is that the datasets are too big to vet, and they're spending lots of money on lobbying to ensure they don't get punished for that. In short, the current corporate stance seems to be that they have zero agency, so which is it?

  • NewsaHackO 4 hours ago

    Huh? Unless you are talking about the DMCA, I haven't heard about that at all. Most AI companies go to great lengths to prevent their models from reproducing copyrighted material.

conartist6 8 hours ago

Well, they seem to have zero agency. They left child pornography in the training sets. The people gathering the data committed enormous crimes, wantonly. Science is disintegrating along with public trust in it, as fake papers peer-reviewed by fake peer reviewers slop along. And from what I hear, there has been no training on the open internet in recent years because it's simply too toxic.