ccgreg a day ago

Most academic AI research and AI startups find Common Crawl adequate for what they're doing. Common Crawl also has a lot of not-AI usage.

Jackson__ 2 days ago

Thinking that there is anything worth scraping past the llm-apocalypse is pure hubris imo. It is slop city out there, and unless you have an impossibly perfect classifier to detect it, 99.9% of all the great new "content" you scrape will be AI written.

E: In fact this whole idea is so stupid that I am forced to consider if it is just a DDoS in the original sense. Scrape everything so hard it goes down, just so that your competitors can't.

fragmede 21 hours ago

I think that there are lots of people who are working from "first principles" and haven't even heard of common crawl or know how to use it.