Comment by geokon

geokon 2 days ago

Big picture, why does everyone scrape the web?

Why doesn't one company do it and then resell the data? Is it a legal/liability issue? If you scrape it's a legal grey area, but if you sell what you scrape it's clearly copyright infringement?

utopiah 2 days ago

My bet is that they believe https://commoncrawl.org isn't good enough and, precisely as you are suggesting, the "rest" is where their competitive advantage might stem from.

ccgreg a day ago

Most academic AI research and AI startups find Common Crawl adequate for what they're doing. Common Crawl also has a lot of non-AI usage.

Jackson__ 2 days ago

Thinking that there is anything worth scraping past the llm-apocalypse is pure hubris imo. It is slop city out there, and unless you have an impossibly perfect classifier to detect it, 99.9% of all the great new "content" you scrape will be AI written.
E: In fact this whole idea is so stupid that I am forced to consider if it is just a DDoS in the original sense. Scrape everything so hard it goes down, just so that your competitors can't.

fragmede 21 hours ago

I think that there are lots of people who are working from "first principles" and haven't even heard of Common Crawl or don't know how to use it.
