Comment by trollbridge 2 days ago
Yeah, it serves the purpose of blocking this kind of proxy traffic that isn't in Google's own best interests.
Only Google is allowed to scrape the web.
Common Crawl provides gzipped robots.txt collections
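For anyone who wants to poke at those collections, here is a minimal sketch, assuming Common Crawl's usual per-crawl layout in which a robotstxt.paths.gz file lists the gzipped WARC files of robots.txt captures (the crawl ID below is illustrative):

    import gzip
    import urllib.request

    # Illustrative crawl ID; Common Crawl publishes one archive per crawl.
    CRAWL = "CC-MAIN-2024-10"
    PATHS_URL = f"https://data.commoncrawl.org/crawl-data/{CRAWL}/robotstxt.paths.gz"

    # Fetch the gzipped index listing the robots.txt WARC files for this crawl.
    with urllib.request.urlopen(PATHS_URL) as resp:
        paths = gzip.decompress(resp.read()).decode("utf-8").splitlines()

    # Each entry is a WARC file of robots.txt responses, served from the same host.
    for p in paths[:5]:
        print(f"https://data.commoncrawl.org/{p}")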
Google does not use residential proxies.
This does nothing against your ability to scrape the web the Google way, AKA from your own assigned IP range, obeying robots.txt, and with a user agent that explicitly says what you're doing and gives website owners a way to opt out.
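For illustration, a minimal sketch of that kind of declared, opt-out-friendly scraping using only Python's standard library (the user agent string and URLs are placeholders, not a real crawler):

    import urllib.request
    from urllib.robotparser import RobotFileParser

    # Placeholder identity; real crawlers document themselves at the linked URL.
    USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot-info)"
    TARGET = "https://example.com/some/page"

    # Check robots.txt first so site owners can opt out.
    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    if rp.can_fetch(USER_AGENT, TARGET):
        req = urllib.request.Request(TARGET, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            print(f"Fetched {len(resp.read())} bytes")
    else:
        print("Disallowed by robots.txt; skipping.")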
What Google doesn't want (and I don't think that's a bad thing) is competitors scraping the web in bad faith, without disclosing what they're doing to site owners and without giving them the ability to opt out.
If Google doesn't stop these proxies, unscrupulous parties will have a competitive advantage over Google; it's that simple. Then Google will have to decide between just giving up (unlikely) or becoming unscrupulous themselves.
LLMs aren't a good indicator of success here, because an LLM trained on 80% of the data is just as good as one trained on 100%, assuming the type/category of data is distributed evenly. Proxies help when you do need access to 100% of the data, including data behind social media login walls.
That's the whole point. Websites that try to block scraping attempts will let Google scrape without any hurdles because of Google's ads and search network. This gives Google an advantage over new players, because as a new brand you are hardly going to convince a website to allow scraping, even if your product is actually more advantageous to the website (for example, suppose you made a search engine that doesn't suck like Google and aggregates links instead of copying content from your website).
Proxies, in comparison, can give new players a fighting chance. That said, I doubt any legitimate & ethical business would use proxies.
I don't think the parent post is claiming that Google is using other people's networks to scrape the web, only that it has a strong incentive to keep other players from doing that.
No, there are other scrapers that Google doesn't block or interact with. You can even run scraping from GCP. This has nothing to do with "only Google is allowed to scrape". They even host apps which exist for scraping data, like https://play.google.com/store/apps/details?id=com.sociallead...
"Only Google is allowed to scrape the web."
If I'm not mistaken, the plaintiffs in the US v Google antitrust litigation in the DC Circuit tried to argue that website operators are biased toward allowing Google to crawl and against allowing other search engines to do the same.
The court rejected this argument because the plaintiffs did not present any evidence to support it.
For someone who does not follow the web's history: how would one produce direct evidence that such a bias exists?