Comment by Ronsenshi a day ago

One thing about Google is that many anti-scraping services explicitly allow access to Google and maybe a couple of other search engines. Everybody else gets to enjoy the Cloudflare captcha, even when crawling at reasonable speeds.

Rules For Thee but Not for Me

chii a day ago

> many anti-scraping services explicitly allow access to Google and maybe a couple of other search engines.

because Google (and the couple of other search engines) provide enough value to offset the crawler's resource consumption.

  • JasonADrury a day ago

    That's cool, but it's impossible for anyone to ever build a competitor that'd replace google without bypassing such services.

ehhthing a day ago

You say this like robots.txt doesn't exist.
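For what it's worth, honoring robots.txt takes only the standard library. A minimal sketch, assuming a hypothetical policy of the kind discussed here (Googlebot allowed, everyone else disallowed; the user-agent names and URL are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything, everyone else nothing.
robots_txt = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check whether a given crawler may fetch a given URL.
print(rp.can_fetch("Googlebot", "https://example.com/events"))      # True
print(rp.can_fetch("MyCrawler/1.0", "https://example.com/events"))  # False
```

In a real crawler you'd call `rp.set_url(...)` and `rp.read()` to fetch the live robots.txt instead of parsing an inline string.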

  • toofy 21 hours ago

    it almost sounds like they’re saying the contents of robots.txt shouldn’t matter… because google exists? or something?

    implying “robots.txt explicitly says I can’t scrape their site, well I want that data, so I’m directing my bot to take it anyway.”

  • sitzkrieg 21 hours ago

    so many things flat out ignore it in 2026, let's be real

ErroneousBosh a day ago

Why are you scraping sites in the first place? What legitimate reason is there for you doing that?

  • Ronsenshi 18 hours ago

    Just today I wanted to get a list of locations for various art events around the city. They're all on the same website, but it doesn't provide a page showing all of this month's events on a map. I need a single map to figure out what to visit based on the distance I'd have to travel; unfortunately that's not an option. The only option is to click through hundreds of items and hope whatever I picked is near me.

    Do you think this is such a horrible thing to scrape? I can't do it manually since there are a few hundred locations. I could write a Python script that uses Playwright to drive my desktop browser and so avoid Cloudflare. Or, and this is what I'm much more familiar with, I could write a Python script that uses BeautifulSoup to extract all the relevant locations for me in one pass. I would have been perfectly happy fetching 1 page/sec, or even 1 page every 2 seconds, and would still have been done within 20 minutes if only there were no anti-scraping protection.

    Scraping is a perfectly legal activity, after all. But thanks to overly eager scraping bots and the clueless or malicious people who run them, there's very little chance for anyone to compete with Google, or even to do small-scale scraping to make their own life and the lives of local art enthusiasts easier. Google owns search. Google IS search, and no competition is allowed, it seems.
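The throttled scrape described above is a few lines of Python. A sketch, assuming `requests` and `bs4` are installed and that the `.event-location` selector, user-agent string, and URLs are all hypothetical stand-ins for the real site's markup:

```python
import time
import requests
from bs4 import BeautifulSoup

def extract_locations(html: str) -> list:
    """Pull location strings out of one event page.

    The CSS class below is made up; real markup differs per site.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(".event-location")]

def scrape(urls: list, delay: float = 2.0) -> list:
    """Fetch each event page at a polite rate of one page per `delay` seconds."""
    locations = []
    for url in urls:
        resp = requests.get(url, headers={"User-Agent": "art-events-map/0.1"})
        resp.raise_for_status()
        locations.extend(extract_locations(resp.text))
        time.sleep(delay)  # throttle: a few hundred pages still finishes in ~20 min
    return locations
```

At one page every two seconds, a few hundred pages is well under the 20 minutes mentioned above, which is the commenter's point about reasonable crawl rates.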

    • ErroneousBosh 15 hours ago

      If you want the data, why not contact the organisation with the website?

      Why is hammering the everloving fuck out of their website okay?

      • Saris 11 hours ago

        1 request per second is nowhere even close to hammering a website.

        They made the data available on the website already, there's no reason to contact them when you can just load it from their website.

  • Saris 11 hours ago

    I've used change detection for in-stock alerts and event updates before. Plenty of legitimate uses.
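The change-detection idea needs nothing beyond the standard library: fetch the page on a schedule, hash it, and alert when the hash differs from the one saved last run. A bare sketch (a real checker would first strip volatile parts of the page like timestamps and ads):

```python
import hashlib
from typing import Optional

def page_fingerprint(html: str) -> str:
    """Hash the page content so a later fetch can be compared cheaply."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def changed(html: str, last_seen: Optional[str]) -> bool:
    """True if the page differs from the fingerprint saved on the last run
    (or if there is no saved fingerprint yet)."""
    return page_fingerprint(html) != last_seen
```

Each run stores the fingerprint somewhere persistent (a file or a SQLite row will do) and sends the alert whenever `changed` returns True.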

  • digiown 19 hours ago

    Dunno, building a Google competitor? How do you think Google got started?