Comment by edg5000 a day ago

25 replies

Residential proxies are the only way to crawl and scrape. It's ironic that this article comes from the biggest scraping company that has ever existed!

If you crawl at 1 Hz per crawled IP, no reasonable server would suffer from this. It's the few bad apples (impatient people who don't rate-limit) who ruin the internet for users and hosters alike. And then there's Google.
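
A well-behaved crawler needs only a few lines of bookkeeping to stay at that rate. A rough sketch in plain stdlib Python (per-host timestamps; the function name is made up):

    # rough sketch: at most one request per second to any given host
    import time
    import urllib.parse
    import urllib.request

    last_fetch = {}  # host -> monotonic time of the last request sent to it

    def polite_get(url, min_interval=1.0):
        host = urllib.parse.urlsplit(url).netloc
        wait = min_interval - (time.monotonic() - last_fetch.get(host, float("-inf")))
        if wait > 0:
            time.sleep(wait)  # never hit the same host more than ~once per second
        last_fetch[host] = time.monotonic()
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()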

mrweasel a day ago

First off: Google has not once crashed one of our sites with Googlebot. They have never tried to bypass our caching, and they are open and honest about their IP ranges, allowing us to rate-limit if needed.
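
Google even publishes the Googlebot ranges as JSON, so you can verify the bot before deciding how to rate-limit. A rough sketch of the idea (assuming the usual prefixes/ipv4Prefix layout of that file; how you plug it into your stack is up to you):

    # rough sketch: is this client IP inside Googlebot's published ranges?
    import ipaddress
    import json
    import urllib.request

    GOOGLEBOT_RANGES = "https://developers.google.com/search/apis/ipranges/googlebot.json"

    def googlebot_networks():
        data = json.load(urllib.request.urlopen(GOOGLEBOT_RANGES, timeout=10))
        return [ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
                for p in data["prefixes"]]

    def is_googlebot(client_ip, networks):
        ip = ipaddress.ip_address(client_ip)
        return any(ip in net for net in networks)  # e.g. exempt verified Googlebot from throttling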

Residential proxies are not needed if you behave. My take is that you want to scrape stuff that site owners do not want to give you, and you don't want to be told no or perhaps have to pay for a license. That is the only case where I can see you needing a residential proxy.

  • TZubiri a day ago

    > Residential proxies are not needed if you behave

    I'm starting to think that some users on Hacker News do not 'behave', or at least think that they do not need to 'behave', and provide an alibi for those that do not 'behave'.

    That the hacker in Hacker News attracts not just hackers as in 'hacking features together', but also hackers as in 'illegitimately gaining access to servers/data'.

    As far as I can tell, as a hacker who hacks features together, resi proxies are something the enemy uses. Whenever I boot up a server and get 1000 login requests per second, plus requests for commonly exploited files from Russian and Chinese IPs, those come from resi IPs, no doubt. There are two sides to this match, no more.

  • tonymet a day ago

    You can’t get much crawling done from published cloud IPs. Residential proxies are the only way to do most crawls today.

    That said, I support Google working to shut these networks down, since they are almost universally bad.

    It’s just a shame that there’s nowhere to go for legitimate crawling activities.

    • mrweasel 21 hours ago

      > You can’t get much crawling done from published cloud IPs.

      Think about why that might be. I'm sorry, but if you legitimately need to crawl the net and do so from a cloud provider, your industry screwed you over with its bad behaviour. Go get hosting with a company that cares about who their customers are; you're hanging out with a bad crowd.

      • tonymet 21 hours ago

        What industry is that? Every industry is on the cloud.

Ronsenshi a day ago

One thing about Google is that many anti-scraping services explicitly allow access to Google and maybe a couple of other search engines. Everybody else gets to enjoy Cloudflare CAPTCHAs, even when crawling at reasonable speeds.

Rules For Thee but Not for Me

  • chii a day ago

    > many anti-scraping services explicitly allow access to Google and maybe a couple of other search engines.

    because Google (and the couple of other search engines) provide enough value to offset the crawler's resource consumption.

    • JasonADrury a day ago

      That's cool, but it's impossible for anyone to ever build a competitor that could replace Google without bypassing such services.

  • ehhthing a day ago

    You say this like robots.txt doesn't exist.

    • toofy a day ago

      it almost sounds like they’re saying the contents of robots.txt shouldn’t matter… because google exists? or something?

      implying “robots.txt explicitly says i can’t scrape their site, well i want that data, so i'm directing my bot to take it anyway.”
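
      honoring it costs a scraper almost nothing, by the way. a rough sketch with python's stdlib (the urls and user agent are placeholders):

          # check robots.txt before fetching; example.com and MyCrawler are made up
          import urllib.robotparser

          rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
          rp.read()

          if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page"):
              print("allowed, go fetch it")
          else:
              print("disallowed, leave it alone")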

    • sitzkrieg a day ago

      so many things flat out ignore it in 2026 let's be real

  • ErroneousBosh a day ago

    Why are you scraping sites in the first place? What legitimate reason is there for you to do that?

    • Ronsenshi 21 hours ago

      Just today I wanted to get a list of locations for various art events around the city, which are all listed on the same website, but the site does not provide a page showing all events happening this month on a map. I need a single map to figure out what I want to visit based on the distance I have to travel; unfortunately that's not an option. The only option is to go through hundreds of items and hope whatever I pick is near me.

      Do you think this is such a horrible thing to scrape? I can't do it manually since there are a few hundred locations. I could write some Python script which uses Playwright to scrape things using my desktop browser in order to avoid Cloudflare. Or, since I'm much more familiar with it, I could write a Python script that uses BeautifulSoup to extract all the relevant locations once for me, roughly like the sketch below. I would have been perfectly happy fetching 1 page/sec or even 1 page/2 seconds and would still be done within 20 minutes if only there were no anti-scraping protection.
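
      Something like this rough sketch is all it would take (the URL and CSS selectors are made up; the real site's markup would differ):

          # rough sketch: fetch each event page at 1 page / 2 seconds and pull out the venue
          import time
          from urllib.parse import urljoin

          import requests
          from bs4 import BeautifulSoup

          BASE = "https://events.example/"  # placeholder for the real site

          index = BeautifulSoup(requests.get(urljoin(BASE, "events"), timeout=10).text, "html.parser")
          locations = []
          for link in index.select("a.event-link"):      # selector is a guess
              page = BeautifulSoup(requests.get(urljoin(BASE, link["href"]), timeout=10).text, "html.parser")
              venue = page.select_one(".venue-address")  # selector is a guess
              if venue:
                  locations.append(venue.get_text(strip=True))
              time.sleep(2)  # the 1 page / 2 seconds mentioned above

          print(locations)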

      Scraping is a perfectly legal activity, after all. Except that, thanks to overly eager scraping bots and the clueless/malicious people who run them, there's very little chance for anyone trying to compete with Google, or even to do small-scale scraping to make their own life and the lives of local art enthusiasts easier. Google owns search. Google IS search, and no competition is allowed, it seems.

      • ErroneousBosh 19 hours ago

        If you want the data, why not contact the organisation with the website?

        Why is hammering the everloving fuck out of their website okay?

        • Saris 15 hours ago

          1 request per second is nowhere even close to hammering a website.

          They already made the data available on the website; there's no reason to contact them when you can just load it from their website.

    • digiown a day ago

      Dunno, building a Google competitor? How do you think Google got started?

    • Saris 14 hours ago

      I've used change detection for in-stock alerts and event updates before. Plenty of legitimate uses.

toofy a day ago

do we think a scraper should be allowed to take whatever means necessary to scrape a site if that site explicitly denies that scraper access?

if someone is abusing my site, and i block them in an attempt to stop that abuse, do we think that they are correct to tell me it doesn’t matter what i think and to use any methods they want to keep abusing it?

that seems wrong to me.

megous a day ago

I'd still like the ability to just block a crawler by its IP range, but these days nope.

1 Hz is 86400 hits per day, or 600k hits per week. That's just one crawler.

Just checked my access log... 958k hits in a week from 622k unique addresses.

95% of it is fetching random links from the u-boot repository that I host, in a completely random pattern. I blocked all of the GCP/AWS/Alibaba and of course Azure cloud IP ranges.

It's now almost all coming from "residential" and "mobile" IP address space in completely random places all around the world. I'm pretty sure my u-boot fork is not that popular. :-D

Every request comes from a new IP address, and the available IP space of the crawler(s) spans millions of addresses.

I don't host a popular repo. I host a bot attraction.

  • kstrauser a day ago

    I’ve been enduring that exact same traffic pattern.

    I used Anubis and a cookie redirect to cut the load on my Forgejo server by around 3 orders of magnitude: https://honeypot.net/2025/12/22/i-read-yann-espositos-blog.h...

    • plagiarist 20 hours ago

      Aha, that's where the anime girl is from. What sort of traffic was getting past that but still thwarted by the cookie tactic?

      I guess the bots are all spoofing consumer browser UAs and just the slightest friction outside of well-known tooling will deter them completely.

      • kstrauser 18 hours ago

        Yep, that’s why that’s all over the place now. The cookie thing is more of a first line of defense. It turns away a lot of shoddy scrapers with nearly no resources on my side. Anubis knocks out almost all of the remainder.
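
        The cookie trick itself is tiny. A toy sketch of the idea (not my actual setup; the cookie name and framework are arbitrary):

            # toy sketch of a cookie-redirect gate: the first visit sets a cookie and
            # redirects back to the same URL, so clients that don't carry cookies
            # across redirects never reach the real handlers
            from flask import Flask, make_response, redirect, request

            app = Flask(__name__)
            COOKIE = "saw_redirect"  # arbitrary name

            @app.before_request
            def require_cookie():
                if request.cookies.get(COOKIE) != "1":
                    resp = make_response(redirect(request.url))
                    resp.set_cookie(COOKIE, "1")
                    return resp

            @app.route("/")
            def index():
                return "real content"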