Comment by rnhmjoj 4 days ago

I don't understand why people resort to this tool instead of simply blocking by UA string or IP address. Are there really that many people running these AI crawlers?

I blackholed some IP blocks of OpenAI, Mistral and another handful of companies and 100% of this crap traffic to my webserver disappeared.
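
For illustration only, here is a minimal sketch of the kind of UA-string and IP-block filtering described above. It is not the poster's actual setup: the CIDR ranges are reserved documentation addresses (RFC 5737) standing in for real crawler ranges, and `is_blocked` and the substring list are hypothetical names chosen for this example.

```python
import ipaddress

# Placeholder ranges reserved for documentation (RFC 5737), NOT real crawler ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr) for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]

# Illustrative substrings, matched case-insensitively against the User-Agent header.
BLOCKED_UA_SUBSTRINGS = ("gptbot", "examplebot")


def is_blocked(remote_ip: str, user_agent: str) -> bool:
    """Return True if the request comes from a blocked range or a blocked UA."""
    ip = ipaddress.ip_address(remote_ip)
    if any(ip in network for network in BLOCKED_NETWORKS):
        return True
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_SUBSTRINGS)
```

A filter like this only helps against crawlers that identify themselves honestly or stay inside known address ranges.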

mnmalst 4 days ago

Because that solution simply does not work for everyone. People tried it, and the crawlers started using proxies with residential IPs.

hooverd 4 days ago

Less savory crawlers use residential proxies and are indistinguishable from malware traffic.

busterarm 4 days ago

Lots of companies now run this kind of crawler as part of their products.

They buy proxies and rotate through proxy lists constantly. It's all residential IPs, so blocking IPs actually hurts end users. Often it's the real IPs of VPN service customers, etc.

There are lots of companies around that you can buy this type of proxy service from.

WesolyKubeczek 4 days ago

You should read more. AI companies use residential proxies and mask their user agents with legitimate browser ones, so good luck blocking that.

rnhmjoj 4 days ago

Which companies are we talking about here? In my case the traffic was similar to what was reported here[1]: these are crawlers from Google, OpenAI, Amazon, etc. They behave really idiotically, but at least they report themselves correctly.
[1]: https://pod.geraspora.de/posts/17342163

- nemothekid 4 days ago
  
  OpenAI/Anthropic/Perplexity aren't the bad actors here. If they are, they are relatively simple to block - why would you implement an Anubis PoW MITM Proxy, when you could just simply block on UA?
  I get the sense many of the bad actors are simply poor copycats, building LLMs badly and scraping the entire web without a care in the world.
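
For readers unfamiliar with the term, a proof-of-work (PoW) gate makes each client burn some CPU before it gets a page. The following is a generic hashcash-style sketch of that idea, not Anubis's actual protocol; the challenge string, the difficulty measure (leading zero hex digits), and the function names `solve` and `verify` are all assumptions made for this example.

```python
import hashlib
from itertools import count


def _digest(challenge: str, nonce: int) -> str:
    """SHA-256 hex digest of the challenge concatenated with the nonce."""
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce whose digest starts with `difficulty` zeros."""
    prefix = "0" * difficulty
    for nonce in count():
        if _digest(challenge, nonce).startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    return _digest(challenge, nonce).startswith("0" * difficulty)
```

The asymmetry is the point: solving takes on the order of 16**difficulty hashes, verifying takes one, so the cost lands on whoever makes many requests regardless of their UA or IP.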
  
  rnhmjoj 4 days ago
  
  > why would you implement an Anubis PoW MITM Proxy, when you could just simply block on UA?
  That's in fact what I was asking: I've only seen traffic from these kinds of companies, and I've easily blocked them without an annoying PoW scheme.
  I have yet to see any of these bad actors and I'm interested in knowing who they actually are.
  
  whatevaa 3 days ago
  
  Huawei. Be happy that you haven't been hit by them yet.
  
majorchord 4 days ago

> AI companies use residential proxies
Source:

- Macha 4 days ago
  
  Source: Cloudflare
  https://blog.cloudflare.com/perplexity-is-using-stealth-unde...
  Perplexity's defense is that they're doing it not for training/KB-building crawls but to answer dynamic queries, and that this is apparently better.
  
  ranger_danger 4 days ago
  
  I do not see the words "residential" or "proxy" anywhere in that article... or any other text that might imply they are using those things. And personally... I don't trust crimeflare at all. I think they and their MITM-as-a-service have done more, and more lasting, damage to the global Internet and user privacy in general than all AI/LLMs combined.
  However, if this information is accurate... perhaps site owners should allow AI/bot user agents but respond with different content (or maybe a 404?) instead, to try to prevent it from making multiple requests with different UAs.
  
  Symbiote 4 days ago
  
  I had 500,000 residential IPs make 1-4 requests each in the past couple of days.
  These had the same user agent (latest Safari), but previously the agent has been varied.
  Blocking this shit is much more complicated than any blocking necessary before 2024.
  The data is available for free download in bulk (it's a university) and this is advertised in several places, including the 429 response, the HTML source and the API documentation, but the AI people ignore this.
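
To make the pattern above concrete, here is a small sketch of why a per-IP request limit does not catch this kind of traffic. The numbers are a scaled-down, made-up stand-in for the figures in this comment (1,000 addresses instead of 500,000), and `flagged_ips` is a hypothetical helper, not any real tool's API.

```python
from collections import Counter


def flagged_ips(request_ips, per_ip_limit):
    """Return the set of addresses that exceeded the per-IP request limit."""
    counts = Counter(request_ips)
    return {ip for ip, n in counts.items() if n > per_ip_limit}


# Hypothetical traffic: 1,000 distinct addresses, each making 1-4 requests.
requests = [
    f"10.0.{i // 256}.{i % 256}"
    for i in range(1000)
    for _ in range(1 + i % 4)
]

total = len(requests)               # aggregate load on the server
flagged = flagged_ips(requests, 10)  # per-IP limiter with a limit of 10
```

Here the server sees 2,500 requests in total, yet no single address ever exceeds the limit, so the per-IP limiter flags nothing.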
  
  Dylan16807 4 days ago
  
  Well yes it is better. It's a page load triggered by a user for their own processing.
  If web security worked a little differently, the requests would likely come from the user's browser.
  