Szpadel 12 hours ago

I had to deal with some bot activity that used a huge address space, and I tried something very similar: when a condition confirming a bot was detected, I banned that IP for 24h

but due to the number of IPs involved, this did not have any impact on the abusive traffic

my suggestion is to look very closely at the headers you receive (varnishlog is very nice for this), and if you stare at them long enough you might spot something that all those requests have in common that would allow you to easily identify them (like a very specific and unusual combination of reported language and geolocation, or the same outdated browser version, etc.)
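
For example, something like this quick pass over a log dump (the tab-separated format of client IP, Accept-Language and User-Agent is an assumption, e.g. from varnishncsa with a custom format; adjust the fields to whatever you actually log):

    # Count how many distinct client IPs share each (Accept-Language, User-Agent)
    # pair; combinations shared by a huge number of IPs are bot-fleet candidates.
    import sys
    from collections import defaultdict

    ips_per_fingerprint = defaultdict(set)

    for line in sys.stdin:
        try:
            client_ip, accept_language, user_agent = line.rstrip("\n").split("\t")
        except ValueError:
            continue  # skip malformed lines
        ips_per_fingerprint[(accept_language, user_agent)].add(client_ip)

    for combo, ips in sorted(ips_per_fingerprint.items(),
                             key=lambda kv: len(kv[1]), reverse=True)[:20]:
        print(len(ips), combo)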

  • conradev 11 hours ago

    My favorite example of this was how folks fingerprinted the active probes of the Great Firewall of China. It has a large pool of IP addresses to work with (i.e. all ISPs in China), but the TCP timestamps were shared across a small number of probing machines:

    "The figure shows that although the probers use thousands of source IP addresses, they cannot be fully independent, because they share a small number of TCP timestamp sequences"

    https://censorbib.nymity.ch/pdf/Alice2020a.pdf
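
    A minimal sketch of looking for that kind of shared clock, assuming you have already extracted (wall clock time, TSval, source IP) samples from a packet capture and can estimate the tick rate:

        # Group source IPs by the inferred origin of their TCP timestamp clock.
        # TICK_HZ and BUCKET are assumptions; common tick rates are 100/250/1000 Hz.
        from collections import defaultdict

        TICK_HZ = 1000
        BUCKET = 10_000  # coarseness of the clock-origin grouping, in ticks

        def cluster_by_timestamp_clock(samples):
            """samples: iterable of (wall_time_seconds, tsval, src_ip)."""
            clusters = defaultdict(set)
            for wall_time, tsval, src_ip in samples:
                # Hosts sharing one clock keep tsval - TICK_HZ * wall_time roughly
                # constant, no matter which source IP the connection uses.
                origin = tsval - int(TICK_HZ * wall_time)
                clusters[origin // BUCKET].add(src_ip)
            return clusters

        # Clusters containing many distinct IPs point at a shared prober.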

  • trod1234 7 hours ago

    If you just block the connection, you send a signal that you are blocking it, and they will change it. You need to impose a cost on every connection through QoS buckets.

    If they rotate IPs, ban by ASN; have a page with some randomized, pseudo-random-looking content in the source (not static), and explain that the traffic allocated to this ASN has exceeded normal user limits and has been rate limited (to a crawl).

    Have graduated responses starting at a 72-hour ban, where every page thereafter, regardless of URI, results in that page and rate limit. Include a contact email address that is dynamically generated per bucket, and validate that all inbound mail matches DMARC for Amazon. Be ready to provide a log of abusive IP addresses.

    That way, if Amazon wants to take action, they can, but it's in their ballpark. You gatekeep what they can do on your site with your bandwidth. Letting them run hog wild and steal bandwidth from you programmatically is unacceptable.
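
    A minimal sketch of that graduated scheme (the ASN lookup is a placeholder, the request budget and ban windows are made-up numbers, and the dynamically generated contact page is left out):

        # Per-ASN request budget with escalating throttle windows.
        import time
        from collections import defaultdict

        BAN_STEPS = [72 * 3600, 7 * 24 * 3600]   # escalating throttle windows, seconds
        REQS_PER_MIN_LIMIT = 600                  # assumed "normal user" budget per ASN

        state = defaultdict(lambda: {"hits": [], "strikes": 0, "banned_until": 0.0})

        def asn_for_ip(ip: str) -> int:
            raise NotImplementedError("plug in an offline ASN database lookup here")

        def check(ip: str) -> str:
            """Return 'allow' or 'throttle' for one incoming request."""
            now = time.time()
            s = state[asn_for_ip(ip)]
            if now < s["banned_until"]:
                return "throttle"                 # serve the slow explanation page
            s["hits"] = [t for t in s["hits"] if now - t < 60] + [now]
            if len(s["hits"]) > REQS_PER_MIN_LIMIT:
                step = min(s["strikes"], len(BAN_STEPS) - 1)
                s["banned_until"] = now + BAN_STEPS[step]
                s["strikes"] += 1
                return "throttle"
            return "allow"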

  • aaomidi 12 hours ago

    Maybe ban ASNs /s

    • koito17 10 hours ago

      This was indeed one mitigation used by a site to prevent bots hosted on AWS from uploading CSAM and generating bogus reports to the site's hosting provider.[1]

      In any case, I agree with the sarcasm. Blocking data center IPs may not help the OP, because some of the bots are resorting to residential IP addresses.

      [1] https://news.ycombinator.com/item?id=26865236
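
      For the AWS-hosted case, a minimal sketch of that kind of data-center filter (IPv6 prefixes, caching and error handling omitted; it obviously does nothing about residential IPs):

          # Check whether a client IP falls inside AWS's published ranges.
          import ipaddress
          import json
          import urllib.request

          AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

          def load_aws_networks():
              with urllib.request.urlopen(AWS_RANGES_URL) as resp:
                  data = json.load(resp)
              return [ipaddress.ip_network(p["ip_prefix"]) for p in data["prefixes"]]

          def is_aws(ip: str, networks) -> bool:
              addr = ipaddress.ip_address(ip)
              return any(addr in net for net in networks)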

      • pixl97 6 hours ago

        Yeah, if it's also coming from residences, it's probably some kind of botnet

  • superjan 11 hours ago

    Why work hard… Train a model to recognize the AI bots!

    • js4ever 10 hours ago

      Because you have to decide in less than 1 ms; using AI is too slow in that context

      • Dylan16807 5 hours ago

        You can delay the first request from an IP by a lot more than that without causing problems.
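
        For example, as a WSGI middleware (a minimal sketch: in-memory state only, and the delay value is made up):

            # Delay the first request from each previously unseen IP.
            import time

            FIRST_HIT_DELAY_SECONDS = 2.0
            seen_ips = set()

            class FirstRequestDelay:
                def __init__(self, app):
                    self.app = app

                def __call__(self, environ, start_response):
                    ip = environ.get("REMOTE_ADDR", "")
                    if ip not in seen_ips:
                        seen_ips.add(ip)
                        time.sleep(FIRST_HIT_DELAY_SECONDS)  # cheap tarpit for new IPs
                    return self.app(environ, start_response)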

    • trod1234 7 hours ago

      This isn't a problem domain that models are capable of solving.

      Ultimately in two party communications, computers are mostly constrained by determinism, and the resulting halting/undecidability problems (in core computer science).

      All AI models are really bad at solving stochastic types of problems. They can approximate, but generally only up to a point, after which accuracy falls off. Temporal consistency in time-series data is also a major weakness. Put the two together and models can't really solve it; they can pattern match to a degree, but that is the limit.

      • seethenerdz 4 hours ago

        When all you have is a Markov generator and $5 billion, every problem starts to look like a prompt. Or something like that.

to11mtm 12 hours ago

Uggh, web crawlers...

8ish years ago, at the shop I worked at, we had a server taken down. It was an image server for vehicles. How did it go down? Well, the crawler in question somehow had access to vehicle image links we had due to our business. Unfortunately, the perfect storm of the image not actually existing (can't remember why, mighta been one of those weird cases where we did a re-inspection without issuing a new inspection ID) resulted in them essentially DoSing our condition report image server. Worse, there was a bug in the error handler somehow, such that the server process restarted when this condition happened. This had the -additional- disadvantage of invalidating our 'for .NET 2.0, pretty dang decent' caching implementation...

It comes to mind because I'm pretty sure we started doing some canary techniques just to be safe. (Ironically, doing some simple ones was still cheaper than even adding a different web server... yes, we also fixed the caching issue... yes, we also added a way to 'scream' if we got too many bad requests on that service.)

shakna 11 hours ago

When I was writing a crawler for my search engine (now offline), I found almost no crawler library actually compliant with the real world. So I ended up going to a lot of effort to write one that complied with Amazon's and Google's rather complicated nested robots files, including respecting the cool-off periods as requested.

... And then found their own crawlers can't parse their own manifests.
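
For a sense of the baseline, the standard-library parser covers the basic directives and not much else (a minimal sketch; example.com is a stand-in, and none of the messier nesting is handled):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    agent = "MyCrawler/1.0"
    if rp.can_fetch(agent, "https://example.com/some/page"):
        delay = rp.crawl_delay(agent)     # None if robots.txt sets no Crawl-delay
        rate = rp.request_rate(agent)     # None, or a named tuple (requests, seconds)
        print("allowed; crawl-delay:", delay, "request-rate:", rate)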

  • bb010g 10 hours ago

    Could you link the source of your crawler library?

    • shakna 4 hours ago

      It's about 700 lines of the worst Python ever. You do not want it. I would be too embarrassed to release it, honestly.

      It complied, but it was absolutely not fast or efficient. I aimed at compliance first, good code second, but never got to the second because of more human-oriented issues that killed the project.