Comment by armchairhacker 12 hours ago
I like the solution in this comment: https://news.ycombinator.com/item?id=42727510.
Put a link somewhere on your site that no human would visit, disallow it in robots.txt (under a wildcard, because apparently OpenAI’s crawler specifically ignores wildcards), and when an IP address visits the link, ban it for 24 hours.
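A minimal sketch of that trap, assuming a Flask app with an in-memory ban list. The /trap path and BAN_SECONDS are illustrative, the corresponding robots.txt would contain "Disallow: /trap" under "User-agent: *", and a real setup would probably ban at the firewall or reverse proxy rather than in the app:

    import time
    from flask import Flask, request, abort

    app = Flask(__name__)
    BAN_SECONDS = 24 * 60 * 60
    banned = {}  # ip -> ban expiry (unix time); a real setup would persist this

    @app.before_request
    def honeypot_and_ban():
        ip = request.remote_addr
        expiry = banned.get(ip)
        if expiry and time.time() < expiry:
            abort(403)  # still banned
        if request.path == "/trap":
            # only a crawler ignoring robots.txt ever requests this path
            banned[ip] = time.time() + BAN_SECONDS
            abort(403)

    @app.route("/")
    def index():
        # a link no human clicks: hidden, pointing at the disallowed path
        return '<a href="/trap" style="display:none">.</a>Hello'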
I had to deal with bot activity that used a huge address space, and I tried something very similar: when a condition confirming a bot was detected, I banned that IP for 24 hours. But due to the sheer number of IPs involved, this had no real impact on the abusive traffic.
My suggestion is to look very closely at the headers you receive (varnishlog is very nice for this). If you stare at them long enough, you might spot something that all those requests have in common which would let you easily identify them: a very specific and unusual combination of reported language and geolocation, the same outdated browser version, etc.
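A quick way to do that grouping once the headers are captured, as a sketch: assume the requests have been exported to a CSV (requests.csv and its column names are hypothetical), then tally header fingerprints and inspect the most common ones:

    import csv
    from collections import Counter

    # Hypothetical input: one row per request, with columns
    # "user_agent", "accept_language", "country".
    counter = Counter()
    with open("requests.csv", newline="") as f:
        for row in csv.DictReader(f):
            counter[(row["user_agent"], row["accept_language"], row["country"])] += 1

    # A bot fleet tends to show up as many IPs sharing one identity.
    for fingerprint, hits in counter.most_common(10):
        print(hits, fingerprint)

If one fingerprint dominates across many distinct IPs, that combination becomes the ban rule instead of the individual addresses.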