Comment by mschuster91

mschuster91 13 hours ago

A global tarpit is the solution. It makes sense anyway, even without taking AI crawlers into account. Back when I had to implement that, I went the semi-manual route: parse the access log, and any IP address averaging more than X hits per second on /api gets a -j TARPIT rule in iptables [1].
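
A rough sketch of that log-parsing approach, for illustration only and not the script from the gist. It assumes the access log is in Common/Combined Log Format and that the TARPIT target (from xtables-addons) is available on the host. The function names, the `min_hits` guard, and the `/api` prefix default are hypothetical. It only builds the iptables commands as strings; it does not run them.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line format (Common/Combined Log Format), e.g.:
# 203.0.113.5 - - [17/Jan/2025:03:06:33 +0000] "GET /api/items HTTP/1.1" 200 512
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?:\S+) (?P<path>\S+)'
)


def offenders(lines, max_hits_per_sec, min_hits=10, prefix="/api"):
    """Return sorted IPs whose average hit rate on `prefix` exceeds the limit."""
    hits = defaultdict(list)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or not m.group("path").startswith(prefix):
            continue
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        hits[m.group("ip")].append(ts)

    flagged = []
    for ip, stamps in hits.items():
        # Hypothetical guard: ignore clients with too few hits to judge.
        if len(stamps) < min_hits:
            continue
        span = (max(stamps) - min(stamps)).total_seconds()
        # Treat anything shorter than one second as a one-second window.
        rate = len(stamps) / max(span, 1.0)
        if rate > max_hits_per_sec:
            flagged.append(ip)
    return sorted(flagged)


def tarpit_rules(ips):
    """Build (but do not execute) one iptables TARPIT rule per IP."""
    # TARPIT only applies to TCP, hence the explicit -p tcp.
    return [f"iptables -A INPUT -s {ip} -p tcp -j TARPIT" for ip in ips]
```

In practice you would feed it the log file, for example `offenders(open("access.log"), max_hits_per_sec=5)`, review the resulting rules, and only then apply them as root.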

I'm not sure how to implement it in the cloud, though; I've never needed it there.

[1] https://gist.github.com/flaviovs/103a0dbf62c67ff371ff75fc62f...

jks 13 hours ago

One such tarpit (Nepenthes) was just recently mentioned on Hacker News: https://news.ycombinator.com/item?id=42725147

Their site is down at the moment, but luckily they haven't stopped the Wayback Machine from crawling it: https://web.archive.org/web/20250117030633/https://zadzmo.or...

marcus0x62 12 hours ago

Quixotic[0] (my content obfuscator) includes a tarpit component, but for something like this, I think the main quixotic tool would be better - you run it against your content once, and it generates a pre-obfuscated version of it. It takes a lot less of your resources to serve than dynamically generating the tarpit links and content.

[0] https://marcusb.org/hacks/quixotic.html

kazinator 13 hours ago

How do you know their site is down? You probably just hit their tarpit. :)

bwfan123 13 hours ago

I would think public outcry by influencers on social media (such as this thread) is a better deterrent, since the tarpit is hard to scale. It also establishes a public data point and exhibit for future reference.

seethenerdz 4 hours ago

Don't we have intellectual property law for this, though?

idlewords 11 hours ago

This doesn't work with the kind of highly distributed crawling that is the problem now.
