Comment by vachina 11 hours ago

I’m surprised everyone else’s servers are struggling to handle a couple of bot scrapes.

I run a couple of public-facing websites on a NUC and it just… chugs along? This is also amidst the constant barrage of OSINT attempts at my IP.

TonyTrapp 11 hours ago

Depends on what you are hosting. I found that source code repository viewers in particular (OP mentions Gitea, but I have seen it with others as well) are really troublesome: every commit in your repository can potentially spawn dozens if not hundreds of new unique pages (diff against the previous version, diff against the current version, file history, file blame, etc.). Plus, many repo viewers seem to pull this information directly from the source repository without much caching involved. This is different from typical blogging or forum software, which is often designed to handle really huge websites and thus has strong caching support. Until now, nobody expected source code viewers to be popular enough for performance to be an issue, but with AI scrapers that is quickly changing.
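
A rough back-of-envelope sketch in Go of the combinatorics described above. Every number here is an assumption for illustration, not a measurement of any real repository or forge:

    // Hypothetical estimate of how many unique pages a repository viewer
    // can expose to a crawler that follows every link.
    package main

    import "fmt"

    func main() {
        commits := 5000      // assumed commit count for a mid-sized repository
        filesPerCommit := 20 // assumed files touched per commit, on average
        viewsPerFile := 5    // diff vs. parent, diff vs. current, file at revision, history, blame

        perCommitPages := 1 + filesPerCommit*viewsPerFile // commit page plus per-file views
        totalPages := commits * perCommitPages

        // With these assumptions: 5000 * (1 + 20*5) = 505,000 unique pages,
        // each typically rendered on demand from the underlying git data.
        fmt.Printf("~%d unique, mostly uncached pages for one repository\n", totalPages)
    }

Each page is cheap on its own, but a crawler that enumerates all of them pays that rendering cost hundreds of thousands of times, which is where the pain comes from.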

xena 11 hours ago

Gitea in particular is a worst case for this. It shows details about every file at every version and every commit if you click enough. The bots click every link. That fixed cost adds up when hundreds of IPs are each at a different stage of clicking through every link.
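
A minimal sketch of the caching point raised above, assuming a small Go reverse proxy you run in front of the forge (the backend address, the in-memory cache, and its unbounded size are all simplifications for illustration; this is not how Gitea itself works):

    package main

    import (
        "net/http"
        "net/http/httputil"
        "net/url"
        "sync"
    )

    // recorder tees the backend response into memory while streaming it to the client.
    type recorder struct {
        http.ResponseWriter
        status int
        body   []byte
    }

    func (r *recorder) WriteHeader(code int) { r.status = code; r.ResponseWriter.WriteHeader(code) }
    func (r *recorder) Write(p []byte) (int, error) {
        r.body = append(r.body, p...)
        return r.ResponseWriter.Write(p)
    }

    type cachedPage struct {
        status int
        body   []byte
    }

    var (
        mu    sync.Mutex
        cache = map[string]cachedPage{} // keyed by request URI; no expiry or size bound here
    )

    func main() {
        backend, _ := url.Parse("http://127.0.0.1:3000") // assumed forge address
        proxy := httputil.NewSingleHostReverseProxy(backend)

        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            if r.Method != http.MethodGet {
                proxy.ServeHTTP(w, r)
                return
            }
            key := r.URL.RequestURI()

            mu.Lock()
            page, ok := cache[key]
            mu.Unlock()
            if ok {
                // Repeat visits to the same commit/diff/blame URL, even from
                // different crawler IPs, skip the expensive render entirely.
                // (Response headers are not cached here, for brevity.)
                w.WriteHeader(page.status)
                w.Write(page.body)
                return
            }

            rec := &recorder{ResponseWriter: w, status: http.StatusOK}
            proxy.ServeHTTP(rec, r)

            mu.Lock()
            cache[key] = cachedPage{status: rec.status, body: rec.body}
            mu.Unlock()
        })

        http.ListenAndServe(":8080", nil)
    }

The point is only that the fixed per-page cost is paid once per URL instead of once per request; whether that helps much on a forge with hundreds of thousands of mostly-unique URLs is a separate question.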

cyrnel 11 hours ago

It seems some of these bots behave abusively on sites with lots of links (like git forges). I have some sites receiving 200 requests per day from these AI bots and others receiving 1 million per day, depending on the design of the site.