Comment by notpushkin 2 days ago

My favourite thing about Anubis is that (in the default configuration) it skips the actual challenge altogether if the User-Agent header is curl’s.

E.g. if you open this in a browser, you’ll get the challenge: https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4...

But if you run this, you get the page content straight away:

  curl https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4e6d8360c24d6b8434cbb8869b
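
Conversely, if you force a browser-style User-Agent, the challenge should come back, assuming the default policy keys on the Mozilla substring (as discussed in the replies):

  curl -A "Mozilla/5.0" https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4e6d8360c24d6b8434cbb8869b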

I’m pretty sure this gets abused by AI scrapers a lot. If you’re running Anubis, take a moment to configure it properly, or, better, put together something that’s less annoying for your visitors, like the OP did.
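
For example, a rule along these lines in botPolicies.yaml would cover curl-style clients, assuming the policy format from the Anubis repo (the rule name here is made up, and whether blocking curl outright is a good idea is exactly what the replies debate):

  bots:
    - name: curl-default-ua
      user_agent_regex: ^curl/
      action: DENY
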
rezonant 2 days ago

It only challenges user agents with Mozilla in their name, by design: user agents that say anything else are already identifiable. If Anubis makes the bots change their user agents, it has done its job, as that traffic can now be addressed directly.
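
A minimal sketch of that gating rule (not Anubis's actual code; the handler and responses are placeholders):

  package main

  import (
      "net/http"
      "strings"
  )

  func main() {
      http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
          // Browser-shaped traffic advertises "Mozilla/..." in its UA;
          // only that gets sent to the proof-of-work challenge.
          if strings.Contains(r.UserAgent(), "Mozilla") {
              http.Error(w, "challenge page would go here", http.StatusUnauthorized)
              return
          }
          // Everything else (curl, self-identified bots) already tells you
          // what it is and can be rate-limited or blocked directly.
          w.Write([]byte("page content\n"))
      })
      http.ListenAndServe(":8080", nil)
  }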

xena 2 days ago

This was a tactical decision I made in order to avoid breaking well-behaved automation that properly identifies itself. I have been mocked endlessly for it. There is no winning.

  • seba_dos1 2 days ago

    The winning condition does not need to consider people who write before they think.

  • ranger_danger a day ago

    How is a curl user-agent automatically well-behaved automation?

    • fragmede a day ago

      One assumes it is a human, running curl manually, from the command line on a system they're authorized to use. It's not wget -r.

      • ranger_danger 14 hours ago

        Sounds like the perfect opportunity for bots to use the curl user-agent. How do we know they're not already doing this?

        • fragmede 14 hours ago

          We don't, but now that we've talked about it publicly on the Internet, they're gonna start doing that. I'm sure some of them were doing it already, but now we've gone and told the rest.

seba_dos1 2 days ago

> I’m pretty sure this gets abused by AI scrapers a lot.

In practice, it hasn't been an issue for many months now, so I'm not sure why you're so sure. Disabling Anubis takes servers down; allowing curl bypass does not. What makes you assume that aggressive scrapers that don't want to identify themselves as bots will willingly identify themselves as bots in the first place?