Comment by wraptile 3 days ago

I'm a scraper developer, and Anubis would have worked 10-20 years ago, but now all broad scrapers run on a real headless browser with full cookie support, which costs relatively little in compute. I'd be surprised if LLM bots used anything else, given that they already have the compute and the engineers available.

That being said, one point here is very correct: by far the best way to resist broad crawlers is a _custom_ anti-bot that could be as simple as "click your mouse 3 times", because handling something custom is very difficult at broad scale. It took the author just a few minutes to solve this, but for someone like Perplexity it would take hours of engineering and maintenance to implement a solution for each custom implementation, which is likely just not worth it.

You can actually see this in real life if you google web scraping services and look at which targets they claim to bypass: all of them bypass generic anti-bots like Cloudflare, Akamai, etc., but struggle with custom and rare stuff like Chinese websites or small forums, because the scraping market is a market like any other and high-value problems are solved first. So becoming a low-value problem is a very easy way to avoid confrontation.
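
For illustration, here is a minimal server-side sketch of the "click your mouse 3 times" idea in Python. Everything in it is an assumption for the sake of the example (the names `issue_pass_token` and `check_pass_token`, the one-hour lifetime, the HMAC-signed token format); it is not how Anubis or any particular site does it. It also trusts the click count reported by the page, so it only filters crawlers that never bother to handle this specific site, which is the whole point of a custom check.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

SECRET_KEY = secrets.token_bytes(32)  # hypothetical per-deployment secret
REQUIRED_CLICKS = 3                   # the "click your mouse 3 times" rule
TOKEN_TTL_SECONDS = 3600              # assumed lifetime of a pass token


def _sign(payload: str) -> str:
    """Return a hex HMAC-SHA256 signature for the payload."""
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_pass_token(session_id: str, click_count: int,
                     now: Optional[float] = None) -> Optional[str]:
    """Hand out a signed token once the visitor has clicked enough times."""
    if click_count < REQUIRED_CLICKS:
        return None
    current = time.time() if now is None else now
    payload = f"{session_id}:{int(current + TOKEN_TTL_SECONDS)}"
    return f"{payload}:{_sign(payload)}"


def check_pass_token(token: str, session_id: str,
                     now: Optional[float] = None) -> bool:
    """Accept the token only if it is intact, unexpired and for this session."""
    try:
        token_session, expires_text, signature = token.rsplit(":", 2)
        expires = int(expires_text)
    except ValueError:
        return False
    expected = _sign(f"{token_session}:{expires_text}")
    current = time.time() if now is None else now
    return (
        hmac.compare_digest(signature.encode(), expected.encode())
        and token_session == session_id
        and current < expires
    )
```

A page would call `issue_pass_token` after its click handler fires three times, store the result in a cookie, and run `check_pass_token` on later requests.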

jandrese 3 days ago

> That being said, one point here is very correct: by far the best way to resist broad crawlers is a _custom_ anti-bot that could be as simple as "click your mouse 3 times", because handling something custom is very difficult at broad scale.

Isn't this what Microsoft is trying to do with their sliding-puzzle-piece and "choose the closest match" type systems?

Also, if you come in on a mobile browser, it could ask you to lay your phone flat and then shake it up and down for a second, or something similar that would be a challenge for a datacenter bot pretending to be a phone.

DanielHB 3 days ago

How do you bypass Cloudflare? I do some light scraping for some personal stuff, but I can't figure out how to bypass it. Do you randomize IPs using several VPNs at the same time?

I usually just sit there on my phone pressing the "I am not a robot" box when it triggers.

wraptile 2 days ago

It's still pretty hard to bypass it with open source solutions. To bypass CF you need:
- an automated browser that doesn't leak the fact it's being automated
- ability to fake the browser fingerprint (e.g. Linux is heavily penalized)
- residential or mobile proxies (for small scale your home IP is probably good enough)
- deployment environment that isn't leaked to the browser
- realistic scrape pattern and header configuration (header order, referer, prewalk some pages with cookies etc.)

This is really hard to do at scale, but for small personal scripts you can get reasonable results with flavor-of-the-month Playwright forks or tools like nodriver on GitHub, or dedicated tools like FlareSolverr. Still, I'd just find a web scraping API with a low entry price, drop $15 a month, and avoid this chase, because it can be really time-consuming.

If you're really on a budget: most of them offer 1,000 credits for free, which gets you about 100 pages a month per service on average, and you can sign up for 10 of them since they all mostly function the same.
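
Spelling out the free-tier arithmetic as a quick sketch: the 1,000 credits, 100 pages and 10 services come from the figures above, while the per-page cost is only what those averages imply (variable names are made up for the example).

```python
FREE_CREDITS_PER_SERVICE = 1_000  # free credits per service, as stated above
AVG_PAGES_PER_SERVICE = 100       # average pages those credits buy, as stated above
SERVICES = 10                     # number of services signed up for

# Implied average price of one page in credits.
credits_per_page = FREE_CREDITS_PER_SERVICE / AVG_PAGES_PER_SERVICE

# Total free pages per month across all the services.
total_pages_per_month = AVG_PAGES_PER_SERVICE * SERVICES

print(f"{credits_per_page:.0f} credits per page on average")
print(f"{total_pages_per_month} free pages a month across {SERVICES} services")
```

So roughly 10 credits per page, and about 1,000 free pages a month if you spread the work over all ten.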

hinach4n 3 days ago

I believe you would usually bypass it by using residential IPs / proxies?

- DanielHB 3 days ago
  
  I run it through my home network and I'm still triggering it. I add 2s delays between page loads and it still triggers.
  
  jijijijij 3 days ago
  
  Well, if that's true... I am so sorry to tell you this, it looks like you are in fact a robot.
  
1gn15 2 days ago

I use Camoufox for the browser and "playwright-captcha" for the CAPTCHA solving action. It's not fully reliable but it works.

Gander5739 3 days ago

FlareSolverr can bypass it.

buckle8017 3 days ago

Ironically, by running Cloudflare WARP.

miki123211 3 days ago

This only works if you're a low-value site (which admittedly most sites are).

hahn-kev 3 days ago

Bot blocking through obscurity

lbhdc 3 days ago

That's really the only option available here, right? The goal is to keep sites low friction for end users while stopping bots. Requiring an account with some moderation would stop the majority of bots, but it would add a lot of friction for your human users.

- brookst 3 days ago
  
  The other option is proof of work. Make clients use JS to do expensive calculations that aren’t a big deal for single clients, but get expensive at scale. Not ideal, but another tool to potentially use.
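
A rough hashcash-style sketch of the proof-of-work idea, in Python so both sides fit in one snippet. In practice the solving half would run as JS in the browser, as described above. The function names, the `challenge:nonce` string format and the difficulty values are all assumptions for the example, not Anubis's actual scheme.

```python
import hashlib
import secrets


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def issue_challenge() -> str:
    """Server side: hand the client a fresh random challenge string."""
    return secrets.token_hex(16)


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce, about 2**difficulty hashes on average."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: checking a submitted nonce costs a single hash."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty
```

For example, `verify(c, solve(c, 16), 16)` holds for any challenge `c`, and each extra bit of difficulty roughly doubles the client's expected work while the server's check stays at one hash. That asymmetry is what makes it cheap for a single visitor and expensive at crawl scale.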
  
tovej 3 days ago

I like it, make the bot developers play whack-a-mole.
Of course, you're going to have to verify each custom puzzle, aren't you?

sam0x17 3 days ago

> It took the author just a few minutes to solve this, but for someone like Perplexity it would take hours of engineering and maintenance to implement a solution for each custom implementation, which is likely just not worth it.

These are trivial for an AI agent to solve though, even with very dumb, watered-down models.

andai 3 days ago

You can also generate custom solutions at scale with LLMs. So each user could get a different CAPTCHA.

josh-sematic 3 days ago

At that point you’re probably spending more money blocking the scrapers than you would spend just letting them through.

- lbhdc 3 days ago
  
  That seems like it would make bot-blocking SaaS (like Cloudflare or TollBit) more attractive, because it could amortize that effort/cost across many clients.
  