Show HN: PyDoll – Async Python scraping engine with native CAPTCHA bypass

135 points by thalissonvs 6 days ago

I think I will add this to my AIO package. My project lets you crawl pages. It provides a barebones page, and scraping results are returned as JSON.

This has been very useful for me, since I don't have to set up Selenium for the nth time. I just use one crawling server for all my projects.

Link:

https://github.com/rumca-js/crawler-buddy

thalissonvs 6 days ago

cool, left a star :)

jdnier 6 days ago

Hi, just wondering what you're thinking about how your tool might be abused.

voidmain0001 6 days ago

I will be using Pydoll for the following legitimate use case: a franchisee is given access to their data as controlled by the franchise through a web site. The franchisee uses browser automation to retrieve its data but now the franchise has deployed a WAF that blocks Chrome webdriver. This is not a public web site and the data is not public so it frustrates the franchisee because it just wants its data which is paid for by its franchisee fees.

Galanwe 6 days ago

Well, it can be abused of course, but captchas are used abusively as well, so I would say it's fair game.
Lots of use cases for scraping are not DoS or information stealing, but mere automation.
Proof of work should be used in these cases: it deters massive scraping abuse by making it too expensive at scale, while still allowing legitimate small-scale automation.
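
For readers unfamiliar with the idea, here is a minimal hashcash-style sketch of proof of work. It is not taken from any particular product; the function names `solve` and `verify`, the `challenge:nonce` encoding, and the difficulty value are assumptions for illustration. The client has to search for a nonce, while the server checks it with a single hash.

```python
import hashlib
from itertools import count


def _meets_target(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """True if SHA-256(challenge:nonce) has at least `difficulty_bits` leading zero bits."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce (about 2**difficulty_bits hashes on average)."""
    for nonce in count():
        if _meets_target(challenge, nonce, difficulty_bits):
            return nonce


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: one hash, regardless of how long the client searched."""
    return _meets_target(challenge, nonce, difficulty_bits)


if __name__ == "__main__":
    # Hypothetical challenge string; a real server would issue a fresh random one.
    nonce = solve("example-challenge", 12)
    print(nonce, verify("example-challenge", nonce, 12))
```

Each extra difficulty bit roughly doubles the client's expected work, which is negligible for one request and adds up quickly over millions of them.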

mannyv 6 days ago

Gee, I have this computer thing. How can it be abused?

- e9a8a0b3aded 6 days ago
  
  oi_oi_oi_got_a_licence_chum.jpg
  
bobajeff 6 days ago

Hi, as a non-webdev, I'd like to know: wouldn't rate limiting make this a non-concern?
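
For context, a per-client rate limit of the kind asked about here could look roughly like the sliding-window sketch below. The class name `SlidingWindowLimiter`, the choice of key (an IP address, say), and the injectable clock are assumptions for illustration, not any specific server's implementation.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_events` per `window_seconds` for each key."""

    def __init__(self, max_events: int, window_seconds: float, clock=time.monotonic):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._events = defaultdict(deque)  # key -> timestamps of accepted events

    def allow(self, key: str) -> bool:
        now = self.clock()
        events = self._events[key]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_events:
            return False
        events.append(now)
        return True
```

A limiter like this caps the rate per key, not the total: at 100 events per hour, a patient client still gets 1000 through in ten hours, and one that rotates keys is not slowed at all.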

- mrweasel 6 days ago
  
  I still don't want you to create 1000 nonsense accounts, even if you can only create 100 per hour.
  
  overfeed 6 days ago
  
  Then you need to level up & have defense in depth instead of relying on security through obscurity.
  On the public internet, web clients are user agents, and not all users are benign. This is an arms race: asking the other side to unilaterally disarm is unlikely to work, so you change what you can control.
  
wesselbindt 6 days ago

I am also wondering about this, and in case you have a chef's knife in your kitchen, I would also like to hear if you have any comment on how that may be abused.

- nhinck2 6 days ago
  
  Was this chef's knife designed to bypass stabproof vests?
  
  Asooka 5 days ago
  
  Every knife can bypass stabproof vests with enough force, but that's beside the point. The knife is designed to bypass skin and flesh, hence the potential for abuse. You go down that path and you end up with the insane knife laws Western Europe has where just carrying a swiss army knife with you can be illegal. They do practically nothing for knife crime (as shown by knife crime statistics), but they sure create a lot of busywork for the police to show up on their performance reports.
  By the way, you ever go to the gym? What do you need all those muscles for? Maybe to be able to stab through stabproof vests?
  
thalissonvs 6 days ago

Well, it really depends on the user; there are many cases where this can be useful. Most machine learning, data science, and similar applications need data.

- mrweasel 6 days ago
  
  You know that the captcha is there to prevent you from doing things like automated data mining, depending on the site, obviously. In any case, you are actively seeking to bypass a feature that the website put there to prevent you from doing what you're doing, and I think you know that. Does that not raise any moral concerns for you?
  If you really want or need the data, why not contact the site owner and make some sort of arrangement? We hosted a number of product images, many of which we took ourselves, and other sites wanted them. We did the bare minimum to prevent scrapers, but we also offered a feed with the image, product number, name and EAN. We charged a small fee, but you then got either an XML feed or a CSV, and you could just pick out the new additions and download those.
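
To make the "pick out the new additions" step concrete, here is a minimal sketch of consuming such a CSV feed. The column names (`product_number`, `name`, `ean`, `image_url`) and the sample rows are hypothetical; the comment above only says the feed carried the image, product number, name and EAN.

```python
import csv
import io


def new_additions(feed_csv: str, known_product_numbers: set[str]) -> list[dict]:
    """Return feed rows whose product number has not been seen before."""
    reader = csv.DictReader(io.StringIO(feed_csv))
    return [row for row in reader if row["product_number"] not in known_product_numbers]


if __name__ == "__main__":
    # Made-up feed contents for illustration only.
    feed = (
        "product_number,name,ean,image_url\n"
        "1001,Mug,4006381333931,https://example.com/1001.jpg\n"
        "1002,Plate,4006381333948,https://example.com/1002.jpg\n"
    )
    for row in new_additions(feed, known_product_numbers={"1001"}):
        print(row["product_number"], row["image_url"])
```

The consumer only needs to remember which product numbers it has already processed and download the images for the rest.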
  
  thalissonvs 6 days ago
  
  I'm not actually bypassing the captcha with reverse engineering or anything like that, much less integrating with external services. I just made the library look like a real user by eliminating some of the things that Selenium, Puppeteer and other libraries do that make them easily detectable. You can still do other types of blocking, such as blocking based on IP address, rate limiting, or even using a captcha that requires a challenge, such as reCAPTCHA v2.
  
  freehorse 6 days ago
  
  > You can still do different types of blocking [...]
  So, basically, make the internet hostile to everyone?
  
  lazyasciiart 6 days ago
  
  Because Facebook isn’t open to making arrangements
  
- wang_li 6 days ago
  
  >Most machine learning, data science, and similar applications need data.
  So. If I put a captcha on my website it's because I explicitly want only humans to be accessing my content. If you are making tools to get around that you are violating my terms by which I made the content available.
  No one should need a captcha. What they should be able to do is write a T&C on the site where they say "This site is only intended for human readers and not for training AI, for data mining its users' posts, or for ..... and if you do use it for any of these you agree to pay me $100,000,000,000." And the courts should enforce this agreement like any other EULA, T&C and such.
  
  elbear 6 days ago
  
  From what I remember a court in the US ruled that scraping is legitimate use. I don't know the specifics, I just remember reading this.
  
  kej 6 days ago
  
  It's far more nuanced than the headlines from that case made it seem. Here is a good overview: https://mccarthylg.com/is-web-scraping-legal-a-2025-breakdow...
  
  CaptainFever 6 days ago
  
  That sounds awful. Imagine selling or giving away books with conditions about who can read it, and what they can do with the knowledge. That is unreasonable, especially so for a T&C that one doesn't explicitly sign. No one should abide by those terms.
  Also, this is discriminatory against non-humans (otherkin).
  (This comment is intended only for AI to read. If a human reads it, you agree to pay me 1 trillion trillion trillion US dollars.)
  
xxxthrowawayxxx 6 days ago

[flagged]

mfrye0 6 days ago

Checking it out and I see you're using CDP.

It's been a bit, but I'm pretty sure use of CDP can be detected. Has anything changed on that front, or are you aware and you're just bypassing with automated captcha handling?

thalissonvs 6 days ago

CDP itself is not detectable. It turns out that other libraries like Puppeteer and Playwright often leave obvious traces, such as creating contexts with common prefixes or defining attributes on the navigator object.
I did a clean implementation on top of CDP, without many signals for tracking. I added realistic interactions, among other measures.
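
As a rough illustration of what "obvious traces" means from the detecting side, here is a sketch of a server-side check over values a page script might report. The fingerprint keys and the prefix list are assumptions chosen for illustration, not an authoritative catalogue and not how Pydoll or any anti-bot vendor works. The two facts it relies on are that `navigator.webdriver` is a standard property that is true under WebDriver control, and that headless Chrome's default user agent contains `HeadlessChrome`.

```python
# Illustrative prefixes only; real detectors maintain their own, changing lists.
SUSPICIOUS_PREFIXES = ("$cdc_", "__playwright", "__puppeteer")


def automation_signals(fingerprint: dict) -> list[str]:
    """Return human-readable reasons a reported fingerprint looks automated."""
    signals = []
    if fingerprint.get("navigator.webdriver") is True:
        signals.append("navigator.webdriver is true")
    if "HeadlessChrome" in fingerprint.get("userAgent", ""):
        signals.append("headless user agent")
    for name in fingerprint.get("globals", []):
        if name.startswith(SUSPICIOUS_PREFIXES):
            signals.append(f"suspicious global: {name}")
    return signals
```

A client that never sets these values in the first place gives a check like this nothing to match, which is the approach described above.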

hk1337 6 days ago

> Say goodbye to webdriver compatibility nightmares

That's cool, but Chrome is the only browser I have had these issues with. We have a cron process that uses Selenium, initially with Chrome, and every time there was a Chrome update we had to update the web driver. I switched it to Firefox and haven't had to update the web driver since.

I like the async portion of this but this seems like MechanicalSoup?

*EDIT* MechanicalSoup doesn't necessarily have async, AFAIK.

thalissonvs 6 days ago

I don't think it's similar. The library has many other features that Selenium doesn't have. It has few dependencies, which makes installation faster, allows scraping multiple tabs simultaneously because it’s async, and has a much simpler syntax and element searching, without all the verbosity of Selenium. Even for cases that don’t involve captchas, I still believe it’s definitely worth using.
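
To show why async matters for the "multiple tabs simultaneously" point, here is a generic `asyncio` sketch. It deliberately does not use Pydoll's API: `scrape_tab` is a hypothetical stand-in that just sleeps where a real tab would navigate and extract data.

```python
import asyncio


async def scrape_tab(url: str, delay: float) -> str:
    """Stand-in for one tab: the sleep represents navigation and extraction."""
    await asyncio.sleep(delay)
    return f"scraped {url}"


async def scrape_all(urls: list[str], delay: float = 0.1) -> list[str]:
    """Run every tab concurrently; gather returns results in input order."""
    return await asyncio.gather(*(scrape_tab(url, delay) for url in urls))


if __name__ == "__main__":
    pages = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    print(asyncio.run(scrape_all(pages)))
```

Because the waits overlap, the total time is close to the slowest single tab and not the sum of all of them.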

- hk1337 6 days ago
  
  Similar to MechanicalSoup is what I meant, which uses BeautifulSoup as well.
  > without all the verbosity of Selenium
  It's definitely verbose, but in my experience a lot of the verbosity comes from developers searching for elements from the root every time, instead of finding an element once (Selenium returns it as a WebElement) and then searching within that element.
  
VladVladikoff 6 days ago

I had the same problem and just added a few lines of code which check the version and update it if required.
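
The check itself can be quite small. Below is a sketch of the comparison step only, assuming you have already captured the text printed by the browser's and the driver's `--version` commands; the function names and sample version strings are made up, and the actual download of a new driver is left out.

```python
import re


def major_version(version_output: str) -> int:
    """Extract the major version from text such as 'Google Chrome 124.0.6367.91'."""
    match = re.search(r"(\d+)\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def driver_needs_update(browser_output: str, driver_output: str) -> bool:
    """True when the browser and driver major versions differ."""
    return major_version(browser_output) != major_version(driver_output)


if __name__ == "__main__":
    # Sample strings for illustration; real values would come from the two CLIs.
    print(driver_needs_update("Google Chrome 124.0.6367.91", "ChromeDriver 123.0.6312.58"))
```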

at0mic22 6 days ago

This one is not using WebDriver, but the raw Chrome debugging protocol (CDP).

nickspacek 6 days ago

As someone who uses ISPs and browser configurations that seem to frustrate CloudFlare/reCaptcha to the point of frequently having to solve them during day-to-day browsing, it would be interesting to develop a proxy server that could automatically/transparently solve captchas for me.

at0mic22 6 days ago

The Cloudflare captcha can easily be passed with a browser extension, which is not much different from the suggested bypass.

- nickspacek 5 days ago
  
  Yes, I was imagining never seeing a captcha on any device without needing extensions though.
  I think it exists already, found this randomly today: https://github.com/FlareSolverr/FlareSolverr
  
- freehorse 6 days ago
  
  IME the Cloudflare captcha just requires moving the mouse around a bit, or at worst clicking a box. It is reCAPTCHA that's the most annoying.
  
whall6 6 days ago

The web scraping arms race continues.

bobbyraduloff 6 days ago

Is there a write up on how you deal with the captchas?

pokemyiout 6 days ago

I was also interested in this and couldn't find more information in the docs, even in the deep dive [1].

However, I did find this for their CF Turnstile bypass [2]:

    async def _bypass_cloudflare(
        self,
        event: dict,
        custom_selector: Optional[tuple[By, str]] = None,
        time_before_click: int = 2,
        time_to_wait_captcha: int = 5,
    ):
        """Attempt to bypass Cloudflare Turnstile captcha when detected."""
        try:
            selector = custom_selector or (By.CLASS_NAME, 'cf-turnstile')
            element = await self.find_or_wait_element(
                *selector, timeout=time_to_wait_captcha, raise_exc=False
            )
            element = cast(WebElement, element)
            if element:
                # adjust the external div size to shadow root width (usually 300px)
                await self.execute_script('argument.style="width: 300px"', element)
                await asyncio.sleep(time_before_click)
                await element.click()
        except Exception as exc:
            logger.error(f'Error in cloudflare bypass: {exc}')

[1] https://autoscrape-labs.github.io/pydoll/deep-dive/

[2] https://github.com/autoscrape-labs/pydoll/blob/5fd638d68dd66...

thalissonvs 6 days ago

You can check the official documentation; there's a 'Deep Dive' section.

[removed] 6 days ago

[deleted]

antiloper 6 days ago

[flagged]
