Comment by krick 2 days ago

Does anyone know a solid (not SaaS, obviously) solution for scraping these days? It's getting pretty hard to get around even pretty harmless cases (like bulk-downloading MY OWN gpx tracks from some fucking fitness-watch servers), with all these JS tricks, countless redirects, Cloudflare and so on. Even if you already have the cookies, getting a non-403 response to any request is very much not trivial. I feel like it's time to upgrade my usual approach of python requests + libxml, but I don't know if there's a library/tool that solves some of these problems for you.

_boffin_ 2 days ago

- launch chrome with loading of specified data dir.

- connect to it remotely

- ghost cursor and friends

- save cookies and friends to data dir

- run from residential ip

- if served a captcha or a Cloudflare challenge, hand it off to a solver and then route back.

- mobile ip if possible

…can’t go into any more specifics than that

…I forget the site right now, but there’s a guy who gives a good rundown of this stuff. I’ll see if I can find it.
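A minimal sketch of the first two steps above, assuming Playwright as the remote-control client; the Chrome binary name, port 9222, and the profile path are placeholders:

```python
def chrome_launch_cmd(profile_dir, port=9222):
    # A real (non-automated) Chrome launched with a persistent data dir
    # and a remote-debugging port we can attach to afterwards; binary
    # name and profile path are examples.
    return [
        "google-chrome",
        f"--remote-debugging-port={port}",
        f"--user-data-dir={profile_dir}",
    ]


def attach_and_fetch(url, port=9222):
    # Connect to the already-running browser over CDP rather than
    # launching a fresh automated instance, which is easier to detect.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(f"http://localhost:{port}")
        context = browser.contexts[0]  # reuse the profile's cookies/session
        page = context.pages[0] if context.pages else context.new_page()
        page.goto(url)
        return page.content()


# Usage (needs Chrome on PATH and playwright installed):
#   subprocess.Popen(chrome_launch_cmd("/home/me/scrape-profile"))
#   html = attach_and_fetch("https://example.com")
```

Because the profile dir persists, cookies and the Cloudflare clearance saved in it survive across runs.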

djbusby 2 days ago

I use a few things. First, I scrape from my home IP at very low rates. I drive either FF or Chrome using an extension. Sometimes I have to start the session manually (not a robot) and then engage the crawler. Sometimes, site-dependent, I can run headless or use Puppeteer. But the extension in a "normal" browser that goes slow has been working great for me.

It seems that some sites can detect when you're using a headless or WebDriver-enabled profile.

Sometimes I'm through a VPN.

The automation is the easy part.
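The "very low rates" part of this approach can be sketched as a jittered rate limiter wrapped around whatever actually fetches each page; the delay range and the `fetch` callable here are placeholders:

```python
import random
import time


def polite_crawl(urls, fetch, min_delay=20.0, max_delay=60.0):
    # Visit URLs at a very low rate with randomized delays, so the
    # traffic pattern looks less like a crawler. `fetch` is whatever
    # drives your browser or extension for a single URL.
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(random.uniform(min_delay, max_delay))
    return results
```

The jitter matters more than the absolute rate: fixed-interval requests are an easy bot signature.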

_boffin_ 2 days ago

Heads up, requests adds some extra headers on send.

One thing I’ve also been doing recently, when I find a site I just want an API for, is executing a curl command via Python. I populate the curl from Chrome’s network tab. I also have a purpose-built browser extension that saves cookies to a LAN Postgres DB, and the script then uses those values.

Could probably even do more by automating the browser to navigate there on failure.
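A sketch of that setup, with the Postgres side reduced to plain (name, value) rows and every URL and cookie name made up:

```python
import subprocess


def cookie_header(rows):
    # (name, value) cookie rows -- e.g. pulled from the LAN Postgres
    # table the browser extension keeps fresh -- joined into a single
    # Cookie header value.
    return "; ".join(f"{name}={value}" for name, value in rows)


def replay_curl(curl_args, cookie):
    # Replay a request copied from Chrome's network tab ("Copy as cURL"),
    # swapping in the current session cookie.
    cmd = curl_args + ["-s", "-H", f"Cookie: {cookie}"]
    return subprocess.run(cmd, capture_output=True, text=True).stdout


# Usage (endpoint and cookie names are hypothetical):
#   hdr = cookie_header([("sessionid", "abc123"), ("csrftoken", "xyz")])
#   body = replay_curl(["curl", "https://example.com/api/tracks"], hdr)
```

Shelling out to real curl sidesteps the extra headers python-requests adds, since the request on the wire is byte-for-byte what you copied.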

iansinnott 2 days ago

In short: Don't use HTML endpoints, use APIs.

This is not always possible, but if the product in question has a mobile app or a wearable talking to a server, you might be able to utilize the same API it's using:

- intercept requests from the device

- find relevant auth headers/cookies/params

- use that auth to access the API
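Once the app's traffic has been intercepted (e.g. with a proxy like mitmproxy), replaying its auth is mostly copying headers. Everything below (endpoint URL, token, User-Agent string) is a hypothetical placeholder:

```python
from urllib.request import Request


def app_api_request(url, token):
    # Reuse the mobile app's own auth: the bearer token and UA string
    # stand in for values captured with an intercepting proxy sitting
    # between the device and the vendor's server.
    return Request(url, headers={
        "Authorization": f"Bearer {token}",
        "User-Agent": "FitnessApp/5.2 (Android 14)",  # mimic the app
    })


# Usage (endpoint is made up):
#   from urllib.request import urlopen
#   resp = urlopen(app_api_request("https://api.example.com/v1/activities", token))
```

The mobile API is usually far less defended than the website: no Cloudflare JS challenges, since a watch can't run them either.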

bobbylarrybobby 2 days ago

On a Mac, I use Keyboard Maestro, which can interact with the UI (which is usually stable enough to form an interface of sorts): wait for a graphic to appear on screen, then click it, simulate keystrokes, run JavaScript on the current page and get a result back... It looks very human to a website in a browser, and is nearly as easy to write as Python.

whilenot-dev 2 days ago

If requests gives you 403 headaches, just pass the session cookies to a Playwright instance and you should be good to go. Just did that for scraping the SAP Software Download Center.
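A sketch of that handoff; the conversion helper is mine, not Playwright API, and it produces the dicts (name/value/domain/path) that Playwright's `BrowserContext.add_cookies()` expects:

```python
def to_playwright_cookies(cookies, domain):
    # Reshape simple name->value cookies (e.g. from a requests
    # session's cookies.get_dict()) into the dict format Playwright's
    # context.add_cookies() accepts.
    return [
        {"name": name, "value": value, "domain": domain, "path": "/"}
        for name, value in cookies.items()
    ]


# Usage (sketch; assumes requests and playwright are installed):
#   s = requests.Session()
#   ...log in / collect cookies with s...
#   ctx = browser.new_context()
#   ctx.add_cookies(to_playwright_cookies(s.cookies.get_dict(), ".example.com"))
#   ctx.new_page().goto("https://example.com/downloads")
```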

lambdaba 2 days ago

I've found selenium with undetected-chromedriver to work best.

  • unsupp0rted 2 days ago

    Doesn't get around Cloudflare's anti-bot

    • lambdaba 2 days ago

      Ah, ok. I found it worked with YouTube, unlike regular chromedriver; I didn't encounter Cloudflare when I used it.