Comment by krick 2 days ago

Does anyone know a solid (not SaaS, obviously) solution for scraping these days? It's getting pretty hard to get around even pretty harmless cases (like bulk-downloading MY OWN gpx tracks from some fucking fitness-watch servers), with all these JS tricks, countless redirects, Cloudflare and so on. Even if you already have the cookies, getting a non-403 response to any request is very much not trivial. I feel like it's time to upgrade my usual approach of python requests + libxml, but I don't know if there's a library/tool that solves some of these problems for you.

_boffin_ 2 days ago

- launch chrome with loading of specified data dir.

- connect to it remotely

- ghost cursor and friends

- save cookies and friends to data dir

- run from residential ip

- if served a captcha or a Cloudflare challenge, hand it off to a solver and then route back.

- mobile ip if possible

…can’t go into any more specifics than that

…I forget the site right now, but there’s a guy who gives a good rundown of this stuff. I’ll see if I can find it.
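A minimal sketch of the first two steps above, assuming Playwright as the remote-control client; the Chrome binary name, port 9222, and the profile path are placeholders:

```python
def chrome_launch_cmd(profile_dir, port=9222):
    # A real (non-automated) Chrome launched with a persistent data dir
    # and a remote-debugging port we can attach to afterwards; binary
    # name and profile path are examples.
    return [
        "google-chrome",
        f"--remote-debugging-port={port}",
        f"--user-data-dir={profile_dir}",
    ]


def attach_and_fetch(url, port=9222):
    # Connect to the already-running browser over CDP rather than
    # launching a fresh automated instance, which is easier to detect.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(f"http://localhost:{port}")
        context = browser.contexts[0]  # reuse the profile's cookies/session
        page = context.pages[0] if context.pages else context.new_page()
        page.goto(url)
        return page.content()


# Usage (needs Chrome on PATH and playwright installed):
#   subprocess.Popen(chrome_launch_cmd("/home/me/scrape-profile"))
#   html = attach_and_fetch("https://example.com")
```

Because the profile dir persists, cookies and the Cloudflare clearance saved in it survive across runs.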

djbusby 2 days ago

I use a few things. First, I scrape from my home IP at very low rates. I drive either FF or Chrome using an extension. Sometimes I have to start the session manually (not a robot) and then engage the crawler. Sometimes, site-dependent, I can run headless or use Puppeteer. But the extension in a "normal" browser that goes slow has been working great for me.

It seems that some sites can detect when you're using a headless or WebDriver-enabled profile.

Sometimes I'm through a VPN.

The automation is the easy part.
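The "very low rates" part of this approach can be sketched as a jittered rate limiter wrapped around whatever actually fetches each page; the delay range and the `fetch` callable here are placeholders:

```python
import random
import time


def polite_crawl(urls, fetch, min_delay=20.0, max_delay=60.0):
    # Visit URLs at a very low rate with randomized delays, so the
    # traffic pattern looks less like a crawler. `fetch` is whatever
    # drives your browser or extension for a single URL.
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(random.uniform(min_delay, max_delay))
    return results
```

The jitter matters more than the absolute rate: fixed-interval requests are an easy bot signature.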

_boffin_ 2 days ago

Heads up, requests adds some extra headers on send.

One thing I’ve also been doing recently, when I find a site I just want an API for, is executing a curl command via Python. I populate the curl from Chrome’s network tab. I also have a purpose-built browser extension that saves cookies to a LAN Postgres DB, and the script then uses those values.

Could probably even do more by automating the browser to navigate there on failure.
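A sketch of that setup, with the Postgres side reduced to plain (name, value) rows and every URL and cookie name made up:

```python
import subprocess


def cookie_header(rows):
    # (name, value) cookie rows -- e.g. pulled from the LAN Postgres
    # table the browser extension keeps fresh -- joined into a single
    # Cookie header value.
    return "; ".join(f"{name}={value}" for name, value in rows)


def replay_curl(curl_args, cookie):
    # Replay a request copied from Chrome's network tab ("Copy as cURL"),
    # swapping in the current session cookie.
    cmd = curl_args + ["-s", "-H", f"Cookie: {cookie}"]
    return subprocess.run(cmd, capture_output=True, text=True).stdout


# Usage (endpoint and cookie names are hypothetical):
#   hdr = cookie_header([("sessionid", "abc123"), ("csrftoken", "xyz")])
#   body = replay_curl(["curl", "https://example.com/api/tracks"], hdr)
```

Shelling out to real curl sidesteps the extra headers python-requests adds, since the request on the wire is byte-for-byte what you copied.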

iansinnott 2 days ago

In short: Don't use HTML endpoints, use APIs.

This is not always possible, but if the product in question has a mobile app or a wearable talking to a server, you might be able to utilize the same API it's using:

- intercept requests from the device

- find relevant auth headers/cookies/params

- use that auth to access the API
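Once the app's traffic has been intercepted (e.g. with a proxy like mitmproxy), replaying its auth is mostly copying headers. Everything below (endpoint URL, token, User-Agent string) is a hypothetical placeholder:

```python
from urllib.request import Request


def app_api_request(url, token):
    # Reuse the mobile app's own auth: the bearer token and UA string
    # stand in for values captured with an intercepting proxy sitting
    # between the device and the vendor's server.
    return Request(url, headers={
        "Authorization": f"Bearer {token}",
        "User-Agent": "FitnessApp/5.2 (Android 14)",  # mimic the app
    })


# Usage (endpoint is made up):
#   from urllib.request import urlopen
#   resp = urlopen(app_api_request("https://api.example.com/v1/activities", token))
```

The mobile API is usually far less defended than the website: no Cloudflare JS challenges, since a watch can't run them either.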

bobbylarrybobby 2 days ago

On a Mac, I use Keyboard Maestro, which can interact with the UI (which is usually stable enough to form an interface of sorts): wait for a graphic to appear on screen, then click it, simulate keystrokes, run JavaScript on the current page and get a result back... It looks very human to a website in a browser, and is nearly as easy to write as Python.

whilenot-dev 2 days ago

If requests gives you 403 headaches, just pass the session cookies to a Playwright instance and you should be good to go. Just did that for scraping the SAP Software Download Center.
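A sketch of that handoff; the conversion helper is mine, not Playwright API, and it produces the dicts (name/value/domain/path) that Playwright's `BrowserContext.add_cookies()` expects:

```python
def to_playwright_cookies(cookies, domain):
    # Reshape simple name->value cookies (e.g. from a requests
    # session's cookies.get_dict()) into the dict format Playwright's
    # context.add_cookies() accepts.
    return [
        {"name": name, "value": value, "domain": domain, "path": "/"}
        for name, value in cookies.items()
    ]


# Usage (sketch; assumes requests and playwright are installed):
#   s = requests.Session()
#   ...log in / collect cookies with s...
#   ctx = browser.new_context()
#   ctx.add_cookies(to_playwright_cookies(s.cookies.get_dict(), ".example.com"))
#   ctx.new_page().goto("https://example.com/downloads")
```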

lambdaba 2 days ago

I've found selenium with undetected-chromedriver to work best.

  • unsupp0rted 2 days ago

    Doesn't get around Cloudflare's anti-bot

    • lambdaba 2 days ago

      Ah, ok. I found it worked with YouTube, unlike regular chromedriver; I didn't encounter Cloudflare when I used it.