Comment by palmfacehn 18 hours ago

My impression is that it's less effort for them to go directly to headless browsers. There are several footguns in using a raw HTML parsing lib and dispatching HTTP requests yourself. People don't care about resource usage, spammers even less, and many of them lack the skills.
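
For the curious, a minimal sketch of the raw approach (Python; the URL is a placeholder), with a few of those footguns annotated:

    import requests
    from bs4 import BeautifulSoup

    # Footgun: no timeout means one slow host can hang a worker forever.
    resp = requests.get("https://example.com/page", timeout=10)

    # Footgun: servers lie about charsets; the header-based guess is
    # often wrong, so fall back to sniffing the content.
    resp.encoding = resp.apparent_encoding

    # Footgun: real-world HTML is frequently malformed, so use a
    # lenient parser rather than a strict XML one.
    soup = BeautifulSoup(resp.text, "html.parser")
    print(soup.title.string if soup.title else "<no title>")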

marginalia_nu 18 hours ago

Most black hat spammers use botnets, especially against bigger targets, which have enough traffic to build statistics to fingerprint clients, map out bad ASNs, and so on. And most botnets are low-powered: you're not running Chrome on a smart fridge or an enterprise router.

  • gnfargbl 17 hours ago

    True, but the bad actor's code doesn't typically run directly on the infected device; the infected router or camera usually just acts as a proxy.

    • mike_hearn 16 hours ago

      There are ways to detect that, and it still requires a lot of CPU and RAM behind the proxies.

  • desdenova 16 hours ago

    Chrome is probably the worst possible browser to run for these things, so it's not a fair basis for comparison.

    There are many smaller browsers that run JavaScript and work on low-powered devices as well.

    Starting from WebKit and stripping out the rendering parts, keeping just enough to execute JavaScript and process the DOM, the RAM usage would be significantly lower.

supriyo-biswas 18 hours ago

A major player in this space is apparently looking for people experienced in scraping without using browser automation. My guess is that not running a browser uses far fewer resources, which cuts their costs heavily.

Running a headless browser also means that any differences between the headless environment and a "headed" one can be detected, as can any of your JavaScript executing within the page, which makes it significantly more difficult to scale your operation.
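
One well-known example of such a difference (Python sketch using Playwright; navigator.webdriver is a real automation signal, the target URL is a placeholder):

    # pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com")
        # Under automation this is True; for a normal user it's
        # False/undefined. Sites probe exactly this kind of signal.
        print(page.evaluate("() => navigator.webdriver"))
        browser.close()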

  • marginalia_nu 18 hours ago

    My experience is that headless browsers use about 100x more RAM, at least 10x more bandwidth, and 10x more processing power, and page loads take about 10x as long to finish (vs curl). Those numbers may even be a bit low; in some instances you need to add another zero to one or more of them.

    There's also considerably more jank with headless browsers, since you typically want to re-use instances to avoid incurring the cost of spawning a new browser for each retrieval.

    • lozenge 17 hours ago

      Is it possible to pause a VM just after the browser has started up, then map its memory copy-on-write and spin up many VMs from that "image"?
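
      The same copy-on-write idea is easy to demo at the process level; a toy Python sketch (not how you'd snapshot a real multi-process browser like Chrome, just the memory-sharing mechanism):

          import os
          import time

          # Simulate the expensive warm-up (think: browser startup):
          # roughly 100 MB of distinct strings held in memory.
          state = ["x" * 1024 + str(i) for i in range(100_000)]

          for _ in range(8):
              if os.fork() == 0:
                  # Child: shares the parent's pages copy-on-write, so
                  # physical memory is only duplicated for pages it writes.
                  time.sleep(1)  # stand-in for doing one retrieval
                  os._exit(0)

          for _ in range(8):
              os.wait()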

      • supriyo-biswas 17 hours ago

        Your comment is interesting, and some people are doing work on this, though not specific to browser automation. For example, AWS Lambda SnapStart essentially boots your Java Lambda code, freezes the Firecracker MicroVM's snapshot, and then starts other Lambda invocations from it.

        However, even with a VM approach, you lose the fact that a small box (~512 MB) can make hundreds or thousands of requests per second when restricted to plain HTTP(S). Once you're booting up a headless browser, you're probably limited to loading no more than 3-4 pages per second.
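
        For scale, a sketch of the HTTP-only side of that comparison (Python/aiohttp; the URLs and concurrency cap are illustrative). Each in-flight request costs kilobytes, not the hundreds of MB a browser instance does:

            import asyncio
            import aiohttp

            URLS = [f"https://example.com/page/{i}" for i in range(1000)]

            async def fetch(session, sem, url):
                async with sem:  # cap concurrency to protect the small box
                    async with session.get(url) as resp:
                        return await resp.text()

            async def main():
                sem = asyncio.Semaphore(500)  # hundreds in flight on ~512 MB
                timeout = aiohttp.ClientTimeout(total=10)
                async with aiohttp.ClientSession(timeout=timeout) as session:
                    pages = await asyncio.gather(
                        *(fetch(session, sem, u) for u in URLS),
                        return_exceptions=True)
                print(sum(isinstance(p, str) for p in pages), "fetched")

            asyncio.run(main())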

      • marginalia_nu 14 hours ago

        ... but then you have even larger overhead, as well as the added layer of complexity from managing VMs on top of headless browsers.

    • palmfacehn 17 hours ago

      On the other hand, you need to get the basics right: match the headers, sometimes request irrelevant resources, handle malformed documents, catch changing form parameters, and other gotchas. Many would just copy the request from the browser console.
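
      Roughly what that looks like in practice (Python sketch; the header values and form field name are illustrative):

          import requests
          from bs4 import BeautifulSoup

          # Headers copied from a real browser session; values here are
          # placeholders. Mismatched or missing headers get flagged.
          headers = {
              "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
              "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
              "Accept-Language": "en-US,en;q=0.9",
          }

          s = requests.Session()
          page = s.get("https://example.com/form", headers=headers, timeout=10)

          # The "changing form parameters" gotcha: re-read the token on
          # every request instead of hardcoding one captured value.
          soup = BeautifulSoup(page.text, "html.parser")
          token = soup.find("input", {"name": "csrf_token"})["value"]

          s.post("https://example.com/form", headers=headers, timeout=10,
                 data={"csrf_token": token, "q": "example"})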

  • fweimer 17 hours ago

    The change rate for Chromium is also so high that it's hard to spot the addition of code targeting whatever you are doing on the client side.

victorbjorklund 17 hours ago

It's so much more expensive and slow vs. just scraping the HTML. It's not hard to scrape raw HTML if the target is well-defined (like Google).