Comment by palmfacehn 18 hours ago

My impression is that it's less effort for them to go directly to headless browsers. There are several footguns in using a raw HTML parsing lib and dispatching HTTP requests yourself. People don't care about resource usage, spammers even less, and many of them lack the skills.
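
For the curious, a minimal sketch of the raw approach (Python; the URL is a placeholder), with a few of those footguns annotated:

    import requests
    from bs4 import BeautifulSoup

    # Footgun: no timeout means one slow host can hang a worker forever.
    resp = requests.get("https://example.com/page", timeout=10)

    # Footgun: servers lie about charsets; the header-based guess is
    # often wrong, so fall back to sniffing the content.
    resp.encoding = resp.apparent_encoding

    # Footgun: real-world HTML is frequently malformed, so use a
    # lenient parser rather than a strict XML one.
    soup = BeautifulSoup(resp.text, "html.parser")
    print(soup.title.string if soup.title else "<no title>")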

marginalia_nu 18 hours ago

Most black hat spammers use botnets, especially against bigger targets, which have enough traffic to build statistics to fingerprint clients, map out bad ASNs, and so on. And most botnets are low-powered: you're not running Chrome on a smart fridge or an enterprise router.

  • gnfargbl 17 hours ago

    True, but the bad actor's code doesn't typically run directly on the infected device; the infected router or camera usually just acts as a proxy.

    • mike_hearn 16 hours ago

      There are ways to detect that, and it still requires a lot of CPU and RAM behind the proxies.

  • desdenova 16 hours ago

    Chrome is probably the worst possible browser to run for these things, so it's not a fair basis for comparison.

    There are many smaller browsers that run JavaScript and work on low-powered devices as well.

    Starting from WebKit and stripping out the rendering parts, keeping just enough to execute JavaScript and process the DOM, the RAM usage would be significantly lower.

supriyo-biswas 18 hours ago

A major player in this space is apparently looking for people experienced in scraping without using browser automation. My guess is that not running a browser uses far fewer resources, which cuts their costs heavily.

Running a headless browser also means that any differences between the headless environment and a "headed" one can be detected, as can any of your JavaScript executing within the page, which makes it significantly more difficult to scale your operation.
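
One well-known example of such a difference (Python sketch using Playwright; navigator.webdriver is a real automation signal, the target URL is a placeholder):

    # pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com")
        # Under automation this is True; for a normal user it's
        # False/undefined. Sites probe exactly this kind of signal.
        print(page.evaluate("() => navigator.webdriver"))
        browser.close()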

  • marginalia_nu 18 hours ago

    My experience is that headless browsers use about 100x more RAM, at least 10x more bandwidth, and 10x more processing power, and page loads take about 10x as long to finish (vs curl). Those numbers may even be a bit low; in some instances you need to add another zero to one or more of them.

    There's also considerably more jank with headless browsers, since you typically want to re-use instances to avoid incurring the cost of spawning a new browser for each retrieval.

    • lozenge 17 hours ago

      Is it possible to pause a VM just after the browser has started up, then map its memory copy-on-write and spin up many VMs from that "image"?
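
      The same copy-on-write idea is easy to demo at the process level; a toy Python sketch (not how you'd snapshot a real multi-process browser like Chrome, just the memory-sharing mechanism):

          import os
          import time

          # Simulate the expensive warm-up (think: browser startup):
          # roughly 100 MB of distinct strings held in memory.
          state = ["x" * 1024 + str(i) for i in range(100_000)]

          for _ in range(8):
              if os.fork() == 0:
                  # Child: shares the parent's pages copy-on-write, so
                  # physical memory is only duplicated for pages it writes.
                  time.sleep(1)  # stand-in for doing one retrieval
                  os._exit(0)

          for _ in range(8):
              os.wait()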

      • supriyo-biswas 17 hours ago

        Your comment is interesting, and some people are doing work on this, though not specific to browser automation. For example, AWS Lambda SnapStart essentially boots your Java Lambda code, freezes the Firecracker MicroVM's snapshot, and then starts other Lambda invocations from it.

        However, even with a VM approach, you lose the fact that a small box (~512 MB) can make hundreds or thousands of requests per second when restricted to plain HTTP(S). Once you're booting up a headless browser, you're probably limited to loading no more than 3-4 pages per second.
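
        For scale, a sketch of the HTTP-only side of that comparison (Python/aiohttp; the URLs and concurrency cap are illustrative). Each in-flight request costs kilobytes, not the hundreds of MB a browser instance does:

            import asyncio
            import aiohttp

            URLS = [f"https://example.com/page/{i}" for i in range(1000)]

            async def fetch(session, sem, url):
                async with sem:  # cap concurrency to protect the small box
                    async with session.get(url) as resp:
                        return await resp.text()

            async def main():
                sem = asyncio.Semaphore(500)  # hundreds in flight on ~512 MB
                timeout = aiohttp.ClientTimeout(total=10)
                async with aiohttp.ClientSession(timeout=timeout) as session:
                    pages = await asyncio.gather(
                        *(fetch(session, sem, u) for u in URLS),
                        return_exceptions=True)
                print(sum(isinstance(p, str) for p in pages), "fetched")

            asyncio.run(main())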

      • marginalia_nu 14 hours ago

        ... but then you have even larger overhead, as well as the added layer of complexity from managing VMs on top of headless browsers.

    • palmfacehn 17 hours ago

      On the other hand, you need to get the basics right: match the headers, sometimes request irrelevant resources, handle malformed documents, catch changing form parameters, and other gotchas. Many would just copy the request from the browser console.
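
      Roughly what that looks like in practice (Python sketch; the header values and form field name are illustrative):

          import requests
          from bs4 import BeautifulSoup

          # Headers copied from a real browser session; values here are
          # placeholders. Mismatched or missing headers get flagged.
          headers = {
              "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
              "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
              "Accept-Language": "en-US,en;q=0.9",
          }

          s = requests.Session()
          page = s.get("https://example.com/form", headers=headers, timeout=10)

          # The "changing form parameters" gotcha: re-read the token on
          # every request instead of hardcoding one captured value.
          soup = BeautifulSoup(page.text, "html.parser")
          token = soup.find("input", {"name": "csrf_token"})["value"]

          s.post("https://example.com/form", headers=headers, timeout=10,
                 data={"csrf_token": token, "q": "example"})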

  • fweimer 17 hours ago

    The change rate for Chromium is also so high that it's hard to spot the addition of code targeting whatever you are doing on the client side.

victorbjorklund 17 hours ago

It's so much more expensive and slow vs. just scraping the HTML. It's not hard to scrape raw HTML if the target is well-defined (like Google).