Comment by dataviz1000 12 hours ago

You might be able to reduce the amount of information sent to the LLM a hundredfold if you use a stacking context. Here is an example of one made available on GitHub (not mine). [0] Moreover, you will be able to parse the DOM, or have multiple strategies for parsing the DOM. For example, if you are only concerned with video, find all the videos and send only that information. Perhaps parse a page once to find its structure, and cache that so that on the next visit only the required data is used. (I see you are storing tool sequences, but I didn't find an example of storing a DOM structure so that requests to subsequent pages are optimized.)
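A minimal sketch of the parse-once-and-cache idea. The names (`StructureCache`, the `:id` normalization) are mine, not from any of the projects mentioned; the point is just that pages sharing a template can share one cached set of selectors, so repeat visits skip the full DOM parse:

```javascript
// Hypothetical sketch: cache a page's parsed structure per URL pattern so a
// later visit to a sibling page reuses the selectors instead of re-parsing.
class StructureCache {
  constructor() {
    this.cache = new Map();
  }
  key(url) {
    const u = new URL(url);
    // Collapse numeric path segments so /product/123 and /product/456
    // map to the same template key.
    return u.hostname + u.pathname.replace(/\d+/g, ':id');
  }
  get(url) {
    return this.cache.get(this.key(url));
  }
  set(url, structure) {
    this.cache.set(this.key(url), structure);
  }
}

const cache = new StructureCache();
cache.set('https://example.com/product/123', { title: 'h1', price: '.price' });
// Sibling page, same template: cache hit, no re-parse needed.
const laterVisit = cache.get('https://example.com/product/456');
```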

If someone visits a website that I control using your Chrome Extension, I will 100% be able to find a way to drain all their accounts, probably in the background without them even knowing. Here are some ideas about how to mitigate that.

The problem with Playwright is that it requires the Chrome DevTools Protocol (CDP), which opens massive security problems for a browser that people use for their banking and for managing anything that involves credit cards or sensitive accounts. At one point, I took the injected folder out of Playwright and injected it into a Chrome Extension because I thought I needed its tools; however, I quickly abandoned that because it was easy to create the workflows from scratch. You get a lot of stuff immediately by using Playwright, but you will likely find it much lighter and safer to implement that functionality yourself.

The only benefit of CDP for normal use is that it allows automation of any action that requires trusted events, e.g. playing sound, going fullscreen, or banking websites that require a trusted event to transfer money. In my opinion, people just want a large part of the workflow automated and don't mind being prompted to click a button when trusted events are required. Since it doesn't matter which button is clicked, you can inject a big button that says "Continue" (or whatever is required) after prompting the user. Trusted events are there for a reason.
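A hedged sketch of the inject-a-button idea (the function names are mine, not from the extension): instead of using CDP to synthesize events, prompt the user and let their real click supply the trusted event. `buttonHtml()` is pure so it can run outside a browser; `injectContinueButton()` needs a page DOM:

```javascript
// Build the markup for a fixed-position button drawn on top of everything.
// The label can say whatever the workflow needs ("Continue", "Confirm", ...).
function buttonHtml(label) {
  return (
    '<button style="position:fixed;bottom:16px;right:16px;' +
    'z-index:2147483647;padding:12px 24px;font-size:16px;">' +
    label +
    '</button>'
  );
}

// Resolve once the user clicks; ev.isTrusted is true here because a real
// person clicked, so the page accepts the gesture CDP would have faked.
function injectContinueButton(label) {
  return new Promise((resolve) => {
    const wrap = document.createElement('div');
    wrap.innerHTML = buttonHtml(label);
    const btn = wrap.firstElementChild;
    btn.addEventListener('click', (ev) => {
      btn.remove();
      resolve(ev);
    });
    document.body.appendChild(btn);
  });
}
```

The workflow would `await injectContinueButton('Continue')` at each step that needs a trusted gesture, then carry on automatically.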

[0] https://github.com/andreadev-it/stacking-contexts-inspector

kanzure 6 hours ago

possibly something like https://github.com/romansky/dom-to-semantic-markdown could also help for this use case.

  • dataviz1000 2 hours ago

    That is awesome. A list of power tools on Amazon went from 2.5MB of HTML to 236KB of markdown. That is huge! Wow, thank you for sharing.

    This is half the equation. A lot of the information in the markdown can also be used to query elements to interact with, because it keeps the link locations, which can be used to navigate or to select elements. On the other hand, by using the stacking context, it is possible to query only the elements that are visible, which removes all the elements that can't be interacted with.
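A sketch of that "half the equation": the markdown keeps link targets, so the model's answer can be mapped back to a clickable element by href instead of re-sending the DOM. `linkSelectors()` is illustrative, not part of dom-to-semantic-markdown or the extension:

```javascript
// Extract [text](href) pairs from the markdown and turn each into an
// attribute selector that finds the corresponding anchor on the live page.
function linkSelectors(markdown) {
  const selectors = [];
  for (const m of markdown.matchAll(/\[([^\]]+)\]\(([^)\s]+)\)/g)) {
    selectors.push({ text: m[1], selector: 'a[href="' + m[2] + '"]' });
  }
  return selectors;
}
```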

  • parsabg 4 hours ago

    Looks powerful, at least for read-only use cases. Will have a look and compare token stats. Thanks

parsabg 11 hours ago

I will look into this. Speed and inefficiency due to the low information density of raw DOM tokens is the single biggest issue for this type of thing right now.

  • dataviz1000 7 hours ago

    Here are some ideas on how to cache selectors for reuse, and how to get all the text for a full-text search to find clickable elements (slow, but still faster than a round trip to an LLM). [0] These are very naive, but optimizations like this are the only place there is money in doing this. If you create 100 of them, like selecting only visible elements, or only selectors that contain video when the context is video, you can greatly limit the amount of useless data being sent to the LLM.

    [0] https://chatgpt.com/c/682a2edf-e668-8004-a8ce-568d5dd0ec1c
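A naive version of the cached-selector idea from the comment above, with names of my own choosing: scrape clickable elements once, index their text locally, and answer "find the X button" queries without a model call. The records below stand in for what a content script would collect from the live page:

```javascript
// Build a lowercase full-text index over clickable elements scraped once
// per page; each record pairs the visible text with a reusable selector.
function buildClickIndex(elements) {
  return elements.map((e) => ({
    text: e.text,
    selector: e.selector,
    norm: e.text.toLowerCase(),
  }));
}

// Substring search over the index: slow, but far cheaper than a round trip
// to the LLM, which is only consulted when the local search misses.
function findClickable(index, query) {
  const q = query.toLowerCase();
  return index
    .filter((e) => e.norm.includes(q))
    .map((e) => ({ text: e.text, selector: e.selector }));
}

const index = buildClickIndex([
  { text: 'Add to Cart', selector: '#add-to-cart' },
  { text: 'Checkout', selector: '#checkout' },
]);
const hits = findClickable(index, 'cart');
```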