Comment by parsabg
I will look into this. Speed and inefficiency due to the low information density of raw DOM tokens is the single biggest issue for this type of thing right now.
I tried to make it cleaner and more organized, with code followed by output, but I think I made it worse by not explaining what it does or why. [0] Sorry. It is just a set of examples of how to query the DOM to isolate the most important information.
I'm not certain (I'm supposed to be working on something else, so sorry if I'm wrong here), but I believe this is the code Browser Use uses to compute stacking context, including piercing the shadow DOM. [1] Because they build a map of all the visible elements, they can inject different-colored borders around them. Here they test for the topmost elements in the viewport. [2]
[0] https://chatgpt.com/share/682a68bf-c6a0-8004-9c20-15508e6b3b...
[1] https://github.com/browser-use/browser-use/blob/55d078ed5a49...
[2] https://github.com/browser-use/browser-use/blob/55d078ed5a49...
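As a rough sketch of the piercing idea (not Browser Use's actual code; the function names and visibility check here are my own illustrations), you can walk the tree and descend into any open shadow root you encounter, filtering out elements that are obviously invisible:

```javascript
// Sketch: collect visible elements, descending into open shadow roots.
// isVisible() is a deliberately naive stand-in; a real implementation
// would use getComputedStyle, getBoundingClientRect, and a topmost
// check such as document.elementFromPoint(cx, cy) === el.
function isVisible(el) {
  const s = el.style || {};
  return s.display !== "none" && s.visibility !== "hidden";
}

function collectVisible(root, out = []) {
  for (const el of root.children || []) {
    if (isVisible(el)) out.push(el);
    // Pierce the shadow DOM: recurse into the open shadow root if present.
    if (el.shadowRoot) collectVisible(el.shadowRoot, out);
    collectVisible(el, out);
  }
  return out;
}
```

Closed shadow roots are not reachable this way; tools that need them typically hook `attachShadow` before page scripts run.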
Here are some ideas on how to cache selectors for reuse and extract all the page text for full-text search to find clickable elements. [0] It is slow, but still faster than a round trip to an LLM. These are very naive, but that is the only place there is money in doing this. If you build a hundred optimizations like these (for example, only selecting visible elements, or only selectors containing video when the context is video), you can greatly limit the amount of useless data being sent to the LLM.
[0] https://chatgpt.com/c/682a2edf-e668-8004-a8ce-568d5dd0ec1c
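A minimal sketch of the local full-text-search idea (all names here are hypothetical, not from the linked chat): index the clickable elements once, then answer "find the button that says X" queries by substring match instead of shipping the DOM to a model. In a real page you would cache the index per document and invalidate it with a MutationObserver.

```javascript
// Sketch: build a one-time index of clickable elements keyed by their
// text, so lookups are a local substring match rather than an LLM call.
const CLICKABLE_TAGS = new Set(["A", "BUTTON"]);

function indexClickable(root, index = []) {
  for (const el of root.children || []) {
    if (CLICKABLE_TAGS.has(el.tagName) || el.onclick) {
      index.push({ el, text: (el.textContent || "").trim().toLowerCase() });
    }
    indexClickable(el, index);
  }
  return index;
}

function findClickable(index, query) {
  const q = query.toLowerCase();
  return index.filter((entry) => entry.text.includes(q)).map((entry) => entry.el);
}
```

The same index can carry a cached CSS selector per entry, so a match can be replayed later without re-walking the DOM.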