Comment by chhxdjsj

Comment by chhxdjsj 8 hours ago

Not so relevant to the thread but ive been uploading screenshots from citrix guis and asking qwen3-vl for the appropriate next action eg Mouseclick, and while it knows what to click it struggles to accurately return which pixel coordinates to click. Anyone know a way to get accurate pixel coordinates returned?

spherelot 5 hours ago

How do you prompt the model? In my experience, Qwen3-VL models have very accurate grounding capabilities (I’ve tested Qwen3-VL-30B-A3B-Instruct, Qwen3-VL-30B-A3B-Thinking, and Qwen3-VL-235B-A22B-Thinking-FP8).

Note that the returned values are not direct pixel coordinates. Instead, they are normalized to a 0–1000 range. For example, if you ask for a bounding box, the model might output:

```json [ {"bbox_2d": [217, 112, 920, 956], "label": "cat"} ] ```

Here, the values represent [x_min, y_min, x_max, y_max]. To convert these to pixel coordinates, use:

[x_min / 1000 * image_width, y_min / 1000 * image_height, x_max / 1000 * image_width, y_max / 1000 * image_height]

Also, if you’re running the model with vLLM > 0.11.0, you might be hitting this bug: https://github.com/vllm-project/vllm/issues/29595

Reply View 1 reply

chhxdjsj 5 hours ago

Will give this a go, cheers :)

Reply View | 0 replies

logankeenan 7 hours ago

It’s been about a year since I looked into this sort of thing, but molmo will give you x,y coordinates. I hacked together a project about it. I also think Microsoft’s omniparser is good at finding coordinates too.

https://huggingface.co/allenai/Molmo-7B-D-0924

https://github.com/logankeenan/george

https://github.com/microsoft/OmniParser

Reply View 1 reply

chhxdjsj 5 hours ago

Thanks ill try this!

Reply View | 0 replies

jazzyjackson 8 hours ago

Could you combine it with a classic OCR segmentation process, so that along with the image you also provide box coordinates of each string?

Reply View 0 replies

hamasho 7 hours ago

It's very not accurate, but sometimes instructing to return pyautogui code works.

  prompt: I attach a screenshot (1920x1080). Write code to click the submit button using pyautogui.
  attachment: <screenshot>
  reply:
    import pyautogui
    pyautogui.click(100, 200)

Reply View 1 reply

chhxdjsj 5 hours ago

Ive been asking for pyautogui output already but it is still very hit and miss

Reply View | 0 replies

8f2ab37a-ed6c 8 hours ago

Also curious about this. I tried https://moondream.ai/ as well for this task and it felt still far from being bulletproof.

Reply View 0 replies

visioninmyblood 7 hours ago

you want get the exact coordinated by running a key point network to pinpoint which coordinates does the next click point is you can. here I show a example simple prompt which returns the keypoint location of the next botton to click and visually localize the point with a keypoint in the image

https://chat.vlm.run/c/e12f0153-7121-4599-9eb9-cd8c60bbbd69

Reply View 0 replies