Comment by spherelot
How do you prompt the model? In my experience, Qwen3-VL models have very accurate grounding capabilities (I’ve tested Qwen3-VL-30B-A3B-Instruct, Qwen3-VL-30B-A3B-Thinking, and Qwen3-VL-235B-A22B-Thinking-FP8).
Note that the returned values are not direct pixel coordinates. Instead, they are normalized to a 0–1000 range. For example, if you ask for a bounding box, the model might output:
```json [ {"bbox_2d": [217, 112, 920, 956], "label": "cat"} ] ```
Here, the values represent [x_min, y_min, x_max, y_max]. To convert these to pixel coordinates, use:
[x_min / 1000 * image_width, y_min / 1000 * image_height, x_max / 1000 * image_width, y_max / 1000 * image_height]
Also, if you’re running the model with vLLM > 0.11.0, you might be hitting this bug: https://github.com/vllm-project/vllm/issues/29595
Will give this a go, cheers :)