Qwen3-VL can scan two-hour videos and pinpoint nearly every detail
(the-decoder.com) 197 points by thm 3 days ago
you can do that with Morphik already :)
We use an embedding model that processes videos and allows you to perform RAG on them.
It’s not difficult to hack this together with CLIP. I did this with about a tenth of my movie collection last week on a GTX 1080, though it lacks temporal understanding, so you have to do the scene analysis yourself.
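A minimal sketch of the do-it-yourself scene analysis mentioned above, assuming you already have one embedding vector per sampled frame (for example from CLIP's image encoder). The function names, the toy vectors, and the 0.8 threshold are hypothetical; this is a common cosine-similarity heuristic, not necessarily what the commenter did.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def split_scenes(frame_embeddings, threshold=0.8):
    """Group consecutive frame indices into scenes.

    A new scene starts whenever the similarity to the previous frame
    drops below `threshold` (an assumed value; tune it on your data).
    """
    scenes = []
    for i, emb in enumerate(frame_embeddings):
        if i == 0 or cosine(frame_embeddings[i - 1], emb) < threshold:
            scenes.append([i])
        else:
            scenes[-1].append(i)
    return scenes

# Toy 2-D vectors standing in for real frame embeddings.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(split_scenes(frames))  # [[0, 1], [2, 3]]
```

Each scene can then be summarized or searched as a unit, which gives a rough form of temporal grouping that per-frame embeddings lack.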
Does anyone else worry about this technology being used for Big Brother type surveillance?
Where have you been the last decade? It, or models like it, is already in use by companies selling access to The State.
Not to mention cloud platforms that collect evidence and process it with all the models and store that information for searching…
Palantir's just the new guy on the block: https://en.wikipedia.org/wiki/Sentient_(intelligence_analysi...
It was already used before the current AI explosion.
This is why it matters to keep our governments from eating that tasty apple of "if you can record AND analyse everything there will be so much less crime" and "just give us keys to all private communication, we swear we will just use it to find bad guys". Because someone will, and someone will use it to go after people they don't like.
How do you think this tech was developed in the first place? It's probably trained and used in the surveillance bid for a decade before it comes to consumers, and this probably isn't the SoA stuff that governments have access to, we're probably 5-10 years behind what's on the cutting edge.
I wouldn’t bet on it. IT innovation used to be led by the defence industry, but that has changed, and from what I have been told consumer technology is now driving the innovation.
I’m sure they have some cool secret stuff, but they are perhaps not 10 years ahead. Also, I find it unlikely that those secrets wouldn’t make it to the public now, as we are probably close to the top of the AI bubble.
We got Facial Rec and LPR first, those are more dangerous for surveillance.
In surveillance and police states like The Netherlands it has been used since forever:
https://www.theguardian.com/cities/2018/mar/01/smart-cities-...
Now people will say again that this project has been abandoned, which just isn't true (2024):
https://www.dutchnews.nl/2024/06/smart-street-surveillance-o...
I would be surprised if this hasn't existed for a few decades already.
Back in 2009 I was working at a place where O2 was a client, and they gave us an API that could identify the cell tower (inc. lat/lng) any of their customers were connected to. The network needs to track this data internally to function, so the API is basically the equivalent of their DNS.
Big Brother is a reference to George Orwell's critique of Communism in Nineteen Eighty-Four.
Qwen is a video model trained by a Communist government, or technically by a company with very close ties to the Chinese government. The Chinese government also has laws requiring AI be used to further the political goals of China in particular and authoritarian socialism in general.
In the light of all this, I think it's reasonable to conclude that this technology will be used for Big Brother type surveillance and quite possible that it was created explicitly for that purpose.
Just nitpicking here, but 1984 is a critique of totalitarianism. The only references to systems of government in the book refer to "The German Nazis and the Russian Communists".
Orwell was a democratic socialist. He was opposed to totalitarian politics, not communism per se.
It's true that it's about totalitarianism to some extent. But we have Orwell's actual words here that it's chiefly about communism:
> [Nineteen Eighty-Four] was based chiefly on communism, because that is the dominant form of totalitarianism, but I was trying chiefly to imagine what communism would be like if it were firmly rooted in the English speaking countries, and was no longer a mere extension of the Russian Foreign Office.
And of course Animal Farm is only about communism (as opposed to communism + fascism). And the lesser known Homage to Catalonia depicts the communist suppression of other socialist groups.
By all this I just mean to say when you're reading Nineteen Eighty-Four what he's describing is barely a fictionalization of what was already going on in the Soviet Union. There's just not a lot in the book that is specifically Nazi or Fascist.
I don't have any opinion on whether he thought there were non-totalitarian forms of communism.
Not so relevant to the thread, but I've been uploading screenshots from Citrix GUIs and asking Qwen3-VL for the appropriate next action, e.g. a mouse click. While it knows what to click, it struggles to accurately return which pixel coordinates to click. Anyone know a way to get accurate pixel coordinates returned?
How do you prompt the model? In my experience, Qwen3-VL models have very accurate grounding capabilities (I’ve tested Qwen3-VL-30B-A3B-Instruct, Qwen3-VL-30B-A3B-Thinking, and Qwen3-VL-235B-A22B-Thinking-FP8).
Note that the returned values are not direct pixel coordinates. Instead, they are normalized to a 0–1000 range. For example, if you ask for a bounding box, the model might output:
```json
[
  {"bbox_2d": [217, 112, 920, 956], "label": "cat"}
]
```
Here, the values represent [x_min, y_min, x_max, y_max]. To convert these to pixel coordinates, use:
[x_min / 1000 * image_width, y_min / 1000 * image_height, x_max / 1000 * image_width, y_max / 1000 * image_height]
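As a small sketch of that conversion in Python, using the example output above: the function names and the 1920x1080 screenshot size are hypothetical, and rounding to whole pixels plus clicking the box center are my own assumptions rather than anything the model prescribes.

```python
import json

def bbox_to_pixels(bbox_2d, image_width, image_height, scale=1000):
    """Convert a [x_min, y_min, x_max, y_max] box from the 0-1000
    normalized range to integer pixel coordinates."""
    x_min, y_min, x_max, y_max = bbox_2d
    return [
        round(x_min / scale * image_width),
        round(y_min / scale * image_height),
        round(x_max / scale * image_width),
        round(y_max / scale * image_height),
    ]

def click_point(bbox_2d, image_width, image_height):
    """Center of the box in pixels, a reasonable place to click."""
    x0, y0, x1, y1 = bbox_to_pixels(bbox_2d, image_width, image_height)
    return ((x0 + x1) // 2, (y0 + y1) // 2)

# The example model output from above, for an assumed 1920x1080 screenshot.
raw = '[{"bbox_2d": [217, 112, 920, 956], "label": "cat"}]'
box = json.loads(raw)[0]["bbox_2d"]
print(bbox_to_pixels(box, 1920, 1080))  # [417, 121, 1766, 1032]
print(click_point(box, 1920, 1080))     # (1091, 576)
```

Use the dimensions of the image you actually sent; if your pipeline resizes the screenshot before inference, scale back to the original resolution before clicking.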
Also, if you’re running the model with vLLM > 0.11.0, you might be hitting this bug: https://github.com/vllm-project/vllm/issues/29595
It’s been about a year since I looked into this sort of thing, but Molmo will give you x,y coordinates. I hacked together a project about it. I think Microsoft’s OmniParser is good at finding coordinates too.
https://huggingface.co/allenai/Molmo-7B-D-0924
Could you combine it with a classic OCR segmentation process, so that along with the image you also provide box coordinates of each string?
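One way this idea could look, as a rough sketch: list the OCR strings with their pixel boxes in the prompt, ask the model for an index instead of raw coordinates, and compute the click point yourself. The helper names, prompt wording, and sample boxes are all hypothetical, and the OCR step itself is assumed to have been run already.

```python
def build_prompt(ocr_items, task):
    """ocr_items: list of (text, (x_min, y_min, x_max, y_max)) in pixels."""
    lines = [f"[{i}] {text!r} at {box}" for i, (text, box) in enumerate(ocr_items)]
    return (
        f"Task: {task}\n"
        "Text elements detected on screen:\n"
        + "\n".join(lines)
        + "\nAnswer with the index of the element to click."
    )

def center_of(ocr_items, index):
    """Pixel center of the OCR box the model selected."""
    _, (x0, y0, x1, y1) = ocr_items[index]
    return ((x0 + x1) // 2, (y0 + y1) // 2)

# Made-up OCR output for a dialog window.
items = [("File", (10, 5, 50, 25)), ("OK", (400, 300, 460, 330))]
print(build_prompt(items, "Confirm the dialog"))
print(center_of(items, 1))  # (430, 315)
```

This only helps for targets that carry text; icons and unlabeled controls would still need the model's own grounding.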
Also curious about this. I tried https://moondream.ai/ as well for this task and it still felt far from being bulletproof.
You can get the exact coordinates by running a keypoint network to pinpoint where the next click should go. Here I show an example of a simple prompt that returns the keypoint location of the next button to click and visually localizes the point with a keypoint in the image.
I was playing around with Qwen3-VL to parse PDFs - meaning, do some OCR data extraction from a reasonably well-formatted PDF report. Failed miserably, although I was using the 30B-A3B model instead of the larger one.
I like the Qwen models and use them for other tasks successfully. It is so interesting how LLMs will do quite well in one situation and quite badly in another.
The Opus models seem pretty adept at extracting structured data from OCR: https://www.ocrarena.ai/battle
I was using this for video understanding with inference from vlm.run infra. It has definitely outperformed Gemini, which is generally much better than OpenAI or Claude on videos. The detailed extraction is pretty good. With agents you can also crop into a segment and do more operations on it. We'll have to see how the multimodal space progresses:
link to results: https://chat.vlm.run/c/82a33ebb-65f9-40f3-9691-bc674ef28b52
Quick demo: https://www.youtube.com/watch?v=78ErDBuqBEo
I found it pretty funny how bad Claude was at cropping an image. It was a cute little character with some text off to the side on a white background, all very clean cartoon vibes and it COULD NOT just select the character. I pursued it for 20 minutes because I thought it was funny. Of course it was 45 seconds to do it myself.
A lot of my side projects involve UIs and almost all of my problems with getting LLMs to write them for me involve "The UI isn't doing what you say it's doing" and struggling to get A) a reliable way to get it to look at the UI so it can continue its loop and B) getting it to understand what it's looking at well enough to do something about it
I agree: Claude, ChatGPT, and even Gemini do a poor job of detecting and cropping into a region, which is one of the simplest tasks. Qwen is also great at summarization but not at solving simple vision tasks like cropping, segmentation, and detection. Here is an example where we compared Claude, Gemini, ChatGPT, and other frontier models on simple (and complicated) visual tasks: https://chat.vlm.run/showdown#:~:text=Crop%20into%20the%20cl...
The part that was funny to me is I would respond "is that right?" and it would tell me exactly how it was wrong and proceed to do it incorrectly again in a very similar but different way. It was like a Monty Python sketch. I might have also been very tired and easily amused.
Still not great at the use cases I tested it for but Gemini isn't either. I think we're still very early on video comprehension.
anyone have a tl;dr for me on what the best way to get the video comprehension stuff going is? i use qwen-30b-vl all the time locally as my goto model because it's just so insanely fast, curious to mess with the video stuff, the vision comprehension works great and i use it for OCR and classification all the time
I've used Qwen3-VL on deepwalker lately. All I can say is that this model is so underrated.
It's so weird how that works with transformers.
Finetuning an LLM "backbone" (if I understand correctly: a fully trained but not instruction tuned LLM, usually small because students) with OCR tokens bests just about every OCR network out there.
And it's not just OCR. Describing images. Bounding boxes. Audio, both ASR and TTS, all works better that way. Now many research papers are only really about how to encode image/audio/video to feed it into a Llama or Qwen model.
It is fascinating. Vision language models are unreasonably good compared to dedicated OCR and even the language tasks to some extent.
My take is it fits into the general concept that generalist models have significant advantages because so much more latent structure maps across domains than we expect. People still talk about fine tuning dedicated models being effective but my personal experience is it's still always better to use a larger generalist model than a smaller fine tuned one.
>People still talk about fine tuning dedicated models being effective
>it's still always better to use a larger generalist model than a smaller fine tuned one
Smaller fine-tuned models are still a good fit if they need to run on-premises cheaply and are already good enough. Isn't it their main use case?
> The test works by inserting a semantically important "needle" frame at random positions in long videos, which the system must then find and analyze.
This seems to be somewhat unwise. Such an insertion would qualify as an anomaly. And if it's also trained that way, would you not train the model to find artificial frames where they don't belong?
Would it not have been better to find a set of videos where something specific (common, rare, surprising, etc) happens at some time and ask the model about that?
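For reference, the needle test described in the quote can be sketched roughly as follows. Everything here is hypothetical: frames are plain strings, the "model" is a stand-in function, and the article does not say how the real benchmark scores answers.

```python
import random

def insert_needle(frames, needle, rng):
    """Return a copy of `frames` with `needle` inserted at a random
    position, plus that position."""
    pos = rng.randrange(len(frames) + 1)
    return frames[:pos] + [needle] + frames[pos:], pos

def run_trials(find_needle, num_frames, trials, seed=0):
    """Fraction of trials in which `find_needle` returns the true position."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        haystack = [f"frame-{i}" for i in range(num_frames)]
        video, pos = insert_needle(haystack, "NEEDLE", rng)
        if find_needle(video) == pos:
            hits += 1
    return hits / trials

def anomaly_detector(video):
    """Stand-in 'model' that simply spots the out-of-place frame."""
    return video.index("NEEDLE")

print(run_trials(anomaly_detector, num_frames=100, trials=20))  # 1.0
```

The stand-in scores perfectly by looking only for the frame that does not belong, which is the kind of shortcut the comment above is worried about.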