Comment by visioninmyblood 15 hours ago
I was using this for video understanding, with inference from vlm.run infra. It has definitely outperformed Gemini, which is generally much better than OpenAI or Claude on videos. The detailed extraction is pretty good. With agents you can also crop into a segment and run more operations on it. We'll have to see how the multimodal space progresses.
Link to results: https://chat.vlm.run/c/82a33ebb-65f9-40f3-9691-bc674ef28b52
Quick demo: https://www.youtube.com/watch?v=78ErDBuqBEo
I found it pretty funny how bad Claude was at cropping an image. It was a cute little character with some text off to the side on a white background, all very clean cartoon vibes, and it COULD NOT just select the character. I pursued it for 20 minutes because I thought it was funny. Of course, it took me 45 seconds to do it myself.
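For an image like the one described (a clean character on a white background with some text off to the side), the by-hand crop can be roughly approximated without a model. What follows is only a minimal sketch, not anything from the thread: it assumes the image is already loaded as a list of rows of grayscale values, that pure white (255) is the background, and that the character is the single largest connected blob. The names `largest_blob_bbox` and `crop` are hypothetical.

```python
from collections import deque


def largest_blob_bbox(pixels, is_background=lambda p: p == 255):
    """Return (left, top, right, bottom) for the largest connected
    non-background region, with right/bottom exclusive.

    `pixels` is a list of rows of grayscale values (an assumption made
    for this sketch). Returns None if every pixel is background.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    best_size, best_box = 0, None

    for y in range(height):
        for x in range(width):
            if seen[y][x] or is_background(pixels[y][x]):
                continue
            # Flood-fill this blob (4-connectivity), tracking its extent.
            queue = deque([(x, y)])
            seen[y][x] = True
            size = 0
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                size += 1
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx]
                            and not is_background(pixels[ny][nx])):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            # Keep the blob with the most pixels.
            if size > best_size:
                best_size = size
                best_box = (left, top, right + 1, bottom + 1)
    return best_box


def crop(pixels, box):
    """Slice the pixel grid down to the given bounding box."""
    left, top, right, bottom = box
    return [row[left:right] for row in pixels[top:bottom]]
```

This breaks down if the character is drawn in disconnected pieces or the background is not uniformly white, so treat it as an illustration of why the task feels trivial to a person rather than as a robust tool.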
A lot of my side projects involve UIs, and almost all of my problems with getting LLMs to write them for me come down to "the UI isn't doing what you say it's doing" and then struggling with A) finding a reliable way to get it to look at the UI so it can continue its loop and B) getting it to understand what it's looking at well enough to do something about it.