Comment by lelag
Really interesting model, I'm looking forward to playing with it.
But what I want is a multimodal agent model capable of generating embeddings for a humanoid control model like Meta Motivo[0], rather than directly outputting coordinates.
Meta Motivo is still a toy model, trained on the SMPL skeleton, which lacks fingers and so limits its usefulness beyond having some fun with it. They could have used a more advanced base model, SMPL-X, which includes fingers, but there isn't enough open motion data with precise finger motion to train a robust manipulation model anyway.
Most existing motion datasets come from academic motion capture setups, which are complex, not focused on manipulation tasks, and also pretty old. I believe this gap will be filled by improvements in 3D human pose estimation (HPE) from 2D video. With access to thousands of hours of video, we can build large-scale motion datasets covering a wide range of real-world interactions.
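To make that concrete, here is a rough Python sketch of what mining a motion dataset from ordinary video could look like. Everything here is illustrative: estimate_smplx_params is a placeholder for whatever 3D HPE model you would actually plug in (a per-frame SMPL-X regressor, a temporal model, etc.).

    import cv2
    import numpy as np

    def estimate_smplx_params(frame):
        """Placeholder: return SMPL-X pose parameters (body + hands) for one frame."""
        raise NotImplementedError("plug in your 3D HPE model of choice here")

    def video_to_motion_clip(path, stride=2):
        """Turn one video file into a (num_frames, num_params) motion clip."""
        cap = cv2.VideoCapture(path)
        poses, i = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if i % stride == 0:
                poses.append(estimate_smplx_params(frame))
            i += 1
        cap.release()
        return np.stack(poses)

Run something like that over thousands of hours of online footage and you get the kind of manipulation-heavy motion dataset that academic mocap never produced.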
This would enable training the two components needed for dexterous humanoid robots: an agentic model that decides what actions to take and generates embeddings, and a control model that reads those embeddings and accurately drives hand and finger joint movement.
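In code, the split I have in mind looks roughly like the sketch below (PyTorch, with made-up module names and dimensions; Meta Motivo's actual interface is more involved than this):

    import torch
    import torch.nn as nn

    EMBED_DIM = 256    # size of the behaviour embedding z (assumption)
    PROPRIO_DIM = 358  # proprioceptive state dimension (assumption)
    NUM_JOINTS = 51    # body + finger joints for an SMPL-X-like skeleton (assumption)

    class AgenticModel(nn.Module):
        """Decides *what* to do: task/vision features in, behaviour embedding z out."""
        def __init__(self, ctx_dim=1024):
            super().__init__()
            self.head = nn.Sequential(nn.Linear(ctx_dim, 512), nn.ReLU(),
                                      nn.Linear(512, EMBED_DIM))

        def forward(self, context):
            return self.head(context)

    class ControlModel(nn.Module):
        """Decides *how* to do it: (proprioceptive state, z) in, joint targets out."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(PROPRIO_DIM + EMBED_DIM, 512), nn.ReLU(),
                                     nn.Linear(512, NUM_JOINTS))

        def forward(self, proprio, z):
            return self.net(torch.cat([proprio, z], dim=-1))

The nice property of this split is that the control model can run at high frequency on the robot while the agentic model only has to refresh z when the intent changes.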
Given the rapid progress in SoTA 3D HPE from 2D video, and the vast amount of video online (YouTube), I expect we will see humanoid robots with good manipulation capabilities in the not-so-distant future.
Some more thoughts about training a manipulation model: I would add that synthetic data might be key to making it happen.
One issue is that most video is not shot in first person, so it might make for a poor dataset for the agentic part, assuming the robot has human-like vision.
Still, if you have a large dataset of motion capture data with reasonably accurate finger movement, you could use a video diffusion model with a ControlNet to get a realistic-looking video of a specific motion in first person. Another way would be to use a model like DUSt3R to generate a geometric 3D scene from the initial video, allowing you to change the camera angle to match a first-person view.
This could be used as the dataset for the agentic model.
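For the diffusion route, a crude per-frame sketch with Hugging Face diffusers and an OpenPose ControlNet is below. A real pipeline would need a video diffusion model for temporal consistency, and mocap_to_pose_image is a placeholder you would have to write (project the mocap skeleton into a first-person camera and rasterise it as an OpenPose-style image).

    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
        torch_dtype=torch.float16).to("cuda")

    def mocap_to_pose_image(mocap_frame):
        """Placeholder: render the skeleton from a first-person camera as a
        PIL image in OpenPose format, to condition the diffusion model on."""
        raise NotImplementedError

    def synthesize_first_person_frames(mocap_clip,
                                       prompt="first person view of hands washing dishes"):
        frames = []
        for mocap_frame in mocap_clip:
            pose_image = mocap_to_pose_image(mocap_frame)
            result = pipe(prompt, image=pose_image, num_inference_steps=20)
            frames.append(result.images[0])
        return frames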
Now, maybe human-like vision is not even necessary: unlike a human, there is nothing preventing your robot from seeing through external cameras placed around the house. Hell, there's even a good chance your robot's brain will live in a datacenter hundreds of miles away.