Comment by ygouzerh
The rate of progress on multimodal agents is impressive. OpenVLA was released in June 2024 and was state of the art at that time... 8 months later, on tasks like "Pick Place Hotdog Sausage", the success rate has gone from 2/10 to 6/10.
"Pick Place Hotdog Sausage" is such a bizarre name, though. Is it meant to be human readable? AI-readable? Just a label for the researchers? Same with "Put Mushroom Place Pot". As far as I can see both labels are only used in this Magma paper, nowhere else that Google can find.