Comment by Animats 13 hours ago

> From reading that, I'm not quite sure if they have anything figured out. I actually agree, but her notes are mostly fluff with no real info in there and I do wonder if they have anything figured out besides "collect spatial data" like ImageNet.

Right. I was thinking about this back in the 1990s. That resulted in a years-long detour through collision detection, physically based animation, solving stiff systems of nonlinear equations, and a way to do legged running over rough terrain. But nothing like "AI". More of a precursor to the analytical solutions of the early Boston Dynamics era.

Work today seems to throw vast amounts of compute at the problem and hope a learning system will come up with a useful internal representation of the spatial world. It's the "bitter lesson" approach. Maybe it will work. Robotic legged locomotion is pretty good now. Manipulation in unstructured situations still sucks. It's amazing how bad it is. There are videos of unstructured robot manipulation from McCarthy's lab at Stanford in the 1960s. They're not that much worse than videos today.

I used to make the comment, pre-LLM, that we needed to get to mouse/squirrel level intelligence rather than trying to get to human level abstract AI. But we got abstract AI first. That surprised me.

There's some progress in video generation which takes a short clip and extrapolates what happens next. That's a promising line of development. The key to "common sense" is being able to predict what happens next well enough to avoid big mistakes in the short term, a few seconds. How's that coming along? And what's the internal world model, assuming we even know?
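To make the idea concrete, here's a toy sketch of what "predicting what happens next" means at its simplest; this is my own made-up example (the linear model, the fake ball dynamics, and all the names are invented for illustration), not how any of these labs actually do it:

    # Toy sketch: "predict what happens next" framed as fitting a one-step
    # dynamics model to observed trajectories, then rolling it forward a few steps.
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic "world": a ball with position/velocity and slight drag (hidden from the model).
    def step(state):
        pos, vel = state[:2], state[2:]
        return np.concatenate([pos + 0.1 * vel, 0.98 * vel])

    # Collect short observed trajectories.
    states, nexts = [], []
    for _ in range(200):
        s = rng.normal(size=4)
        for _ in range(10):
            s_next = step(s)
            states.append(s)
            nexts.append(s_next)
            s = s_next
    X, Y = np.array(states), np.array(nexts)

    # Fit a linear one-step predictor Y ~ X @ A by least squares.
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)

    # "Common sense" in miniature: roll the learned model a few steps ahead
    # and compare against what the world actually does.
    s = rng.normal(size=4)
    predicted = s.copy()
    for t in range(5):
        predicted = predicted @ A      # model's guess
        s = step(s)                    # what actually happens
        print(f"t={t+1}  prediction error: {np.linalg.norm(predicted - s):.4f}")

The hard open question is what replaces the linear map when the "world" is pixels and contact physics instead of a four-dimensional toy state, and whether the learned internal representation is anything we can inspect.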

Earw0rm 11 hours ago

I share your surprise regarding LLMs. Is it fair to say that it's because language, especially formalised, written language, is a self-describing system?

A machine can infer the right (or expected) answer based on data; I'm not sure the same is true for how living things navigate the physical world. The "right" answer, insofar as one exists for your squirrel, is arguably Darwinian: "whatever keeps the little guy alive today".

nosianu 6 hours ago

> I used to make the comment, pre-LLM, that we needed to get to mouse/squirrel level intelligence rather than trying to get to human level abstract AI. But we got abstract AI first. That surprised me.

"AI" is not based on physical real world data and models like our brain. Instead, we chose to analyze human formal (written) communication. ("formal": actual face to face communication has tons of dimensions adding to the text representation of what is said, from tone, speed to whole body and facial expressions)

Bio-brains build a model from physical sensor data first and go from there; that's completely missing from "AI".

In hindsight, it's not surprising: we skipped that hard part (for now?). Working with symbols is what we've been doing with IT for a long time.

I'm not sure going all out on trying to base something on human intelligence, i.e. human neural networks, is a winning move. I see it as if we had been trying to create airplanes that flap their wings. For one, human intelligence already exists, and when you lean back and manage to look at how we do on small and large problems from an outside perspective, it has plenty of blind spots and disadvantages.

I'm afraid that if we were to manage hundred-percent human-level AI, we would be disappointed. Sure, it would be able to do a lot, but in the end, nothing we don't already have.

Right now that would also cover just the abstract parts. I think the "moving the body" physical parts, in relation to abstract commands, would be the far more interesting piece, but current AI is not about using physical sensor data at all, never mind combining it with the abstract stuff...

  • nycdatasci 3 hours ago

    You seem to be suggesting that current frontier models are only trained on text and not "sensor data". Multi-modal models are trained on the entire internet + vast amounts of synthetic data. Images and videos are key inputs. Camera sensors are capable of capturing much more "sensor data" than the human eye. Neural networks are the worst way to model intelligence, except for all the others.

    You may find this talk enlightening: https://simons.berkeley.edu/talks/ilya-sutskever-openai-2023...

    • nosianu an hour ago

      > You seem to be suggesting

      As soon as you start a response like that you should just stop. After all, this is written communication, and what I wrote is plain to see right there.

      When you need to start a response that way, you should become aware that you are not responding to what the other person actually wrote, but to your own ideas.

      There is no need to "interpret" what other people wrote.

      Relevant: https://i.imgur.com/Izrqp7d.jpeg

imtringued 3 hours ago

>There's some progress in video generation which takes a short clip and extrapolates what happens next. That's a promising line of development. The key to "common sense" is being able to predict what happens next well enough to avoid big mistakes in the short term, a few seconds. How's that coming along? And what's the internal world model, assuming we even know?

https://www.youtube.com/watch?v=udPY5rQVoW0

This has been a thing for a while. It's actually a funny way to demonstrate model-based control: replace the controller with a human.
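For anyone unfamiliar with the term, a minimal sketch of the idea (my own toy example with made-up dynamics and numbers, not the setup in the video): a model-based controller simulates candidate action sequences through a forward model and picks whichever is predicted to land nearest the target; "replacing the controller with a human" just means a person does that prediction-and-selection step instead.

    # Minimal random-shooting model-predictive control on a 1-D point mass.
    import numpy as np

    rng = np.random.default_rng(0)
    dt, horizon, n_candidates = 0.1, 10, 500
    target = np.array([1.0, 0.0])          # desired position, velocity

    def model(state, action):
        """Forward model: point mass, action is acceleration."""
        pos, vel = state
        return np.array([pos + dt * vel, vel + dt * action])

    def plan(state):
        """Sample action sequences, simulate each, keep the first action of the best one."""
        best_cost, best_action = np.inf, 0.0
        for _ in range(n_candidates):
            actions = rng.uniform(-1.0, 1.0, size=horizon)
            s = state
            for a in actions:
                s = model(s, a)
            cost = np.linalg.norm(s - target)
            if cost < best_cost:
                best_cost, best_action = cost, actions[0]
        return best_action

    state = np.array([0.0, 0.0])
    for t in range(40):
        state = model(state, plan(state))   # apply only the first planned action, then replan
    print("final state:", state)            # should end up near the target

The point of the demo is that all the intelligence lives in the forward model plus the "imagine a few futures, pick the best" loop; swapping the loop for a human makes that separation visible.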