toisanji 17 hours ago

From reading that, I'm not quite sure if they have anything figured out. I actually agree, but her notes are mostly fluff with no real info in there, and I do wonder if they have anything figured out besides "collect spatial data" like ImageNet.

There are actually a lot of people trying to figure out spatial intelligence, but those groups are usually in neuroscience or computational neuroscience. Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068

All animals are able to transform coordinates in real time to navigate their world and humans have the most coordinate representations of any known living animal. I believe human level intelligence is knowing when and how to transform these coordinate systems to extract useful information. I wrote this before the huge LLM explosion and I still personally believe it is the path forward.
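
To make the "coordinate transformation" idea concrete, here is a minimal sketch (illustrative only, not code from the paper) of an egocentric-to-allocentric transform: converting a landmark sensed relative to the animal's body into world coordinates, given the animal's position and heading.

    import numpy as np

    def egocentric_to_allocentric(agent_pos, heading, landmark_ego):
        """Map a landmark seen in the agent's body frame (x = ahead, y = left)
        into world (allocentric) coordinates."""
        c, s = np.cos(heading), np.sin(heading)
        rotation = np.array([[c, -s],
                             [s,  c]])  # rotate the body frame into the world frame
        return agent_pos + rotation @ landmark_ego

    # An agent at (3, 1) facing "north" (pi/2) sees a landmark 2 m straight ahead;
    # in world coordinates that landmark sits at (3, 3).
    print(egocentric_to_allocentric(np.array([3.0, 1.0]), np.pi / 2,
                                    np.array([2.0, 0.0])))

Grid cells and head-direction cells are commonly described as supplying exactly the position and heading estimates such a transform needs.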

Animats 11 hours ago

> From reading that, I'm not quite sure if they have anything figured out. I actually agree, but her notes are mostly fluff with no real info in there, and I do wonder if they have anything figured out besides "collect spatial data" like ImageNet.

Right. I was thinking about this back in the 1990s. That resulted in a years-long detour through collision detection, physically based animation, solving stiff systems of nonlinear equations, and a way to do legged running over rough terrain. But nothing like "AI". More of a precursor to the analytical solutions of the early Boston Dynamics era.

Work today seems to throw vast amounts of compute at the problem and hope a learning system will come up with a useful internal representation of the spatial world. It's the "bitter lesson" approach. Maybe it will work. Robotic legged locomotion is pretty good now. Manipulation in unstructured situations still sucks. It's amazing how bad it is. There are videos of unstructured robot manipulation from McCarthy's lab at Stanford in the 1960s. They're not that much worse than videos today.

I used to make the comment, pre-LLM, that we needed to get to mouse/squirrel level intelligence rather than trying to get to human level abstract AI. But we got abstract AI first. That surprised me.

There's some progress in video generation which takes a short clip and extrapolates what happens next. That's a promising line of development. The key to "common sense" is being able to predict what happens next well enough to avoid big mistakes in the short term, a few seconds. How's that coming along? And what's the internal world model, assuming we even know?

  • Earw0rm 10 hours ago

    I share your surprise regarding LLMs. Is it fair to say that it's because language - especially formalised, written language - is a self-describing system?

    A machine can infer the right (or expected) answer from data. I'm not sure the same is true for how living things navigate the physical world - the "right" answer, insofar as one exists for your squirrel, is arguably Darwinian: "whatever keeps the little guy alive today".

  • nosianu 5 hours ago

    > I used to make the comment, pre-LLM, that we needed to get to mouse/squirrel level intelligence rather than trying to get to human level abstract AI. But we got abstract AI first. That surprised me.

    "AI" is not based on physical real world data and models like our brain. Instead, we chose to analyze human formal (written) communication. ("formal": actual face to face communication has tons of dimensions adding to the text representation of what is said, from tone, speed to whole body and facial expressions)

    Bio-brains build a model from physical sensor data first and go from there; that's completely missing from "AI".

    In hindsight it's not surprising: we skipped that hard part (for now?). Working with symbols is what we've been doing with IT for a long time.

    I'm not sure going all out on trying to base something on human intelligence, i.e. human neural networks, is a winning move. I see it as if we had been trying to build airplanes that flap their wings. For one, human intelligence already exists, and when you lean back and manage to look at how we do on small and large problems from an outside perspective, it has plenty of blind spots and disadvantages.

    I'm afraid that if we were to manage a hundred-percent human-level AI, we would be disappointed. Sure, it would be able to do a lot, but in the end, nothing we don't already have.

    Right now that would also cover just the abstract parts. I think the "moving the body" physical parts, in relation to abstract commands, would be the far more interesting piece, but since current AI is not about using physical sensor data at all, never mind combining it with the abstract stuff...

    • nycdatasci 2 hours ago

      You seem to be suggesting that current frontier models are only trained on text and not "sensor data". Multi-modal models are trained on the entire internet + vast amounts of synthetic data. Images and videos are key inputs. Camera sensors are capable of capturing much more "sensor data" than the human eye. Neural networks are the worst way to model intelligence, except for all the other models.

      You may find this talk enlightening: https://simons.berkeley.edu/talks/ilya-sutskever-openai-2023...

  • imtringued 2 hours ago

    >There's some progress in video generation which takes a short clip and extrapolates what happens next. That's a promising line of development. The key to "common sense" is being able to predict what happens next well enough to avoid big mistakes in the short term, a few seconds. How's that coming along? And what's the internal world model, assuming we even know?

    https://www.youtube.com/watch?v=udPY5rQVoW0

    This has been a thing for a while. It's actually a funny way to demonstrate model-based control by replacing the controller with a human.

Marshferm an hour ago

To decipher whether there is anything like spatial intelligence, which is an oxymoronic term at most and redundant at least, one has to decipher the base units of the processes prior to their materialization in the allocortex, and assign a careful concatenated/parametric categorization of what is unitized, where the processes focus into thresholds, etc. This frontier propaganda and the few arXiv/Nature papers here are too synthetic to lead anywhere of merit.

bonsai_spool 17 hours ago

> Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068 All animals are able to transform coordinates in real time to navigate their world and humans have the most coordinate representations of any known living animal. I believe human level intelligence is knowing when and how to transform these coordinate systems to extract useful information.

Yes, you and the Mosers who won the Nobel Prize all believe that grid cells are the key to animals understanding their position in the world.

https://www.nobelprize.org/prizes/medicine/2014/press-releas...

  • Marshferm 17 hours ago

    It's not enough by a long shot. Placement isn't directly related to vicarious trial and error, path integration, or sequence generation.

    There's a whole giant gap between grid cells and intelligence.

    • teleforce 11 hours ago

      >There's a whole giant gap between grid cells and intelligence.

      Please check this recent article on the state machine in the hippocampus based on learning [1]. The findings support the long-standing proposal that sparse orthogonal representations are a powerful mechanism for memory and intelligence.

      [1] Learning produces an orthogonalized state machine in the hippocampus:

      https://www.nature.com/articles/s41586-024-08548-w
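
      As a rough illustration of why sparse, orthogonalized codes help memory (a sketch, not code from the paper): random high-dimensional sparse patterns are nearly orthogonal, so many of them can coexist with little interference.

          import numpy as np

          rng = np.random.default_rng(0)

          def sparse_code(n_units=10_000, active_frac=0.02):
              """Random sparse binary pattern: only ~2% of units are active."""
              code = np.zeros(n_units)
              active = rng.choice(n_units, int(n_units * active_frac), replace=False)
              code[active] = 1.0
              return code

          a, b = sparse_code(), sparse_code()
          cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
          # Close to zero: two random sparse codes barely overlap, which is the
          # low-interference property that "orthogonalized" representations buy you.
          print(f"cosine similarity: {cosine:.3f}")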

      • Marshferm 4 hours ago

        Of course, but the mechanisms "remain obscure". The entorhinal cortex is but a facet of this puzzle, and placement vs. head direction etc. must be understood beyond mere prediction. Too many essential parts are not understood, particularly the senses and emotion, which play the tinkering precursor role in evolutionary function and are excluded here, as is the likelihood that prediction and prediction error are mistaken precursor computational bottlenecks to unpredictability. Pushing AI into the 4% of a process materially identified as entorhinal is way premature.

        This approach simply follows suit with the blundering reverse engineering of the brain in cognitive science, where material properties are seen in isolation and processes are deduced piecemeal. The brain can only be understood as a whole first. See Rhythms of the Brain or Unlocking the Brain.

        There's a terrifying lack of curiosity in the paper you posted, a kind of smug synthetic rush to import code into a part of the brain that is one directory among directories, whose redundancies are a warning: we get along without it.

        Your view and theirs (the OSM) are too narrow, e.g. categorization is baked into the whole brain. How? This is one of thousands of processes that generalize materially across the entire brain. Isolating "learning" to the allocortex is incredibly misleading.

        https://www.cell.com/current-biology/fulltext/S0960-9822(25)...

hliyan 6 hours ago

I kept reading, waiting for a definition of spatial intelligence, but gave up after a few paragraphs. After years of reading VC-funded startup fluff, writing that contains these words tends to put me off now: transform, revolutionize, next frontier, North Star.

juliangamble 8 hours ago

Thanks for your article. The references section was interesting.

I'll add to the discussion a 2018 Nature letter: "Vector-based navigation using grid-like representations in artificial agents" https://www.nature.com/articles/s41586-018-0102-6

and a 2024 Scientific Reports article "Modeling hippocampal spatial cells in rodents navigating in 3D environments" https://www.nature.com/articles/s41598-024-66755-x

And a simulation in Github from 2018 https://github.com/google-deepmind/grid-cells

People have been looking at spatial awareness in neurology for quite a while (in terms of the timeframe of recent developments in LLMs).

ACCount37 3 hours ago

The question, as always, is: can we get any useful insights from all of that?

Trying to copy biological systems 1:1 rarely works, and copying biological systems doesn't seem to be required either. CNNs are somewhat brain-inspired, but only somewhat, and LLMs have very little architectural similarity to the human brain - other than being artificial neural networks.

Whatever functional similarity LLMs have to the human brain doesn't come from reverse-engineered details of how the human brain works - it comes from the training process.

  • Marshferm an hour ago

    There's nothing similar about LLMs and human brains. They're entirely divergent. Training a machine has nothing remotely to do with biological development.

    • ACCount37 23 minutes ago

      They perform incredibly similar functions. Thus, "functionally similar".

      • Marshferm 18 minutes ago

        There’s no functional similarity in the slightest. Notice you can’t cite examples.

byearthithatius 17 hours ago

This is super cool and I want to read up more on this, as I think you are right insofar as it is the basis for reasoning. However, it does seem more complex than just that. So how do we go from coordinate-system transformations to abstract reasoning with symbolic representations?

imtringued 2 hours ago

What I personally find amusing is this part:

>3. Interactive: World models can output the next states based on input actions

>Finally, if actions and/or goals are part of the prompt to a world model, its outputs must include the next state of the world, represented either implicitly or explicitly. When given only an action with or without a goal state as the input, the world model should produce an output consistent with the world’s previous state, the intended goal state if any, and its semantic meanings, physical laws, and dynamical behaviors. As spatially intelligent world models become more powerful and robust in their reasoning and generation capabilities, it is conceivable that in the case of a given goal, the world models themselves would be able to predict not only the next state of the world, but also the next actions based on the new state.

That's literally just an RNN (not a transformer). An RNN takes a previous state and an input and produces a new state. If you add a controller on top, it is called model predictive control. The most extreme form I have seen is temporal difference model predictive control (TD-MPC). [0]

[0] https://arxiv.org/abs/2203.04955
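
A toy sketch of that pattern (illustrative only; this is not TD-MPC itself): a hand-written transition function stands in for the learned world model, and a random-shooting MPC loop picks the sampled action sequence whose predicted rollout ends closest to the goal, then executes its first action.

    import numpy as np

    rng = np.random.default_rng(0)

    def world_model(state, action):
        """RNN-style step: previous state + action -> next state.
        (A hand-written point mass standing in for a learned model.)"""
        pos, vel = state
        vel = 0.9 * vel + 0.1 * action  # damped velocity update
        return np.array([pos + vel, vel])

    def mpc_action(state, goal, horizon=10, n_candidates=256):
        """Random-shooting MPC: sample action sequences, roll each through the
        world model, and return the first action of the best sequence."""
        best_cost, best_first = np.inf, 0.0
        for _ in range(n_candidates):
            actions = rng.uniform(-1.0, 1.0, horizon)
            s = state
            for a in actions:
                s = world_model(s, a)
            cost = abs(s[0] - goal)  # distance of predicted final position to goal
            if cost < best_cost:
                best_cost, best_first = cost, actions[0]
        return best_first

    # Drive the point mass from position 0 toward position 5, replanning each step.
    state, goal = np.array([0.0, 0.0]), 5.0
    for _ in range(30):
        state = world_model(state, mpc_action(state, goal))
    print(f"final position: {state[0]:.2f}")

Swapping mpc_action out for a human operator gives the kind of demo in the video above: the model predicts the next state, and a person supplies the actions.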

porphyra 14 hours ago

> if they have anything figured out besides "collect spatial data" like ImageNet

I mean, she launched her whole career with ImageNet, so you can hardly blame her for thinking that way. But on the other hand, there's something bitter-lesson-pilled about letting a model "figure out" spatial relationships just by looking at tons of data. And tbh the recent progress [1] of worldlabs.ai (Dr Fei-Fei Li's startup) looks quite promising for a model that understands stuff including reflections and stuff.

[1] https://www.worldlabs.ai/blog/rtfm

  • godelski 14 hours ago

      > looks quite promising for a model that understands stuff including reflections and stuff.
    
    I got the opposite impression when trying their demo...[0]. Even in their examples some of these issues exist, like how objects stay a constant size despite moving, as if parallax or depth information were missing. Not to mention that they show it walking on water lol

    As for reflections, I don't get that impression either. They seem extremely brittle to movement.

    [0] http://0x0.st/K95T.png