Spatial intelligence is AI’s next frontier

(drfeifei.substack.com)

211 points by mkirchner 17 hours ago

toisanji 16 hours ago

From reading that, I'm not quite sure if they have anything figured out. I actually agree, but her notes are mostly fluff with no real info in there and I do wonder if they have anything figured out besides "collect spatial data" like imagenet.

There are actually a lot of people trying to figure out spatial intelligence, but those groups are usually in neuroscience or computational neuroscience. Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068 All animals are able to transform coordinates in real time to navigate their world and humans have the most coordinate representations of any known living animal. I believe human level intelligence is knowing when and how to transform these coordinate systems to extract useful information. I wrote this before the huge LLM explosion and I still personally believe it is the path forward.

Reply View 19 replies

Animats 9 hours ago

> From reading that, I'm not quite sure if they have anything figured out. I actually agree, but her notes are mostly fluff with no real info in there and I do wonder if they have anything figured out besides "collect spatial data" like imagenet.
Right. I was thinking about this back in the 1990s. That resulted in a years-long detour through collision detection, physically based animation, solving stiff systems of nonlinear equations, and a way to do legged running over rough terrain. But nothing like "AI". More of a precursor to the analytical solutions of the early Boston Dynamics era.
Work today seems to throw vast amounts of compute at the problem and hope a learning system will come up with a useful internal representation of the spatial world. It's the "bitter lesson" approach. Maybe it will work. Robotic legged locomotion is pretty good now. Manipulation in unstructured situations still sucks. It's amazing how bad it is. There are videos of unstructured robot manipulation from McCarthy's lab at Stanford in the 1960s. They're not that much worse than videos today.
I used to make the comment, pre-LLM, that we needed to get to mouse/squirrel level intelligence rather than trying to get to human level abstract AI. But we got abstract AI first. That surprised me.
There's some progress in video generation which takes a short clip and extrapolates what happens next. That's a promising line of development. The key to "common sense" is being able to predict what happens next well enough to avoid big mistakes in the short term, a few seconds. How's that coming along? And what's the internal world model, assuming we even know?

Reply View | 5 replies
- imtringued 2 minutes ago
  
  >There's some progress in video generation which takes a short clip and extrapolates what happens next. That's a promising line of development. The key to "common sense" is being able to predict what happens next well enough to avoid big mistakes in the short term, a few seconds. How's that coming along? And what's the internal world model, assuming we even know?
  https://www.youtube.com/watch?v=udPY5rQVoW0
  This has been a thing for a while. It's actually a funny way to demonstrate model based control by replacing the controller with a human.
  
  Reply View | 0 replies
- Earw0rm 8 hours ago
  
  I share your surprise regarding LLM, is it fair to say that it's because language - especially formalised, written language - is a self-describing system.
  A machine can infer the right (or expected) answer based on data, I'm not sure that the same is true for how living things navigate the physical world - the "right" answer, such as one exists for your squirrel, is arguably Darwinian: "whatever keeps the little guy alive today".
  
  Reply View | 0 replies
- nosianu 3 hours ago
  
  > I used to make the comment, pre-LLM, that we needed to get to mouse/squirrel level intelligence rather than trying to get to human level abstract AI. But we got abstract AI first. That surprised me.
  "AI" is not based on physical real world data and models like our brain. Instead, we chose to analyze human formal (written) communication. ("formal": actual face to face communication has tons of dimensions adding to the text representation of what is said, from tone, speed to whole body and facial expressions)
  Bio-brains have a model based on physical sensor data first and go from there, that's completely missing from "AI".
  In hindsight, it's not surprising, we skipped that hard part (for now?). Working with symbols is what we've been doing with IT for a long time.
  I'm not sure going all out on trying to base something on human intelligence, i.e. human neuro networks, is a winning move. I see it as if we had been trying to create airplanes that flap their wings. For one, human intelligence already exists, and when you lean back and manage to look at how we do on small and large problems from an outside perspective it has plenty of blind spots and disadvantages.
  I'm afraid if we were to manage a hundred percent human level intelligence AI we will be disappointed. Sure, it will be able to do a lot, but in the end, nothing we don't already have.
  Right now that would also just be the abstract parts, I think the "moving the body" physical parts in relation to abstract commands would be the far more interesting part, but since current AI is not about using physical sensor data at all, never mind combining it with the abstract stuff...
  
  Reply View | 1 reply
  
  nycdatasci 7 minutes ago
  
  You seem to be suggesting that current frontier models are only trained on text and not "sensor data". Multi-modal models are trained on the entire internet + vast amounts of synthetic data. Images and videos are key inputs. Camera sensors are capable of capturing much more "sensor data" than the human eye. Neural networks are the worst way to model intelligence, except all other models.
  You may find this talk enlightening: https://simons.berkeley.edu/talks/ilya-sutskever-openai-2023...
  
  Reply View | 0 replies
- dingnuts 6 hours ago
  
  [dead]
  
  Reply View | 0 replies
hliyan 5 hours ago

I kept reading, waiting for a definition of spatial intelligence, but gave up after a few paragraphs. After years of reading VC-funded startup fluff, writing that contain these words tend to put me off now: transform, revolutionize, next frontier, North Star.

Reply View | 0 replies
bonsai_spool 16 hours ago

> Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068 All animals are able to transform coordinates in real time to navigate their world and humans have the most coordinate representations of any known living animal. I believe human level intelligence is knowing when and how to transform these coordinate systems to extract useful information.
Yes, you and the Mosers who won the Nobel Prize all believe that grid cells are the key to animals understanding their position in the world.
https://www.nobelprize.org/prizes/medicine/2014/press-releas...

Reply View | 3 replies
- Marshferm 15 hours ago
  
  It's not enough by a long shot. Placement isn't related directly to vicarious trial and error, path integrations, sequence generation.
  There's a whole giant gap between grid cells and intelligence.
  
  Reply View | 2 replies
  
  teleforce 9 hours ago
  
  >There's a whole giant gap between grid cells and intelligence.
  Please check this recent article on the state machine in the hippocampus based on learning [1]. The findings support the long-standing proposal that sparse orthogonal representations are a powerful mechanism for memory and intelligence.
  [1] Learning produces an orthogonalized state machine in the hippocampus:
  https://www.nature.com/articles/s41586-024-08548-w
  
  Reply View | 1 reply
  
  Marshferm 2 hours ago
  
  Of course, but the mechanisms “remain obscure”. The entorhinal cortex is but a facet of this puzzle and placement vs head direction etc must be understood beyond mere prediction. There are too many essential parts that are not understood particularly senses and emotion which play the tinkering precursors to evolutionary function that are excluded now as well as the likelyhood that prediction error and prediction are but mistaken precursor computational bottlenecks to unpredictability. Pushing AI into the 4% of a process materially identified as entorhinal is way premature.
  This approach simply follows suit with the blundering reverse engineering of the brain in cog sci where material properties are seen in isolation and processes are deduced piecemeal. The brain can only be understood as a whole first. See rhythms of the brain or unlocking the brain.
  There’s a terrifying lack of curiosity in the paper you posted, a kind of smug synthetic rush to import code into a part of the brain that’s a directory among directories that has redundancies as a warning: we get along without this.
  Your and their view (OSM) is too narrow. eg categorization is baked into the whole brain. How? This is one of 1000s of processes that generalize materially across the entire brain. Isolating "learning" to the allocortex is incredibly misleading.
  https://www.cell.com/current-biology/fulltext/S0960-9822(25)...
  
  Reply View | 0 replies
ACCount37 2 hours ago

The question, as always, is: can we get any useful insights from all of that?
Trying to copy biological systems 1:1 rarely works, and copying biological systems doesn't seem to be required either. CNNs are somewhat brain-inspired, but only somewhat, and LLMs have very little architectural similarity to human brain - other than being an artificial neural network.
This functional similarity of LLMs to the human brain doesn't come from reverse engineered details of how the human brain works - it comes from the training process.

Reply View | 0 replies
imtringued 10 minutes ago

What I personally find amusing is this part:
>3. Interactive: World models can output the next states based on input actions
>Finally, if actions and/or goals are part of the prompt to a world model, its outputs must include the next state of the world, represented either implicitly or explicitly. When given only an action with or without a goal state as the input, the world model should produce an output consistent with the world’s previous state, the intended goal state if any, and its semantic meanings, physical laws, and dynamical behaviors. As spatially intelligent world models become more powerful and robust in their reasoning and generation capabilities, it is conceivable that in the case of a given goal, the world models themselves would be able to predict not only the next state of the world, but also the next actions based on the new state.
That's literally just an RNN (not a transformer). An RNN takes a previous state and an input and produces a new state. If you add a controller on top, it is called model predictive control. The most extreme form I have seen is temporal difference model predictive control (TD-MPC). [0]
[0] https://arxiv.org/abs/2203.04955

Reply View | 0 replies
juliangamble 7 hours ago

Thanks for your article. The references section was interesting.
I'll add to the discussion a 2018 Nature letter: "Vector-based navigation using grid-like representations in artificial agents" https://www.nature.com/articles/s41586-018-0102-6
and a 2024 Nature article "Modeling hippocampal spatial cells in rodents navigating in 3D environments" https://www.nature.com/articles/s41598-024-66755-x
And a simulation in Github from 2018 https://github.com/google-deepmind/grid-cells
People have been looking at spacial awareness in neurology for quite a while. (In terms of the timeframe of recent developments in LLMs).

Reply View | 0 replies
byearthithatius 15 hours ago

This is super cool and I want to read up more on this as I think you are right insofar as it is the basis for reasoning. However it does seem more complex than just that. So how do we go from coordinate system transformations to abstract reasoning with symbolic representations?

Reply View | 2 replies
- toisanji 15 hours ago
  
  There is research showing that the grid cells also represent abstract reasoning: https://pmc.ncbi.nlm.nih.gov/articles/PMC5248972/
  Deep Mind also did a paper with grid cells a while ago: https://deepmind.google/blog/navigating-with-grid-like-repre...
  
  Reply View | 0 replies
- [removed] 15 hours ago
  
  [deleted]
  
  Reply View | 0 replies
porphyra 13 hours ago

> if they have anything figured out besides "collect spatial data" like imagenet
I mean she launched her whole career with imagenet so you can hardly blame her for thinking that way. But on the other hand, there's something bitter lesson-pilled about letting a model "figure out" spatial relationships just by looking at tons of data. And tbh the recent progress [1] of worldlabs.ai (Dr Fei Fei Li's startup) looks quite promising for a model that understands stuff including reflections and stuff.
[1] https://www.worldlabs.ai/blog/rtfm

Reply View | 1 reply
- godelski 12 hours ago
  
  > looks quite promising for a model that understands stuff including reflections and stuff.
  I got the opposite impression when trying their demo...[0]. Even in their examples some of these issues exist like how objects stay a constant size despite moving. Like missing the parallax or depth information. Not to mention that they show it walking on water lol
  As for reflections, I don't get that impression either. They seem extremely brittle to movement.
  [0] http://0x0.st/K95T.png
  
  Reply View | 0 replies

jandrewrogers 16 hours ago

This is essentially a simulation system for operating on narrowly constrained virtual worlds. It is pretty well-understood that these don't translate to learning non-trivial dynamics in the physical world, which is where most of the interesting applications are.

While virtual world systems and physical world systems look similar based on description, a bit like chemistry and chemical engineering, they are largely unrelated problems with limited theory overlap. A virtual world model is essentially a special trivial case that becomes tractable because it defines away most of the hard computer science problems in physical world models.

A good argument could be made that spatial intelligence is a critical frontier for AI, many open problems are reducible to this. I don't see any evidence that this company is positioned to make material progress on it.