toisanji 16 hours ago

From reading that, I'm not quite sure they have anything figured out. I actually agree, but her notes are mostly fluff with no real info, and I do wonder if they have anything figured out besides "collect spatial data" like ImageNet.

There are actually a lot of people trying to figure out spatial intelligence, but those groups are usually in neuroscience or computational neuroscience. Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068 All animals are able to transform coordinates in real time to navigate their world, and humans have the most coordinate representations of any known living animal. I believe human-level intelligence is knowing when and how to transform these coordinate systems to extract useful information. I wrote this before the huge LLM explosion, and I still personally believe it is the path forward.
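
To make the transformation idea concrete, here is a toy sketch (illustrative only, not from the paper): mapping an egocentric observation into a shared world frame given the agent's pose.

  import numpy as np

  def ego_to_world(agent_xy, agent_heading, target_range, target_bearing):
      # Egocentric (range, bearing) reading -> allocentric world coordinates.
      angle = agent_heading + target_bearing          # absolute direction
      offset = target_range * np.array([np.cos(angle), np.sin(angle)])
      return np.asarray(agent_xy) + offset

  # An agent at (2, 3) facing +y sees a landmark 5 units away, 45 deg to its left.
  print(ego_to_world((2.0, 3.0), np.pi / 2, 5.0, np.pi / 4))

Animals presumably chain many such transforms (eye-to-head, head-to-body, body-to-world) in real time; the claim above is about knowing when and how to pick them.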

  • Animats 9 hours ago

    > From reading that, I'm not quite sure they have anything figured out. I actually agree, but her notes are mostly fluff with no real info, and I do wonder if they have anything figured out besides "collect spatial data" like ImageNet.

    Right. I was thinking about this back in the 1990s. That resulted in a years-long detour through collision detection, physically based animation, solving stiff systems of nonlinear equations, and a way to do legged running over rough terrain. But nothing like "AI". More of a precursor to the analytical solutions of the early Boston Dynamics era.

    Work today seems to throw vast amounts of compute at the problem and hope a learning system will come up with a useful internal representation of the spatial world. It's the "bitter lesson" approach. Maybe it will work. Robotic legged locomotion is pretty good now. Manipulation in unstructured situations still sucks. It's amazing how bad it is. There are videos of unstructured robot manipulation from McCarthy's lab at Stanford in the 1960s. They're not that much worse than videos today.

    I used to make the comment, pre-LLM, that we needed to get to mouse/squirrel level intelligence rather than trying to get to human level abstract AI. But we got abstract AI first. That surprised me.

    There's some progress in video generation which takes a short clip and extrapolates what happens next. That's a promising line of development. The key to "common sense" is being able to predict what happens next well enough to avoid big mistakes in the short term, a few seconds. How's that coming along? And what's the internal world model, assuming we even know?

    • imtringued 2 minutes ago

      >There's some progress in video generation which takes a short clip and extrapolates what happens next. That's a promising line of development. The key to "common sense" is being able to predict what happens next well enough to avoid big mistakes in the short term, a few seconds. How's that coming along? And what's the internal world model, assuming we even know?

      https://www.youtube.com/watch?v=udPY5rQVoW0

      This has been a thing for a while. It's actually a funny way to demonstrate model-based control by replacing the controller with a human.

    • Earw0rm 8 hours ago

      I share your surprise regarding LLMs. Is it fair to say that it's because language - especially formalised, written language - is a self-describing system?

      A machine can infer the right (or expected) answer based on data. I'm not sure the same is true for how living things navigate the physical world - the "right" answer, insofar as one exists for your squirrel, is arguably Darwinian: "whatever keeps the little guy alive today".

    • nosianu 3 hours ago

      > I used to make the comment, pre-LLM, that we needed to get to mouse/squirrel level intelligence rather than trying to get to human level abstract AI. But we got abstract AI first. That surprised me.

      "AI" is not based on physical real world data and models like our brain. Instead, we chose to analyze human formal (written) communication. ("formal": actual face to face communication has tons of dimensions adding to the text representation of what is said, from tone, speed to whole body and facial expressions)

      Bio-brains build a model from physical sensor data first and go from there; that's completely missing from "AI".

      In hindsight, it's not surprising: we skipped that hard part (for now?). Working with symbols is what we've been doing with IT for a long time.

      I'm not sure going all out on trying to base something on human intelligence, i.e. human neural networks, is a winning move. I see it as if we had been trying to create airplanes that flap their wings. For one, human intelligence already exists, and when you lean back and manage to look at how we do on small and large problems from an outside perspective, it has plenty of blind spots and disadvantages.

      I'm afraid that if we managed a hundred-percent human-level intelligence AI, we would be disappointed. Sure, it would be able to do a lot, but in the end, nothing we don't already have.

      Right now that would also just be the abstract parts. I think the "moving the body" physical parts, in relation to abstract commands, would be the far more interesting part, but since current AI is not about using physical sensor data at all, never mind combining it with the abstract stuff...

      • nycdatasci 7 minutes ago

        You seem to be suggesting that current frontier models are only trained on text and not "sensor data". Multi-modal models are trained on the entire internet + vast amounts of synthetic data. Images and videos are key inputs. Camera sensors are capable of capturing much more "sensor data" than the human eye. Neural networks are the worst way to model intelligence, except all other models.

        You may find this talk enlightening: https://simons.berkeley.edu/talks/ilya-sutskever-openai-2023...

  • hliyan 5 hours ago

    I kept reading, waiting for a definition of spatial intelligence, but gave up after a few paragraphs. After years of reading VC-funded startup fluff, writing that contains these words tends to put me off now: transform, revolutionize, next frontier, North Star.

  • bonsai_spool 16 hours ago

    > Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068 All animals are able to transform coordinates in real time to navigate their world, and humans have the most coordinate representations of any known living animal. I believe human-level intelligence is knowing when and how to transform these coordinate systems to extract useful information.

    Yes, you and the Mosers who won the Nobel Prize all believe that grid cells are the key to animals understanding their position in the world.

    https://www.nobelprize.org/prizes/medicine/2014/press-releas...

    • Marshferm 15 hours ago

      It's not enough by a long shot. Placement isn't related directly to vicarious trial and error, path integration, or sequence generation.

      There's a whole giant gap between grid cells and intelligence.

      • teleforce 9 hours ago

        >There's a whole giant gap between grid cells and intelligence.

        Please check this recent article on the state machine in the hippocampus based on learning [1]. The findings support the long-standing proposal that sparse orthogonal representations are a powerful mechanism for memory and intelligence.

        [1] Learning produces an orthogonalized state machine in the hippocampus:

        https://www.nature.com/articles/s41586-024-08548-w

        • Marshferm 2 hours ago

          Of course, but the mechanisms “remain obscure”. The entorhinal cortex is but a facet of this puzzle, and placement vs. head direction etc. must be understood beyond mere prediction. There are too many essential parts that are not understood - particularly the senses and emotion, which play the tinkering precursors to evolutionary function and are excluded now - as well as the likelihood that prediction error and prediction are but mistaken precursor computational bottlenecks to unpredictability. Pushing AI into the 4% of a process materially identified as entorhinal is way premature.

          This approach simply follows suit with the blundering reverse engineering of the brain in cog sci, where material properties are seen in isolation and processes are deduced piecemeal. The brain can only be understood as a whole first. See Rhythms of the Brain or Unlocking the Brain.

          There’s a terrifying lack of curiosity in the paper you posted, a kind of smug synthetic rush to import code into a part of the brain that’s a directory among directories that has redundancies as a warning: we get along without this.

          Your and their view (OSM) is too narrow. E.g., categorization is baked into the whole brain. How? This is one of thousands of processes that generalize materially across the entire brain. Isolating "learning" to the allocortex is incredibly misleading.

          https://www.cell.com/current-biology/fulltext/S0960-9822(25)...

  • ACCount37 2 hours ago

    The question, as always, is: can we get any useful insights from all of that?

    Trying to copy biological systems 1:1 rarely works, and copying biological systems doesn't seem to be required either. CNNs are somewhat brain-inspired, but only somewhat, and LLMs have very little architectural similarity to the human brain - other than being an artificial neural network.

    What functional similarity LLMs do have to the human brain doesn't come from reverse-engineered details of how the brain works - it comes from the training process.

  • imtringued 10 minutes ago

    What I personally find amusing is this part:

    >3. Interactive: World models can output the next states based on input actions

    >Finally, if actions and/or goals are part of the prompt to a world model, its outputs must include the next state of the world, represented either implicitly or explicitly. When given only an action with or without a goal state as the input, the world model should produce an output consistent with the world’s previous state, the intended goal state if any, and its semantic meanings, physical laws, and dynamical behaviors. As spatially intelligent world models become more powerful and robust in their reasoning and generation capabilities, it is conceivable that in the case of a given goal, the world models themselves would be able to predict not only the next state of the world, but also the next actions based on the new state.

    That's literally just an RNN (not a transformer): an RNN takes a previous state and an input and produces a new state. If you add a controller on top, it's called model predictive control. The most extreme form I have seen is temporal difference model predictive control (TD-MPC). [0]

    [0] https://arxiv.org/abs/2203.04955
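
    For intuition, here is a minimal sketch of that pattern (a toy 1-D point mass with a random-shooting planner, not TD-MPC itself; all constants are illustrative): a recurrent world model s' = f(s, a) plus a controller that plans through it.

      import numpy as np

      rng = np.random.default_rng(0)

      def step(state, action):
          # "World model": 1-D point mass under a force, Euler-integrated.
          pos, vel = state
          vel = vel + 0.1 * action
          return np.array([pos + 0.1 * vel, vel])

      def mpc(state, goal, horizon=15, candidates=256):
          # Sample action sequences, roll each through the model, keep the best.
          best_cost, best_first = np.inf, 0.0
          for _ in range(candidates):
              seq = rng.uniform(-1, 1, horizon)
              s = state
              for a in seq:
                  s = step(s, a)
              cost = (s[0] - goal) ** 2 + 0.1 * s[1] ** 2
              if cost < best_cost:
                  best_cost, best_first = cost, seq[0]
          return best_first   # execute only the first action, then replan

      state, goal = np.array([0.0, 0.0]), 1.0
      for _ in range(100):
          state = step(state, mpc(state, goal))
      print(f"final position {state[0]:.3f} (goal {goal})")

    Swapping the hand-written step() for a learned network is what turns this into the world-model setup the quoted passage describes.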

  • juliangamble 7 hours ago

    Thanks for your article. The references section was interesting.

    I'll add to the discussion a 2018 Nature letter: "Vector-based navigation using grid-like representations in artificial agents" https://www.nature.com/articles/s41586-018-0102-6

    and a 2024 Scientific Reports article, "Modeling hippocampal spatial cells in rodents navigating in 3D environments": https://www.nature.com/articles/s41598-024-66755-x

    And a simulation on GitHub from 2018: https://github.com/google-deepmind/grid-cells

    People have been looking at spatial awareness in neurology for quite a while (relative to the timeframe of recent developments in LLMs).

  • byearthithatius 15 hours ago

    This is super cool, and I want to read up more on this: I think you are right insofar as it is the basis for reasoning. However, it does seem more complex than just that. So how do we go from coordinate system transformations to abstract reasoning with symbolic representations?

  • porphyra 13 hours ago

    > if they have anything figured out besides "collect spatial data" like ImageNet

    I mean, she launched her whole career with ImageNet, so you can hardly blame her for thinking that way. But on the other hand, there's something bitter-lesson-pilled about letting a model "figure out" spatial relationships just by looking at tons of data. And tbh the recent progress [1] of worldlabs.ai (Dr. Fei-Fei Li's startup) looks quite promising for a model that understands stuff, including reflections.

    [1] https://www.worldlabs.ai/blog/rtfm

    • godelski 12 hours ago

        > looks quite promising for a model that understands stuff, including reflections.
      
      I got the opposite impression when trying their demo...[0]. Even in their examples some of these issues show up, like objects staying a constant size despite moving, as if parallax or depth information were missing. Not to mention that they show it walking on water lol

      As for reflections, I don't get that impression either. They seem extremely brittle to movement.

      [0] http://0x0.st/K95T.png

jandrewrogers 16 hours ago

This is essentially a simulation system for operating on narrowly constrained virtual worlds. It is pretty well-understood that these don't translate to learning non-trivial dynamics in the physical world, which is where most of the interesting applications are.

While virtual world systems and physical world systems look similar based on description, a bit like chemistry and chemical engineering, they are largely unrelated problems with limited theory overlap. A virtual world model is essentially a special trivial case that becomes tractable because it defines away most of the hard computer science problems in physical world models.

A good argument could be made that spatial intelligence is a critical frontier for AI, many open problems are reducible to this. I don't see any evidence that this company is positioned to make material progress on it.

inciampati 15 hours ago

Just had a fantastic experience applying agentic coding to CAD. I needed to add some threads to a few blanks in a 3D print. I used computational geometry to give the agent a way to "feel" around the model: I had it convolve a sphere of the connector's radius across the entire model. It was able to use this technique to find the precise positions of the existing ports and then add threads to them. It took a few tries to get right, but if I'd had the technique in mind beforehand it would have been very quick. The lesson for me is that the models need a way to feel. In the end, the implementation of the 3D model had to be written in code, where it's auditable. Perhaps if the agent had been able to see images directly and perfectly, I never would have made this discovery.
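
For the curious, the sphere "feeler" can be sketched in a few lines with the trimesh library (a rough reconstruction under assumptions, not the actual script; the file name, grid resolution, and radius are invented):

  import numpy as np
  import trimesh

  mesh = trimesh.load("model.stl")   # hypothetical print blank
  radius = 3.0                       # connector radius in mm, made up

  # Probe a coarse grid of points over the model's bounding box.
  lo, hi = mesh.bounds
  axes = [np.linspace(l, h, 25) for l, h in zip(lo, hi)]
  grid = np.stack(np.meshgrid(*axes, indexing="ij"), -1).reshape(-1, 3)

  # trimesh reports points outside the solid with negative signed distance.
  sdf = trimesh.proximity.signed_distance(mesh, grid)

  # The sphere "fits" wherever free space is at least `radius` deep;
  # clustering such centers adjacent to the surface reveals bores/ports.
  fits = grid[(sdf < 0) & (np.abs(sdf) >= radius)]
  print(len(fits), "candidate sphere centers")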

  • alexose 13 hours ago

    Generative CAD has incredible potential. I've had some decent results with OpenSCAD, but it's clear that current models don't have much "common sense" when it comes to how shapes connect.

    If code-based CAD tools were more common, and we had a bigger corpus to pull from, these tools would probably be pretty usable. Without this, however, it seems like we'll need to train against simulations of the physical world.

    • [removed] 12 hours ago
      [deleted]
  • t_mann 14 hours ago

    CadQuery? A writeup of your lessons learned would be appreciated, if you're so inclined.

  • JohnHammersley 14 hours ago

    Thanks for sharing. I'm interested to know more about how you did this - do you have a longer write-up somewhere? (Or are you considering writing one?)

  • btbuildem 14 hours ago

    I'd love to hear more about this -- I'm messing around with a generative approach to 3D objects

  • mkoubaa 12 hours ago

    Unlike with a typical LLM prompt, it's REALLY hard to describe the end result of a geometric object in text.

    "No put the thingy over there. Not that thingy!"

    • nfg 8 hours ago

      I’m not really suggesting it’s the right approach for CAD, but prompting UI changes using sketches or mockup images works great.

in-silico 15 hours ago

Genie 3 (at a prototype level) achieves the goal she describes: a controllable world model with consistency and realistic physics. Its sibling Veo 3 even demonstrates some spatial problem-solving ability: https://video-zero-shot.github.io/. Genie and Veo are definitely closer to her vision than anything World Labs has released publicly.

However, she does not mention Google's models at all. This omission makes the blog feel very much like an ad for her company rather than a good-faith guide for the field.

voxleone 2 hours ago

>>Spatial Intelligence is the scaffolding upon which our cognition is built.

Human cognition isn’t built on abstract reasoning alone. It’s embodied, grounded in sensation.

Evolution didn’t achieve generalization across domains by making brains more symbolic. It did so by making them more integrated, fusing chemical gradients, touch, proprioception, light, sound, temperature, and pressure into one continuous internal narrative.

Intelligence does not seem to be an algorithmic property; it’s a felt coherence across senses. Our reasoning emerges from a complex interaction of sensory information, memory, emotions, and cognitive processing. Sensory completeness is the way forward.

jacquesm 15 hours ago

I think I perceive a massive bottleneck. Today's incarnation of AI learns from the web, not from its interactions with the humans it talks to. And for sure there is a lot of value there; it is just pointless to see that interaction lost a few hundred or thousand words of context later. For humans, their 'context' is their life and total memory capacity; that's why we learn from interaction with other, more experienced humans. It is always a two-way street. But with AI as it is, it's a one-way street, which means your interaction and your endless corrections when it gets stuff wrong (again) are lost. Allowing for a personalized massive context would go a long way towards improving the value here; at least that way you - hopefully - only have to make the same correction once.

  • tim333 14 hours ago

    Google Research put out something the other day on a possible way around that, called Nested Learning: https://research.google/blog/introducing-nested-learning-a-n...

    My understanding is that at the moment you train something like ChatGPT on the web, setting weights with backpropagation until it works well, but if you feed it more info and do more backprop, it can forget other stuff it's learned - so-called 'catastrophic forgetting'. The nested learning approach splits things into a number of smaller models so you can retrain one without mucking up the others.
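
    A toy illustration of the forgetting itself (a one-parameter model; nothing to do with Nested Learning's actual architecture): fit task A, keep training on task B, and task A is gone.

      import numpy as np

      rng = np.random.default_rng(0)

      def sgd(w, x, y, steps=500, lr=0.1):
          for _ in range(steps):
              i = rng.integers(len(x))
              w -= lr * 2 * (w * x[i] - y[i]) * x[i]   # d/dw of squared error
          return w

      x = np.linspace(-1, 1, 50)
      w = sgd(0.0, x, 2.0 * x)                  # task A: y = 2x
      print("after task A, w ~", round(w, 2))   # ~ 2.0
      w = sgd(w, x, -1.0 * x)                   # task B: y = -x
      print("after task B, w ~", round(w, 2))   # ~ -1.0: task A forgotten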

  • halfcat 13 hours ago

    > For humans their 'context' is their life and total memory capacity

    And some number of billions of years of evolutionary progress.

    Whatever spatial understanding we have could be thought of as a simulation at a quantum level, the size of the universe, for billions of years.

    And what can we simulate completely at a quantum level today? Atoms or single cells?

    • jacquesm 11 hours ago

      Idealized atoms and very, very simplified single cells.

gnarlouse 9 hours ago

This article has me thinking about “the human capacity to outthink nature, and the scalability of this.” The wheel is sort of the first time I think man outthought nature: nature is inherently bumpy and noisy, and while rolling is certainly a great form of locomotion, it's not reliable. When man figured out how to make long tracts of flat land (roads), we outthought nature. In some sense you could argue that our entire trajectory through science and technology, supported by the scientific method, is another example: nature sort of sucks at persisting high-level pattern intuition from one generation to the next - basically anything beyond genes.

I keep going back and forth on whether I think “super-intelligence” is achievable in any form other than speed-super-intelligence, but I definitely think that being able to think capably in three dimensions will be a major building block for AI outthinking man, and outthinking nature.

Sort of a shitpost.

  • djtango 9 hours ago

    The human body is an organised system of cells contributing to a greater whole - is there much difference between blood vessels designed for the efficient transport of key resources and messengers across the body and roads that carry key resources and messengers across a landmass?

    In that sense has nature just replicated its ability to organise but at the species level on a planetary (interplanetary soon) scale?

    Why are humans above nature...?

    • gnarlouse 8 hours ago

      Fair, I mean I also love the argument that there's really no difference between “the manmade world” and “the natural world” because the former is entirely composed of parts stripped from or chemically altered from the latter. So yes, nature has absolutely replicated its ability to organize at the species level through human ingenuity.

      Humans are maybe separate from nature primarily on the basis of our attempts (of varying success) to steer, structure, and optimize the organization of nature around us, and knowing how to do so is not an explicit aspect of reality - or at least it did not make itself known to early humans, so it's reasonable to believe it's not explicit. By that I mean you're not born with any inherent knowledge of the inner workings of quantum gravity, or of the Navier-Stokes equations, or any of the tooling that supports them, but clearly these models exist and evolve tangibly around us in every moment. We found something nature hid from the DNA-based biological tree of life, and exploited it to great effect.

      Again, this is a colossal shitpost.

atlex2 11 hours ago

I think spatial tokens could help, but they're not really necessary. Lots of physics/physical tasks can be solved with pencil and paper.

On the other hand, it's amazing that a 512x512 image can be represented by 85 tokens (as in OAI's API), or 263 tokens per second for video (with Gemini). It's as if the memory vs compute tradeoff has morphed into a memory vs embedding question.

This dichotomy reminds me of the "apple rotator" question: can you rotate an apple in your head? Spatial embeddings will likely solve dynamics questions a lot more intuitively (i.e., without extended thinking).

We're also working in this space at FlyShirley - training pilots to fly, then training Shirley to fly - where we benefit from established simulation tools. Looking forward to trying Fei-Fei's models!

godelski 12 hours ago

I think a lot of people are really bad at evaluating world models. Fei-Fei is right here that they are multimodal, but really they must codify a physics. I don't mean "physics" but "a physics". I also think it's naïve to think this can be done through data alone. I mean, just ask a physicist...[0].

But the reason people are really bad at evaluating them is that the details dominate. What matters here is consistency. We need invariance to some things and equivariance to others. As evaluators we tend to be hopeful, so the subtle changes from frame to frame get overlooked, though that's kind of the most important part. It can't just be similar to the last frame; it needs to be exactly the same. You need equivariance to translation, yet that's still not happening in any of these models (and it's not a limitation of attention or transformers). You're just going to have a really hard time getting all this data, even though by collecting it you'll look like you're progressing, because you're fitting it better. But in the end the models will need to create some compact formulation representing concepts such as motion - in other words, a physics. And it's not like physicists aren't known for being detail-oriented and nitpicky over nuances; that is bred into them with good reason.

[0] https://m.youtube.com/watch?v=hV41QEKiMlM
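
As a concrete example of the equivariance being asked for (a toy numpy check, not a claim about any particular model): circular convolution commutes with translation, so shifting the input shifts the output identically.

  import numpy as np

  rng = np.random.default_rng(0)
  signal = rng.normal(size=64)
  kernel = rng.normal(size=5)

  def conv(x):
      # Circular convolution via FFT, so shifts wrap around cleanly.
      return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(kernel, len(x))))

  shift = 7
  a = conv(np.roll(signal, shift))   # translate first, then convolve
  b = np.roll(conv(signal), shift)   # convolve first, then translate
  print("translation equivariance holds:", np.allclose(a, b))   # True

A world model would need the analogous property for its latent representation of motion, which is much harder to verify frame to frame.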

  • ontouchstart 3 hours ago

    The YouTube video tells a fascinating story. Who would be our Fermi today, able to tell the truth and save five years of work, billions of dollars, and the careers of Ph.D. students?

    We wouldn’t expect an LLM to review a paper and tell us the truth like Fermi did. That would be super-intelligence.

    Thanks for sharing.

Iolaum 3 hours ago

Noob question: isn't self-driving car software/AI already solving that?

verdverm 16 hours ago

I do wonder if this will meaningfully move the needle on agent assistants (coding, marketing, schedule my vacation, etc.), considering how much more compute (I would imagine) is needed for video / immersive environments during training and inference.

I suspect the calculus is more favorable for robotics

yalogin 14 hours ago

Isn’t this what all the AI companies are doing now? It's what's needed to enable robotics with LLMs, and DeepMind and others are all actively working on it AFAIK.

olirex99 7 hours ago

Spatial AI will for sure be a thing; I am just not sure it will be the next frontier.

The main problem that I still see: we do not fully understand how far the current models can scale. How much data do we need? Do we have the data for this kind of training? Can the current models generalize about the world?

Probably before seeing something really interesting we need another AI winter, where researchers can be researchers and not soldiers of companies.

  • pmontra 6 hours ago

    The data is out there if we give a robot at least wheels and let it bump into things like we did when we were little. We didn't need a billion pictures or videos - only trial and error. Then we developed a mental map of our home and our close neighborhood, and discovered that the rest of the world obeys the same rules. Training AIs doesn't work like that now.

    I think they want to follow the same route as LLMs: no understanding of the real world, but a brute-force approach that's good enough in the most useful scenarios. Same as airplanes: they can't fly in a bird-like way and they can't do bird things (land on a branch), but they are crazily useful for getting to the other side of the world in a day. They need a lot of brute force to do that.

    And yes, maybe an AI winter is what is needed to have the time to stop and have some new ideas.

jgord 13 hours ago

My take, after working on some algos to detect geometry from point clouds, is that it's solvable with current ML techniques, but we lack early-stage VC funding for startups working on this:

https://quantblog.wordpress.com/2025/10/29/digital-twins-the...

I have no doubt Fei-Fei and her well-funded team will make rapid progress.

  • john_minsk 13 hours ago

    We think alike. Have you tried automatically replacing the point cloud of a white wall with a generic white wall?

segmondy 12 hours ago

I would argue that some would add time to that as well; a lot of our data is missing spatial and temporal information. If we're able to take text2text models and add in audio/vision, then I suspect we can apply the same technique to add in spatial and temporal intelligence. However, the data for those is mostly non-existent, unlike audio and visual data.

htrp 16 hours ago

Her company World Labs is at the forefront of building spatial intelligence models.

sbinnee 6 hours ago

So Dr. Li has started writing a blog! I just subscribed. I can't wait for the next articles!

wangii 7 hours ago

She's done pretty important work, but since then she's been obsessed with the vague term `spatial intelligence`. What does it mean? There isn't a clear definition in the piece. It seems very intuitive and fundamental, but tbh not *rigorous*, nor insightful.

I bet it's a dead end.

  • sbinnee 6 hours ago

    It's rare for one person to achieve many things. Her ImageNet was certainly HUGE. But she is a researcher, and I think the true power of researchers is to persist. I also often think that researchers are too absorbed in their topics, but that is just their purpose.

    It could be a dead end for sure. I just hope that someone figures out the `spatial` part for AIs and brings us closer to better ones.

wartywhoa23 5 hours ago

Sure, because drones of the global City 17 can't fly blind.

brrrrrm 10 hours ago

We've discovered some kind of differentiable computer [1], and as with all computers, people have their own interests and hobbies they use it for. But unlike with computers, everyone pitches their interest or hobby as being the only one that matters.

[1] https://x.com/karpathy/status/1582807367988654081

[removed] 16 hours ago
[deleted]
alyxya 16 hours ago

Personally, I think the direction AI will go is an AI brain with something like an LLM at its core, augmented with various abilities like spatial intelligence, rather than models designed with spatial reasoning at their core. Human language and reasoning seems flexible enough to form some kind of spatial understanding, but I'm not so sure about the converse of having spatial intelligence derive human reasoning. Similar to how image generation models have struggled with generating the right number of fingers on hands, I would expect a world model designed to model physical space to fail to generalize the understanding of simple human ideas.

  • gf000 15 hours ago

    > Human language and reasoning seems flexible enough to form some kind of spatial understanding, but I'm not so sure about the converse of having spatial intelligence derive human reasoning

    I believe the null hypothesis would be that a model natively understanding both would work best/come closest to human intelligence (and possibly other modalities are needed as well).

    Also, as a complete layman: our language having several interconnections with spatial concepts would also point towards a multi-modal intelligence (topic: place; subject: lying under or near; respect/prospect: looking back/ahead; etc.). In my understanding, these connections only make their way into LLMs' representations secondarily.

    • alyxya 15 hours ago

      There's a difference between what a model is trained on and the inductive biases a model uses to generalize. It isn't as simple as training natively on everything. All existing models have certain things they generalize well and certain things they don't, due to their architecture, and the architectures of the world models I've seen don't seem as capable of generalizing universally as LLMs.

andy_ppp 12 hours ago

Not sure I want a robot that hallucinates around the home, but okay if it folds my laundry and cleans the house and so on!

inshard 15 hours ago

Also good context here is Friston's Free Energy Principle: a unified theory suggesting that all living systems, from simple organisms to the brain, must minimize "surprise" to maintain their form and survive. To do this, systems act to minimize a mathematical quantity called variational free energy, which is an upper bound on surprise. This involves constantly making predictions about the world, updating internal models based on sensory data, and taking actions that reduce the difference between predictions and reality, effectively minimizing prediction errors.

Key distinction: Constant and continuous updating. I.e. feedback loops with observation, prediction, action (agency), and once more, observation.

It should have survival and preservation as a fundamental architectural feature.
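
As a caricature of the perception half of that loop (a one-variable toy, not Friston's actual formalism; all constants are invented): gradient descent on squared prediction error, i.e. updating a belief until it explains the sensory data.

  import numpy as np

  rng = np.random.default_rng(1)
  hidden = 2.0                  # true state of the world
  mu, lr = 0.0, 0.2             # agent's belief and learning rate

  def g(mu):
      # Generative model: belief -> predicted observation.
      return 0.5 * mu

  for _ in range(200):
      obs = g(hidden) + rng.normal(scale=0.05)   # noisy sensory sample
      err = obs - g(mu)                          # prediction error
      mu += lr * 0.5 * err                       # descend d(0.5 * err**2)/d(mu)

  print(f"belief mu ~ {mu:.2f}; true hidden state = {hidden}")

The action half would add a second loop that changes the world (or where the agent samples) to make the same predictions come true.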

  • lsllc 14 hours ago

    > taking actions that reduce the difference between predictions and reality, effectively minimizing prediction errors

    Since you can't change reality itself, and you can only take actions to reduce variational free energy, doesn't this make everything into a self-fulfilling prophecy?

    I guess there must be some base level of instinct that overrides this; in the case of "I think that sabertooth tiger is going to eat me" you want to make sure the "don't get eaten" instinct counters "minimizing prediction errors".

    • inshard 12 hours ago

      Yep. Essentially take risks, expand your world model, but above all, don’t die. There’s a tension there - like “what happens if I poke the bear” vs “this might get me killed.”

jillesvangurp 5 hours ago

The article is a bit of a long read, but I've been looking at this topic for some time and things are definitely improving.

We build a map-based productivity app for workers. We map out the workplace and use e.g. asset tracking to visualize where things are, helping people find these things and navigate around. There's a lot more to this of course, but we typically geo-reference whatever building maps we can get our hands on on top of OpenStreetMap. This allows people to zoom in and out and switch between indoor and outdoor.
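
The geo-referencing step itself is simple enough to sketch (a generic least-squares affine fit, not our actual pipeline; the control points are invented):

  import numpy as np

  # (pixel_x, pixel_y) -> map-coordinate control pairs, picked by hand
  px = np.array([[10.0, 20.0], [400.0, 30.0], [390.0, 560.0]])
  geo = np.array([[913000.0, 6590000.0],
                  [913080.0, 6590002.0],
                  [913078.0, 6589895.0]])

  # Fit geo ~ [px | 1] @ A in the least-squares sense
  X = np.hstack([px, np.ones((len(px), 1))])
  A, *_ = np.linalg.lstsq(X, geo, rcond=None)

  def to_map(p):
      # Apply the fitted affine transform to any floor-plan point.
      return np.append(np.asarray(p, float), 1.0) @ A

  print(to_map([200, 300]))

With more than three control points, the same fit averages out digitization error, which matters for hand-traced fire escape plans.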

The hard part for us: sourcing decent building maps. There is usually some building map information available in the form of CAD drawings, fire escape plans, etc., but they aren't really well suited for use as a user-friendly map, and getting vector graphics for them is typically hard. In short, we usually have to spend quite a bit of effort on designing or sourcing maps. And of course these maps aren't static: people extend buildings, move equipment and machines around, and re-purpose the spaces they have. A map of a typical factory is an empty rectangle. You can see where the walls, windows, and doors are, and any supporting columns. All the interesting stuff happens in the negative spaces (the blank space between the walls).

Mapping all this is a manual process because it requires people to interpret raw spatial data in context. We build our own world model. A great analogy is text-based adventure games, where the only map you had was the one you built in your head by querying the game. It's a surprisingly hard problem. We're used to decent-quality public maps outdoors, but indoors there isn't much. Making outdoor maps is quite expensive, but lucrative enough that companies have been investing in it for years. OpenStreetMap has also tapped into a huge community of people who manually edit things and/or integrate third-party data sets (a lot of stuff is imported as well).

Recently, with Google's nano banana model, creating building maps got a lot easier. It has some notion of proportions and dimensions. I was able to take a smartphone photo of the fire escape plan mounted on the wall and have nano banana clean it up and transform it without destroying dimensions, hallucinating new walls, doors, or windows, or changing the dimensions of rooms. We've also been experimenting with turning bitmaps into vector graphics, which can work with promising results but still needs work. But even just a cleaned-up fire escape plan, minus all the escape routes and other map clutter, is already a massive improvement for us. Fire escape plans are everywhere and are kind of the baseline map we can get for pretty much any building, provided they are to scale - which, at least in Germany, they are (standards for this are pretty strict).

AI-based map content creation from photos, reference CAD diagrams, textual descriptions, etc. is what we want to target next. Given a basic CAD map and a photo taken in the building, can we deduce the vantage point from which the photo was taken, identify things in the photo, and put them on the map in the correct position? People are able to do this with enough context; that's what OpenStreetMap editors do when they add detail to the map. AI models so far don't quite do all of this yet. Essentially this is about creating an accurate world model and using it to populate maps with content. It's not just about things like lidar and stereo vision, but about understanding what is what in a photo.

In any case, that's just one example of where I see a lot of potential for smarter models. Nano banana was the first model to not make a mess of our maps.

dauertewigkeit 16 hours ago

Sutton: Reinforcement Learning

LeCun: Energy Based Self-Supervised Learning

Chollet: Program Synthesis

Fei-Fei: ???

Are there any others with hot takes on the future architectures and techniques needed for A-not-quite-G-I?

  • yzydserd 16 hours ago

    > Fei-Fei: ???

    Underrated and unsung. Fei-Fei Li first launched ImageNet way back in 2007, a hugely influential move that sparked much of the computer-vision deep learning that followed. I remember jph00 saying in a lecture about 7 years ago that "text is just waiting for its ImageNet moment" -> then came the GPT explosion. Fei-Fei was massively instrumental in where we are today.

    • byearthithatius 15 hours ago

      Curating a dataset is vastly different from introducing a new architectural approach. ImageNet is a database. It's not like inventing the convolutions for CNNs, or the LSTM, or the Transformer.

      • davmre 13 hours ago

        It's true that these are very different activities, but I think most ML researchers would agree that it's actually the creation of ImageNet that sparked the deep learning revolution. CNNs were not a novel method in 2012; the novelty was having a dataset big and sophisticated enough that it was actually possible to learn a good vision model from it without hand-engineering all the parts. Fei-Fei saw this years in advance and invested a lot of time and career capital setting up the conditions for the bitter lesson to kick in. Building the dataset was 'easy' in a technical sense, but knowing that a big dataset was what the field needed, and staking her career on it when no one else was doing or valuing this kind of work, was her unique contribution, and took quite a bit of both insight and courage.

      • dauertewigkeit 15 hours ago

        CNNs and Transformers are both really simple and intuitive, so I don't think there was any stroke of genius in how they were devised.

        Their success is due to datasets, and to the tooling that allowed models to be trained on large amounts of data sufficiently fast using GPU clusters.

[removed] 14 hours ago
[deleted]
gradus_ad 16 hours ago

I'd imagine Tesla's and Waymo's AIs are at the forefront of spatial cognition... this is what has made me hesitant to dismiss the AI hype as a bubble. Once spatial cognition is solved to the extent that language has been, a range of applications currently unavailable will drive a tidal wave of compute demand. Beyond self-driving, think fully autonomous drone swarms... Militaries around the world certainly are thinking about those, and they're salivating.

  • jandrewrogers 15 hours ago

    The automotive AIs are narrow pseudo-spatial models that are good at extracting spatial features from the environment to feed fairly simple non-spatial models. They don't really reason spatially in the same sense that an animal does. A tremendous amount of human cognitive effort goes into updating the maps that these systems rely on.

    • gradus_ad 15 hours ago

      Help me understand - my mental model of how automotive AI works is that it uses neural nets to process visual information and output a decision on where to move in relation to objects in the world around it. Yes, they are moving in a constrained 2D space, but is that not fundamentally what animals do?

      • abstractanimal 15 hours ago

        What you're describing is what's known as an "end-to-end" model that takes in image pixels and outputs steering and throttle commands. What happens in an AV is that a bunch of ML models produce input for software written by human engineers, so the output doesn't come from an entirely ML system; it's a mix of engineered and trained components for various identifiable tasks (perception, planning, prediction, controls).
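
        A skeletal sketch of that modular split (all names and numbers invented; the learned parts are stubbed out):

          def perceive(image):
              # Trained component: pixels -> detected objects (stubbed).
              return [{"pos": (12.0, 3.4), "vel": (-1.0, 0.0)}]

          def predict(objects):
              # Trained component: objects -> short-horizon future tracks.
              return [[(o["pos"][0] + o["vel"][0] * t,
                        o["pos"][1] + o["vel"][1] * t)
                       for t in range(1, 4)] for o in objects]

          def plan(tracks):
              # Engineered component: explicit safety rule, not a network.
              too_close = any(abs(x) < 2.0 and abs(y) < 2.0
                              for track in tracks for (x, y) in track)
              return {"brake": too_close, "steer": 0.0}

          print(plan(predict(perceive(image=None))))   # {'brake': False, ...}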

  • dauertewigkeit 16 hours ago

    Tesla's problems with their multi-camera, non-lidar system are precisely because it doesn't have any spatial cognition.

  • byearthithatius 15 hours ago

    100% agree, but not just military. Self-driving vehicles will become the norm, along with robots to mow the lawn and clean the house, and eventually humanoids that can interact like LLMs and be functional robots that help out around the house.

  • pharrington 15 hours ago

    Spatial cognition really means "autonomous robot," and nobody thinks Tesla or Waymo have the most advanced robots.

t0lo 7 hours ago

hype buzz malarkey that is going to further lobotomise our children and return more value for shareholders

zombot 8 hours ago

Getting basic trivial shit right is AI's next frontier.

programjames 16 hours ago

Far too much marketing speak, far too little math or theory, and it completely misses the mark on the 'next frontier'. Maybe four years ago spatial reasoning was the problem to solve, but by 2022 it was solved; all that remained was scaling up. The actual next three problems to solve (in order of when they will be solved) are:

- Reinforcement Learning (2026)

- General Intelligence (2027)

- Continual Learning (2028)

EDIT: lol, funny how the idiots downvote

  • whatever1 16 hours ago

    Combinatorial search is also a solved problem. We just need a couple of Universes to scale it up.

    • programjames 16 hours ago

      If there isn't a path humans know how to take with their current technology, it isn't a solved problem. That's much different from training an image model for research purposes and knowing that $100M in compute is probably enough for a basic video model.

  • 7moritz7 16 hours ago

    Haven't RLHF and RL with LLM feedback been around for years now?

    • programjames 16 hours ago

      Large latent flow models are unbiased. On the other hand, if you purely use policy optimization, RLHF will be biased towards short horizons. If you add in a value network, the value has some bias (e.g. MSE loss on the value -> Gaussian bias). Also, most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal, which SGD smooths incorrectly. So basically, there are a lot of biases that show up in RL training, which can make it both hard to train and, even if successful, not necessarily optimizing what you want.

      • storus 15 hours ago

        We might not even need RL, as DPO has shown.

        • programjames 15 hours ago

          > if you purely use policy optimization, RLHF will be biased towards short horizons

          > most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal, which SGD smooths incorrectly

  • l9o 16 hours ago

    What do you consider "General Intelligence" to be?

    • programjames 16 hours ago

      A good start would be:

      1. Robustness to adversarial attacks (e.g. in classification models or LLM steering).

      2. Solving ARC-AGI.

      Current models are optimized to solve the problem they're currently presented with, not to find the most general problem-solving techniques.

  • koakuma-chan 16 hours ago

    In my thinking, what AI lacks is a memory system.

    • 7moritz7 16 hours ago

      That has been solved with RAG, OCR-ish image encoding (DeepSeek recently), and just long context windows in general.
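
      For reference, the RAG idea in miniature (bag-of-words cosine similarity standing in for a real embedding model; everything here is illustrative):

        import numpy as np

        docs = ["the cat sat on the mat",
                "paris is the capital of france",
                "llms forget things outside their context window"]

        def embed(text):
            # Toy bag-of-words "embedding"; real RAG uses a learned encoder.
            vocab = sorted({w for d in docs for w in d.split()})
            v = np.array([text.split().count(w) for w in vocab], float)
            return v / (np.linalg.norm(v) + 1e-9)

        def retrieve(query, k=1):
            sims = [float(embed(query) @ embed(d)) for d in docs]
            return [docs[i] for i in np.argsort(sims)[::-1][:k]]

        # The retrieved text gets prepended to the LLM prompt as "memory".
        print(retrieve("llms forget context"))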

      • Eisenstein 12 hours ago

        RAG is like constantly reading your notes instead of integrating experiences into your processes.

      • koakuma-chan 16 hours ago

        Not really. For example, we still can't get coding agents to work reliably, and I think that's a memory problem, not a capabilities problem.

        • atlex2 14 hours ago

          On the other hand, test-time weight updates would make model interpretability much harder.

    • [removed] 16 hours ago
      [deleted]
baxuz 3 hours ago

As soon as I see an article on Substack, I assume it's misinformation or has an agenda attached to it.

Proven correct yet again.

frenchie4111 15 hours ago

I enjoy Fei-Fei Li's communication style. It's straight and to the point in a way that I find very easy to parse. She's one of my primary idols in the AI space these days.