Comment by slashdave

Comment by slashdave 3 days ago

View on Hacker News

No, it actually needs none of that.

in-silico 2 days ago

How would it do what it does without those things?

slashdave 2 days ago

The same way all these models work: by simple interpolation.

- in-silico 2 days ago
  
  But how does it interpolate?
  
  slashdave 2 days ago
  
  Pixel by pixel, time-slice by time-slice, in a 2D+T convolution. Provide enough example videos with a changing point of view, and the model reproduces what it is given.
  
  in-silico 2 days ago
  
  Yes, it reproduces what it is given by modelling the rules of physics, geometry, etc.
  For example, image generators like Stable Diffusion carry strong representations of depth and geometry, such that performant depth estimation models can be built out of them with minimal retraining. This continues to be true for video generation models.
  Early work on the subject: https://arxiv.org/pdf/2409.09144
  
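For readers unfamiliar with the "2D+T convolution" mentioned in the thread, here is a rough sketch of what such an operation computes. It is an illustration only, not the architecture of any model discussed here: the function name `conv2d_plus_t`, the toy video, and the hand-written kernel are all assumptions made up for this example. Like most deep learning layers, it computes a cross-correlation (no kernel flip), and a real model would learn many such kernels rather than use one fixed by hand.

```python
def conv2d_plus_t(video, kernel):
    """Valid-mode 2D+T cross-correlation over a video given as nested lists [T][H][W]."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = [[[0.0] * (W - kw + 1) for _ in range(H - kh + 1)]
           for _ in range(T - kt + 1)]
    for t in range(T - kt + 1):          # slide over time
        for y in range(H - kh + 1):      # slide over rows
            for x in range(W - kw + 1):  # slide over columns
                out[t][y][x] = sum(
                    video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                    for dt in range(kt)
                    for dy in range(kh)
                    for dx in range(kw)
                )
    return out


# Hypothetical toy video: 4 frames of 5x5, a bright dot moving one pixel right per frame.
video = [[[0.0] * 5 for _ in range(5)] for _ in range(4)]
for t in range(4):
    video[t][2][t] = 1.0

# Hypothetical 2x1x2 kernel that responds when a pixel is lit at (y, x) in one
# frame and at (y, x + 1) in the next frame, i.e. rightward motion.
kernel = [
    [[1.0, 0.0]],  # frame t
    [[0.0, 1.0]],  # frame t + 1
]

response = conv2d_plus_t(video, kernel)
print(response[0][2][0])  # response where the dot starts moving right
```

Each output value depends only on a small space-time neighbourhood of the input, which is the sense in which the operation runs "pixel by pixel, time-slice by time-slice". Whether stacks of such layers amount to interpolation or to an internal model of geometry is the point the thread disagrees on, and this sketch does not settle it.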