Comment by echelon 12 hours ago

We started putting them in image and video models, and now those models are insane.

I think the next period of high and rapid growth will be in media (image, video, sound, 3D), not text.

It's much harder to adapt LLMs to text-based business use cases: each problem is niche, the solution has to be custom-tailored, and the tooling is crude.

The media use cases, by contrast, are low-hanging fruit and result in 10,000x speedups and cost reductions almost immediately. The models are pure magic.

I think more companies would be wise to ignore text for now and focus on visual domain problems.

Nano Banana has so much more utility than agents. And there are so many low-hanging-fruit ways to make lots of money.

Don't sleep on image and video. That's where the growth salient is.

wild_egg 12 hours ago

> Nano Banana has so much more utility than agents.

I am so far removed from multimedia spaces that I truly can't imagine a universe where this could be true. Agents have done incredible things for me and Nano Banana has been a cool gimmick for making memes.

Anyone have a use case for media models that'll expand my mind here?

  • echelon 11 hours ago

    We now have the capacity to program and automate in the optics, signals, and spatial domains.

    As someone in the film space, here's just one example: we are getting extremely close to being able to make films with only AI tools.

    Nano Banana makes it easy to create character- and location-consistent shots that adhere to film language and the rules of storytelling. This still isn't "one shot", and considerable effort still has to be put in by humans, not unlike AI assistance in IDEs still requiring a human engineer as pilot.
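
    For the curious, here is a minimal sketch of what that workflow can look like through Google's Gemini API, via the google-genai Python SDK. The model id, file names, and prompt are illustrative assumptions on my part, not a definitive recipe:

      # pip install google-genai
      from google import genai
      from google.genai import types

      client = genai.Client()  # reads GEMINI_API_KEY from the environment

      # Pass a reference image plus a prompt that pins down character,
      # wardrobe, and framing, so the new shot stays consistent with it.
      with open("character_ref.png", "rb") as f:
          reference = f.read()

      response = client.models.generate_content(
          model="gemini-2.5-flash-image",  # "Nano Banana"; assumed model id
          contents=[
              types.Part.from_bytes(data=reference, mime_type="image/png"),
              "Same character, same wardrobe: medium close-up, warm key "
              "light, shallow depth of field, seated in the diner from "
              "the reference image.",
          ],
      )

      # Save any returned image parts; a human still reviews, re-prompts,
      # and edits, exactly as described above.
      for part in response.candidates[0].content.parts:
          if part.inline_data:
              with open("shot_001.png", "wb") as out:
                  out.write(part.inline_data.data)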

    We're entering the era of the two-person film studio. You'll undoubtedly start seeing AI short films next year. I had one art school professor tell me that film seems like it's turning into animation, and that "photorealism" is just style transfer or an aesthetic choice.

    The film space is hardly the only space where these models have utility. There are so many domains. News, shopping, gaming, social media, phone and teleconference, music, game NPCs, GIS, design, marketing, sales, pitching, fashion, sports, all of entertainment, consumer, CAD, navigation, industrial design, even crazy stuff like VTubing, improv, and LARPing. So much of what we do as humans is non-text based. We haven't had effective automation for any of this until this point.

    This is a huge percentage of the economy. This is actually the beating heart of it all.

    • wild_egg 3 hours ago

      Been thinking about this. Curious why you positioned it as Nano Banana having more utility than agents, when the next level would seem to be Nano Banana combined with agents?

      The two are kind of orthogonal concepts.

    • yunwal 11 hours ago

      > we are getting extremely close to being able to make films with only AI tools

      AI still can't reliably write text on background details. It can't get shadows right. If you ask it to shoot something from a head-on perspective, a bookshelf for example, it fails to keep the proportions accurate. The bookshelf will not have parallel shelves. The books won't have text. If it's in a library, the labels will not be in Dewey decimal order.

      It still lacks a huge amount of understanding about how the world works necessary to make a film. It has its uses, but pretending like it can make a whole movie is laughable.

      • wild_egg 9 hours ago

        I don't think they're suggesting AI could one-shot a whole movie. It would be iterative, just like programming.

        • echelon 3 hours ago

          Exactly. You can still open the generations in Photoshop.

          I'd say the image and video tools are much further along and much more useful than AI code gen (not to dunk on code autocomplete). They save so much time and are quite incredible at what they can do.

      • gabriel666smith 9 hours ago

        I don't think equating "extremely close" with "pretending like it can" is a fair way to frame the sentiment of the comment you were replying to. Saying something is close to doing something is not the same as saying it already can.

        In terms of cinema tech, it arguably took us until the early 1940s to achieve "deep focus in artificial light". About 50 years!

        The last couple of years of development in generative video look, to me, like the tech is improving more quickly than the tech it is mimicking did. This seems unsurprising - one was definitely a hardware problem, and the other is most likely a mixture of hardware and software problems.

        Your complaints (or analogous technical complaints) would have been acceptable issues - things one had to work around - for a good deal of cinema history.

        We've already reached the stage of people complaining that "these book spines are illegible", which feels very close to "it's difficult to shoot in focus, indoors". Will that take four or five decades to achieve, based on the last 3-5 years of development?

        The tech certainly isn't there yet, nor am I pretending like it is, and nor was the comment you replied to. To call it close is not laughable, though, in the historical context.

        The much more interesting question is: At what point is there an audience for the output? That's the one that will actually matter - not whether it's possible to replicate Citizen Kane.