Comment by miki123211 a day ago

4 replies

Yeah, Eleven Labs must be raking it in.

You can get hours of audio out of it for free with Eleven Reader, which suggests that their inference costs aren't that high. Meanwhile, those same few hours of audio, at the exact same quality, would cost something like $100 when generated through their website or API, a lot more than any other provider out there. Their pricing (and especially API pricing) makes no sense unless it's just price discrimination.

Somebody with slightly deeper pockets than academics or one guy in a garage needs to start competing with them and drive costs down.

Open TTS models don't even seem to utilize audiobooks or data scraped off the internet; most are still trained on LibriVox / LJ Speech. That's like training an LLM on just Wikipedia and expecting great results. That may have worked in 2018, but even in 2020 we knew better, not to mention 2025.

TTS models never had their "Stable Diffusion moment"; it's time we got one. I think all it would take is somebody doing open-weight models applying the lessons we learned from LLMs and image generation to TTS models, namely more data, more scraping, more GPUs, fewer qualms and less safety. Eleven Labs already did, and they're profiting from it handsomely.

pzo a day ago

Kokoro gives great results, especially when speaking English. The model is small enough to run even on a smartphone at ~3x faster than realtime.

  • miki123211 10 hours ago

    Kokoro just proves my point; it's "one guy in a garage", 1000 hours of distilled audio (I think) and ~100M params.

    With a budget one tenth that of Stable Diffusion and fewer ethical qualms, you could easily 10x or 100x this.

  • bavell a day ago

    Another +1 to Kokoro from me: great quality with good speed.