Comment by miki123211

Yeah, Eleven Labs must be raking it in.

You can get hours of audio out of it for free with Eleven Reader, which suggests that their inference costs aren't that high. Meanwhile, those same few hours of audio, at the exact same quality, would cost something like $100 when generated through their website or API, a lot more than any other provider out there. Their pricing (and especially their API pricing) makes no sense unless it's just price discrimination.

Somebody with slightly deeper pockets than academics or one guy in a garage needs to start competing with them and drive costs down.

Open TTS models don't even seem to use audiobooks or data scraped off the internet; most are still trained on LibriVox / LJ Speech. That's like training an LLM on just Wikipedia and expecting great results. That may have worked in 2018, but even in 2020 we knew better, not to mention 2025.

TTS models never had their "Stable Diffusion moment"; it's time we got one. I think all it would take is somebody doing open-weight models and applying the lessons we learned from LLMs and image generation to TTS: more data, more scraping, more GPUs, fewer qualms and less safety. Eleven Labs already did, and they're profiting handsomely from it.

pzo 8 months ago

Kokoro gives great results, especially when speaking English. The model is small enough to run even on a smartphone, ~3x faster than realtime.

miki123211 8 months ago

Kokoro just proves my point; it's "one guy in a garage", 1000 hours of distilled audio (I think) and ~100M params.
With a budget one tenth that of Stable Diffusion and fewer ethical qualms, you could easily 10x or 100x this.

- cchance 8 months ago
  
  I'm actually surprised people aren't just using elevenreader to generate solid content from various books for datasets lol
  
bavell 8 months ago

Another +1 to Kokoro from me, great quality with good speed.

