Comment by lc64 a year ago

"was trained on <100 hours of audio"

How the hell was it trained on that little data?

bbminner a year ago

I suppose it means per speaker. It is also based on a simplified StyleTTS 2, which, from my brief dive into the subject, seems to be one of the smaller models that achieve great quality.

Havoc a year ago

Yeah, that surprised me as well. It seems low compared with what is used for text LLMs. To be fair, 100 hours of speech is a lot of speaking, though.

edude03 a year ago

But it covers five(?) languages, so if they are all equally represented, it’s just 20 hours per language.

- em-bee a year ago
  
  in the linked audio sample it says the training data is mostly English. also, another comment claims that the Japanese quality is not good, so i'd be suspicious about all the other languages.
  