I suppose it means per speaker. It is based on a simplified StyleTTS 2, which, from my brief dive into the subject, seems to be one of the smaller models that achieves great quality.
The linked audio sample says the training data is mostly English. Another comment also claims that the Japanese quality is not good, so I'd be suspicious about all the other languages.