Comment by delegate
The quality is great (amazing even), but I can't listen to AI-generated voices for more than a minute. I don't know why, I just don't like it. I immediately skip the video on YouTube if the voice is AI-generated.
Might be because our brains try to 'feel' the speaker, the emotion, the pauses, the invisible smile, etc.
No doubt models will improve and become harder to identify as AI-generated, but for now, as with diffusion images, I still notice it and react by just moving on.
That kinda means the quality isn't great or amazing. Good TTS should be nearly or completely indistinguishable from a human speaker and should include emoting, natural pauses, etc.