Comment by delegate 4 days ago

7 replies

The quality is great (amazing even), but I can't listen to AI generated voices for more than 1 minute. I don't know why, I just don't like them. I immediately skip the video on YouTube if the voice is AI generated.

Might be because our brains try to 'feel' the speaker, the emotion, the pauses, the invisible smile, etc.

No doubt models will improve and will be harder to identify as AI generated, but for now, as with diffusion images, I still notice it and react by just moving on.

rockemsockem 4 days ago

That kinda means the quality isn't great or amazing. Good TTS should be nearly or entirely indistinguishable from a human speaker and should include emoting, natural pauses, etc.

CMay 3 days ago

Haven't really been following the latest in TTS ML, but I expected this to be better, or at least as good-bad as the stuff you hear on YouTube. Somehow it sounds worse. It really is jarring to listen to any of these ML voices, and I can't really stand it. I nope out of every video that uses them, and I can't tell if YouTube never recommends them to me for that reason, or just because the recommendations around what I watch are so rarely going to be from some low-reputation channel.

Take a moment here for a second though and think about it. Even if these voices got to be really good, indistinguishable almost... would I want to listen to it even then? If it was an NPC's generated voice and generated dialogue in a game to help enrich the world building, maybe in that context. On YouTube or with newscasters? Probably not. Audio books? Think I would still rather have it be a real person, because it's like they're reading a story to me and it feels better if it's coming from someone. There's also the unknown factor, where if it's ML generated it's so sterile that the unknowns are kind of gone.

Think about it like this, in the movie industry we had practical effects that were charming in a way. You could think about the physical things that had to occur to make that happen. Movie magic. Now, everything is so CG it's like the magic is gone. Even though you know people put serious hard work into it, there's a kind of inauthenticity and just lack of relevance to the real world that takes something away from it.

It's like a real magician has interesting tricks, while an artificial magician is most likely just a liar.

Still, I grant that it makes some cool things possible and there is potential if things are done right. Some positive mixture of real humans and machine generated stuff so it isn't devoid of anything connected to real life effort.

_DeadFred_ 3 days ago

For new generations/those coming up now, this will be the norm and won't generate the negative reaction it does for us. It will just be part of how the world is and has always been, and eventually we will be the minority.

Future generations will never know a world where you don't watch a 2 hour AI generated orientation video about the wonders of working for Generic Corp when you start a new job.

yjftsjthsd-h 3 days ago

> I immediately skip the video on youtube if the voice is AI generated.

I mean, I do that because it's correlated with the content being garbage. If I'm intentionally using it on content I want to consume I expect it to be different, though I haven't gotten around to trying it properly yet so I guess we'll see. (OTOH I already listen to ebooks via pre-AI TTS, so I'm optimistic)

xdennis 3 days ago

Among other things, what I don't like is the hallucinated stress. Take the classic example of:

> I never said she stole my money

It can have 7 different meanings based on which word you stress.
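To make the seven readings concrete, here is a minimal sketch that enumerates them by marking one word at a time as stressed. It assumes the markup is SSML's `<emphasis>` element; whether any given TTS engine (Kokoro included) accepts or honors SSML is not something this thread establishes. The function name `stress_variants` is hypothetical.

```python
from xml.sax.saxutils import escape

SENTENCE = "I never said she stole my money"


def stress_variants(sentence: str) -> list[str]:
    """Return one SSML string per word, with only that word emphasized."""
    words = sentence.split()
    variants = []
    for stressed in range(len(words)):
        parts = [
            # Assumption: SSML <emphasis> is the markup; engine support varies.
            f'<emphasis level="strong">{escape(word)}</emphasis>'
            if index == stressed
            else escape(word)
            for index, word in enumerate(words)
        ]
        variants.append("<speak>" + " ".join(parts) + "</speak>")
    return variants


if __name__ == "__main__":
    for variant in stress_variants(SENTENCE):
        print(variant)
```

Each output line is the same seven words with the stress moved one position, which is exactly the information plain text does not carry and a TTS model has to guess.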

The new AI voices sound very natural at a shallow level, but overall pronounce things in odd ways. Not quite wrong, but subtly unnatural which introduces some cognitive load.

Old TTS systems with their monotone voices are less confusing, but sound very robotic.

karmasimida 3 days ago

Yeah same.

Doesn't mean the quality is bad. In fact I think Kokoro's quality is amazing.

But it is not the right tool for narration. The kind of training data they use makes the sound too flat, if that makes sense.