Comment by micw 4 days ago

With this technology, one could produce high quality audio books without having access to high quality narrators by annotating the books with the voice, speed and such things.

I wonder if a standardized markup exists to do so.

albert_e 4 days ago

There is SSML, a speech markup language for indicating various characteristics of speech, such as whispering, pronunciation, pace, emphasis, etc.
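As a rough illustration of what such markup looks like (the elements below are from the W3C SSML 1.1 spec, though actual support varies by TTS engine; the voice names are hypothetical placeholders):

```xml
<speak>
  <!-- Narrator voice, with a pause before the dialogue -->
  <voice name="narrator-voice">
    In a hole in the ground there lived a hobbit.
    <break time="500ms"/>
  </voice>
  <!-- Character voice, slightly slower and higher-pitched -->
  <voice name="bilbo-voice">
    <prosody rate="95%" pitch="+2st">
      <emphasis level="strong">Good morning!</emphasis>
    </prosody>
  </voice>
</speak>
```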

With LLMs proving to be very good at generating code, it may be reasonable to assume they can get good at generating SSML as well.

Not sure if there is a more direct way to channel the interpretation of the tone/context/emotion etc from prose into generated voice qualities.

If we train models on ebooks paired with their professionally produced, human-narrated audiobooks, then with enough variety and volume of training data the models might capture the essence of that human interpretation of written text. Just maybe?

Amazon, with its huge collection of Audible and Kindle titles, has a huge corpus for this (if it can use it without violating any rights). They already have "Whispersync", a feature that syncs the text of a Kindle ebook with the words of the corresponding Audible audiobook.

  • micw 4 days ago

    Good points, thank you! I just tested it. While ChatGPT was very good at adding generic (textual) annotations, the results for generating SSML were very poor (lack of voice names, lack of distinction between narrator and characters, etc.).

    A model trained specifically for this task, plus a human audit, could probably produce very good results.

pegasus 4 days ago

They still wouldn't be high quality. It's just not possible to capture the precise tone of voice in an annotation, and that precision I believe really makes a difference. My experience is that the deeper the narrator understands the text and conveys that understanding, the easier it becomes for me to absorb that information.

  • vasco 4 days ago

    Have you tried those "podcast from a paper" models? They do some of the things you say they don't. It's not 100% there, but it's miles ahead of, for example, human Polish TV lectors or other monotone-style narrations.

KeplerBoy 4 days ago

Don't end-to-end-trained models already do this to some extent? Like raising the pitch toward a question mark, as a human would.

TortoiseTTS has a few examples under prompt engineering on their demo site: https://nonint.com/static/tortoise_v2_examples.html

  • micw 4 days ago

    That's a bit basic and random. Some models have the features you describe; from the better models you get a slightly different voice for text in quotes.

    But what distinguishes a good audiobook is that it has:

    * different voices for the narrator and each character
    * different emotions and/or pacing in certain situations.

    I guess you could use an LLM to "understand" and annotate an existing book (given a suitable markup), then use TTS to create an audiobook from it, automating most of the process.

    • micw 4 days ago

      Edit: I actually tried this. I prompted ChatGPT with:

      "Annotate the following text with speakers and emotions so that it can be turned into an audiobook via TTS", followed by a short text from "The Hobbit" (the "Good morning" scene). The result is very good.
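The annotate-then-synthesize pipeline described above can be sketched in a few lines. This assumes a simple made-up annotation convention (`[Speaker | emotion] text`), not a standard format; an LLM's actual output would vary, and each cue would be handed to whatever TTS engine you use, with an engine-specific voice and prosody mapping:

```python
import re

# Hypothetical annotation format: one cue per line, e.g.
#   [Bilbo | cheerful] "Good morning!"
# This convention is an assumption for illustration, not a standard.
LINE_RE = re.compile(r"^\[(?P<speaker>[^|\]]+)\|(?P<emotion>[^\]]+)\]\s*(?P<text>.+)$")

def parse_script(script: str):
    """Turn an annotated script into (speaker, emotion, text) cues."""
    cues = []
    for line in script.splitlines():
        line = line.strip()
        if not line:
            continue
        m = LINE_RE.match(line)
        if m:
            cues.append((m["speaker"].strip(), m["emotion"].strip(),
                         m["text"].strip()))
        else:
            # Unannotated lines fall back to a neutral narrator.
            cues.append(("Narrator", "neutral", line))
    return cues

script = '''
[Narrator | neutral] Bilbo was standing at his door after breakfast.
[Bilbo | cheerful] "Good morning!"
[Gandalf | wry] "What do you mean? Do you wish me a good morning?"
'''

for speaker, emotion, text in parse_script(script):
    # Here a real pipeline would call a TTS engine with a per-speaker
    # voice and an emotion/prosody hint; we just print the cue.
    print(f"{speaker:>8} [{emotion}]: {text}")
```

The point is only that the hard part (deciding who speaks and how) is pushed onto the LLM annotation step; the rest is mechanical dispatch to the TTS engine.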