Comment by micw 4 days ago

With this technology, one could produce high quality audio books without having access to high quality narrators by annotating the books with the voice, speed and such things.

I wonder if a standardized markup exists to do so.

albert_e 4 days ago

There is SSML, a speech markup language for indicating various characteristics of speech, such as whispering, pronunciation, pace, emphasis, etc.
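As a rough illustration of what such markup looks like (the elements below are from the W3C SSML 1.1 spec, though actual support varies by TTS engine; the voice names are hypothetical placeholders):

```xml
<speak>
  <!-- Narrator voice, with a pause before the dialogue -->
  <voice name="narrator-voice">
    In a hole in the ground there lived a hobbit.
    <break time="500ms"/>
  </voice>
  <!-- Character voice, slightly slower and higher-pitched -->
  <voice name="bilbo-voice">
    <prosody rate="95%" pitch="+2st">
      <emphasis level="strong">Good morning!</emphasis>
    </prosody>
  </voice>
</speak>
```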

With LLMs proving to be very good at generating code, it may be reasonable to assume they can get good at generating SSML as well.

Not sure if there is a more direct way to channel the interpretation of the tone/context/emotion etc from prose into generated voice qualities.

If we train models on ebooks paired with their professionally produced, human-narrated audiobooks, then with enough variety and volume of training data the models might capture the essence of that human interpretation of written text. Just maybe?

Amazon, with its huge collection of Audible and Kindle titles, has a huge corpus for this (if it can use it without violating any rights). They already have "Whispersync", a feature that syncs the text of a Kindle ebook with the words of the corresponding Audible audiobook.

  • micw 4 days ago

    Good points, thank you! I just tested it. While ChatGPT was very good at adding generic (textual) annotations, the results for generating SSML were very poor (lack of voice names, lack of distinction between narrator and characters, etc.).

    A model trained specifically for this task, plus a human audit, could probably produce very good results.

pegasus 4 days ago

They still wouldn't be high quality. It's just not possible to capture the precise tone of voice in an annotation, and that precision I believe really makes a difference. My experience is that the deeper the narrator understands the text and conveys that understanding, the easier it becomes for me to absorb that information.

  • vasco 4 days ago

    Have you tried those "podcast from a paper" models? They do some of the things you say they don't. It's not 100% there, but it's miles ahead of, for example, human Polish TV lectors or other monotone-style narrations.

KeplerBoy 4 days ago

Don't end-to-end-trained models already do this to some extent? Like raising the pitch toward a question mark, as a human would.

TortoiseTTS has a few examples under prompt engineering on their demo site: https://nonint.com/static/tortoise_v2_examples.html

  • micw 4 days ago

    That's a bit basic and random. Some models have the features you describe; from the better models you get a slightly different voice for text in quotes.

    But what distinguishes a good audiobook is that it has:

    * different voices for the narrator and each character
    * different emotions and/or pacing in certain situations.

    I guess you could use an LLM to "understand" and annotate an existing book (given a suitable markup), then use TTS to create an audiobook from it, automating most of the process.

    • micw 4 days ago

      Edit: I actually tried this. I prompted ChatGPT with:

      "Annotate the following text with speakers and emotions so that it can be turned into an audiobook via TTS", followed by a short text from "The Hobbit" (the "Good morning" scene). The result is very good.
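The annotate-then-synthesize pipeline described above can be sketched in a few lines. This assumes a simple made-up annotation convention (`[Speaker | emotion] text`), not a standard format; an LLM's actual output would vary, and each cue would be handed to whatever TTS engine you use, with an engine-specific voice and prosody mapping:

```python
import re

# Hypothetical annotation format: one cue per line, e.g.
#   [Bilbo | cheerful] "Good morning!"
# This convention is an assumption for illustration, not a standard.
LINE_RE = re.compile(r"^\[(?P<speaker>[^|\]]+)\|(?P<emotion>[^\]]+)\]\s*(?P<text>.+)$")

def parse_script(script: str):
    """Turn an annotated script into (speaker, emotion, text) cues."""
    cues = []
    for line in script.splitlines():
        line = line.strip()
        if not line:
            continue
        m = LINE_RE.match(line)
        if m:
            cues.append((m["speaker"].strip(), m["emotion"].strip(),
                         m["text"].strip()))
        else:
            # Unannotated lines fall back to a neutral narrator.
            cues.append(("Narrator", "neutral", line))
    return cues

script = '''
[Narrator | neutral] Bilbo was standing at his door after breakfast.
[Bilbo | cheerful] "Good morning!"
[Gandalf | wry] "What do you mean? Do you wish me a good morning?"
'''

for speaker, emotion, text in parse_script(script):
    # Here a real pipeline would call a TTS engine with a per-speaker
    # voice and an emotion/prosody hint; we just print the cue.
    print(f"{speaker:>8} [{emotion}]: {text}")
```

The point is only that the hard part (deciding who speaks and how) is pushed onto the LLM annotation step; the rest is mechanical dispatch to the TTS engine.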