Comment by micw
With this technology, one could produce high quality audio books without having access to high quality narrators by annotating the books with the voice, speed and such things.
I wonder if a standardized markup exists to do so.
Good points, thank you! I just tested it. While ChatGPT was very good at adding generic (textual) annotations, the results for generating SSML were very poor (lack of voice names, lack of distinction between narrator and character, etc.).
A model trained for this, plus a human audit, could probably lead to very good results.
They still wouldn't be high quality. It's just not possible to capture the precise tone of voice in an annotation, and that precision I believe really makes a difference. My experience is that the deeper the narrator understands the text and conveys that understanding, the easier it becomes for me to absorb that information.
Don't end-to-end trained models already do this to some extent? Like raising the pitch towards a question mark, as a human would.
TortoiseTTS has a few examples under prompt engineering on their demo site: https://nonint.com/static/tortoise_v2_examples.html
That's a bit basic and random. Some models have the features you describe. With the better models you get a slightly different voice for text in quotes.
But the difference to good audio books is that you have
* different voices for the narrator and each character
* different emotions and/or speed in certain situations.
I guess you could use an LLM to "understand" and annotate an existing book if there's a markup, then use TTS to create an audio book from it, and so automate most of the process.
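To make the "annotate, then synthesize" idea concrete, here is a rough sketch of just the annotation step. Big caveat: a regex on double quotes stands in for the LLM here, and the `Segment` type, the role names and `split_segments` are all made up for illustration. A real pipeline would also need speaker attribution (which character is talking), which this does not attempt.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    role: str  # "narrator" or "dialogue" (hypothetical labels)
    text: str


# Naive stand-in for the LLM: anything in straight double quotes is dialogue.
QUOTE_RE = re.compile(r'"([^"]+)"')


def split_segments(paragraph: str) -> List[Segment]:
    """Split a paragraph into narrator and dialogue segments, in reading order."""
    segments: List[Segment] = []
    pos = 0
    for match in QUOTE_RE.finditer(paragraph):
        before = paragraph[pos:match.start()].strip()
        if before:
            segments.append(Segment("narrator", before))
        segments.append(Segment("dialogue", match.group(1).strip()))
        pos = match.end()
    tail = paragraph[pos:].strip()
    if tail:
        segments.append(Segment("narrator", tail))
    return segments


if __name__ == "__main__":
    for seg in split_segments('She paused. "Are you sure?" he asked.'):
        print(seg.role, "->", seg.text)
```

Each segment could then be handed to a TTS engine with a voice chosen per role. Where the heuristic breaks (nested quotes, typographic quotes, unquoted thoughts) is exactly where an LLM or a human audit would have to take over.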
There is SSML for speech markup to indicate various characteristics of speech like whispers, pronunciation, pace, emphasis, etc.
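For anyone who hasn't seen SSML, here is a small sketch of generating it from Python. It uses only elements I'm sure are in the W3C spec (`speak`, `voice`, `prosody`, `break`). The voice names are placeholders, since every TTS engine has its own. `build_ssml` is a made-up helper, and I've left out the XML namespace that strict parsers expect. Whispering is usually a vendor extension rather than core SSML, so it is not shown.

```python
import xml.etree.ElementTree as ET


def build_ssml(segments):
    """Build an SSML string from (voice_name, rate, text) tuples.

    voice_name is engine-specific (placeholders here); rate is an SSML
    prosody rate such as "slow", "medium" or "fast".
    """
    speak = ET.Element("speak", {"version": "1.1"})
    for voice_name, rate, text in segments:
        voice = ET.SubElement(speak, "voice", {"name": voice_name})
        prosody = ET.SubElement(voice, "prosody", {"rate": rate})
        prosody.text = text  # ElementTree escapes &, < and > on output
        # Short pause between segments so voices do not run together.
        ET.SubElement(speak, "break", {"time": "300ms"})
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml([
        ("narrator-voice", "medium", "She paused."),
        ("character-voice", "fast", "Are you sure?"),
    ]))
```

Building the tree with ElementTree rather than string formatting keeps the output well-formed even when the book text contains characters like `&` or `<`.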
With LLMs proving to be very good at generating code, it may be reasonable to assume they can get good at generating SSML as well.
Not sure if there is a more direct way to channel the interpretation of the tone/context/emotion etc from prose into generated voice qualities.
If we train some models on ebooks along with their professionally produced human-narrated audiobooks, with enough variety and volume of training data, the models might capture the essence of that human-interpretation of written text? Just maybe?
Amazon with its huge Audible + Kindle library -- if it can do this without violating any rights -- has a huge corpus for this. They already have "Whispersync", a feature that syncs the text in a Kindle ebook with the words in the corresponding Audible audiobook.