Comment by hammadmlk
Not much... Just the willingness to work hard on this problem instead of other problems where large revenue is perhaps quicker :)
Ingredients: decent audio scraping skills, hiring great voice actors for each language, algos to gather text/audio with diverse phonetics, decent ML skills (enough to merge the best features of a few different papers), and lots and lots of data labels (plus your own tools to get the data labeled efficiently). And finally GPUs!!!!
None of this is technically hard... the hardest thing is working with voice actors (oh man!!!)
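
For the "text with diverse phonetics" ingredient, here is a rough sketch of one way it could be done: greedily pick the sentences that add the most not-yet-covered sound units. This is only an illustration, not the actual pipeline. The function names `units` and `pick_diverse` are made up, and letter bigrams are used as a crude stand-in for phonemes or diphones, which a real system would likely take from a pronunciation lexicon.

```python
def units(sentence):
    """Crude stand-in for phonetic units: letter bigrams inside each word.

    Assumption for illustration only; real phonetic coverage would be
    measured on phonemes or diphones, not on spelling.
    """
    found = set()
    for word in sentence.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        found.update(letters[i:i + 2] for i in range(len(letters) - 1))
    return found


def pick_diverse(sentences, budget):
    """Greedily pick up to `budget` sentences, each adding the most unseen units."""
    covered = set()
    chosen = []
    remaining = list(sentences)
    while remaining and len(chosen) < budget:
        # Sentence that contributes the most units not covered so far.
        best = max(remaining, key=lambda s: len(units(s) - covered))
        if not units(best) - covered:
            break  # nothing new left to cover
        chosen.append(best)
        covered |= units(best)
        remaining.remove(best)
    return chosen, covered


if __name__ == "__main__":
    corpus = ["the cat", "the cat sat", "dog"]
    chosen, covered = pick_diverse(corpus, budget=3)
    print(chosen)
    print(len(covered), "units covered")
```

The greedy pass stops early once no remaining sentence adds anything new, so the recording script stays short, which presumably matters when every line costs voice actor time.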