Comment by toebee

Interesting. I haven't thought of that problem before. I'm guessing a large enough audio dataset for medical terminology does not exist publicly.

But AFAIK, even if you have just a few hours of audio containing the specific terminology (pronounced correctly), fine-tuning on that data will significantly improve performance on those terms.