Comment by mcswell
First, let me say that this is impressive. And then let me pose some questions:
As a linguist, I would like to know more about the kinds of languages this works well with, or does not work well with. For example, half the world's languages are tone languages, and the way tones work varies greatly among these. Some just have high and low tones, while others are considerably more complicated; Thai has high, mid, low, rising and falling. Also, tone is relative, e.g. a man's high tone might be at the same pitch as a woman's low tone. And some African languages have tones whose absolute frequencies vary across an utterance. So transcribing tone is quite a different problem from transcribing phonemes--and yet for many tone languages, the tone is crucial.
There are also rare(r) phonemes, like the clicks in many languages of southern Africa. Of course maybe they've already trained on some of these languages.
The HuggingFace demo says "Supported Languages[:] For this public demo, we've restricted transcription to low-resource languages with error rates below 10%." That's unclear: 10% word error rate, or character/phoneme error rate? The meta.com page refers to character error rate (CER); a 10% character error rate can imply a much higher word error rate (WER), since most words contain several characters/phonemes and a single wrong character makes the whole word count as wrong. That said, there are ways to get around that, like using a dictionary to select among different paths through possible character sequences so you only get known words, and adding to that a morphological parser for languages that have lots of affixes (meaning not all the word forms will be in the dictionary; think walk, walks, walked, walking, of which only the first will be in most dictionaries).
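To put a rough number on the CER vs. WER gap, here is a back-of-the-envelope sketch in Python. It assumes character errors are independent and equally likely at every position, which real ASR output does not satisfy (errors tend to cluster), so treat it as an illustration rather than a measurement. The function name `estimated_wer` is made up for this example and does not come from the demo or the meta.com page.

```python
def estimated_wer(cer: float, word_length: int) -> float:
    """Probability that a word of `word_length` characters has at least one
    wrong character, ASSUMING each character is wrong independently with
    probability `cer` (a simplification, not how real systems behave)."""
    if not 0.0 <= cer <= 1.0:
        raise ValueError("cer must be between 0 and 1")
    if word_length < 0:
        raise ValueError("word_length must be non-negative")
    # The word is correct only if every one of its characters is correct.
    return 1.0 - (1.0 - cer) ** word_length


if __name__ == "__main__":
    for n in (3, 5, 8):
        print(f"{n} chars: {estimated_wer(0.10, n):.1%}")
    # 3 chars: 27.1%
    # 5 chars: 41.0%
    # 8 chars: 57.0%
```

So under that simplified model, a 10% CER would put the WER at around 40% for five-character words, before any dictionary or morphological repair.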
Enquiring minds want to know!
I'm not an expert, but the rule of thumb is to expect something like this:
https://xkcd.com/1838/