Comment by stuffoverflow 18 hours ago

This seems like a massive improvement for openly available local ASR. Even the 300M model outperforms whisper-large-v3 according to the paper's benchmarks.

lostmsu 17 hours ago

Not sure. I recorded 3 seconds of speech (a single sentence), and the HF demo misrecognized about half of the words.

nshm 11 hours ago

This model is actually expected to be bad for popular languages. Just like the previous MMS, it is not accurate at all: it wins by supporting rare languages well, but it never had good ASR accuracy even for Swedish etc. It is more of a research thing than a real tool, unlike Whisper.

nshm 11 hours ago

Moreover, you cannot fine-tune these models for practical applications. The model was originally trained on very clean data, so the lower layers are also not very stable on diverse inputs. To fine-tune, you have to update the whole model, not just the upper layers.

- yorwba 3 hours ago
  
  In section 5.7.5, they fine-tune for "11 low-resource languages, with between 5-10 hours of training data and at least 1 hour of validation splits." "CTC fine-tuning takes ≈1 hour of walltime on 32 GPUs for the 300M scale." If that's too expensive, you also have the option of supplying additional context for the LLM-based model (section 5.5).
  As for "very clean data," see section 5.7.4: "Omnilingual + OMSF ASR was intentionally curated to represent naturalistic (i.e., often noisy) audio conditions, diverse speaker identities, and spontaneous, expressive speech."
  