Comment by oezi

Comment by oezi 8 hours ago

Unfortunately, I don't see anything in the paper about improvements to timing/timestamping. In particular, unclean word boundaries are hard with wav2vec2.

And their use of LLMs as part of the transcription process makes it likely that they trained the model to correct mispronunciations by the speaker. This lowers CER because the human reference transcription often corrects mispronunciations as well, but it reduces the model's ability to transcribe what was actually said.
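
To make the CER point concrete, here is a rough sketch with a made-up utterance (the sentences and the "libary" slip are hypothetical, not from the paper). It assumes CER is the character-level edit distance divided by the reference length, and compares a model that transcribes verbatim with one that silently fixes the mispronunciation.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: the speaker drops the first "r" in "library".
spoken = "i went to the libary"            # what was actually said
human_reference = "i went to the library"  # annotator corrected the slip

verbatim_output = spoken             # model that transcribes what it hears
corrected_output = human_reference   # model that "fixes" the speaker

# Scored against the corrected human reference, the verbatim model is penalised.
cer_verbatim_vs_human = cer(human_reference, verbatim_output)
cer_corrected_vs_human = cer(human_reference, corrected_output)

# Scored against what was actually said, the ranking flips.
cer_corrected_vs_spoken = cer(spoken, corrected_output)

print(cer_verbatim_vs_human, cer_corrected_vs_human, cer_corrected_vs_spoken)
```

So on a benchmark whose references are cleaned up, the correcting model looks strictly better, even though it is the one that deviates from the audio.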