Comment by killerstorm

Comment by killerstorm a year ago

Speech is inherently easier to represent as a sequence of tokens than a high-resolution image.

Best speech to text is already NN transformer based anyway, so in theory it's only better to use a combined model