Comment by killerstorm
Comment by killerstorm a day ago
Speech is inherently easier to represent as a sequence of tokens than a high-resolution image.
Best speech to text is already NN transformer based anyway, so in theory it's only better to use a combined model