Comment by killerstorm 10 months ago
Speech is inherently easier to represent as a sequence of tokens than a high-resolution image.
The best speech-to-text systems are already transformer-based neural networks anyway, so in theory a combined model can only be better.