Comment by a2128 a day ago

5 replies

There's a lot of data loss and guessing with STT/TTS.

An STT model might misrecognize a word, but an audio LLM may recover the true word from the broader context. A TTS model has to guess the inflection and can get it completely wrong, but an audio LLM could learn how to talk naturally and with the right tone (e.g. use a higher tone if it's interjecting).

Speaking of interjection, an STT/TTS system will never interject because it relies on VAD and heuristics to guess when to start or stop talking, and generally the rule is to only talk after the user has stopped talking. An audio LLM could learn to converse naturally, avoid taking up too much conversation time, or even talk with a group of people.
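The VAD-plus-heuristic turn-taking rule described above can be sketched in a few lines. This is a minimal illustration, not code from any real pipeline: the energy threshold and silence-window length are made-up assumptions.

```python
# Minimal sketch of the "only talk after the user stopped talking" rule:
# the assistant may start speaking only once the last `silence_frames`
# audio frames all fall below an energy threshold. Both parameter values
# are illustrative assumptions, not taken from any real system.

def can_start_talking(frame_energies, threshold=0.01, silence_frames=25):
    """Return True once the trailing `silence_frames` frames are all
    quieter than `threshold`, i.e. the user appears to have stopped."""
    if len(frame_energies) < silence_frames:
        return False
    return all(e < threshold for e in frame_energies[-silence_frames:])
```

By construction, a rule like this can never interject mid-utterance; it only ever fires after a stretch of silence.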

An audio LLM could also produce music or sounds, or tell you what a song is when you hum it. There are a lot of new possibilities.

I say "could learn" for most of this because it requires good training data. From my understanding, most of these models are currently trained on normal text datasets synthetically turned into voice with TTS, so they are effectively no better than a normal STT/TTS system; it's a good way to prove an architecture, but it doesn't demonstrate the full capabilities.

langcss a day ago

You need a lot more power. I found gpt4o struggles with basic OCR of printed text by hallucinating a lot, while the tesseract engine (old skool) gets it perfect. You need the model to be powerful enough to do everything.

You can work around this by the way by sending the output through a checking stage.

So picture -> gpt4o -> out1, picture -> tesseract -> out2, out1,out2 -> llm.

Might work for sound too.
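The checking-stage pipeline above can be sketched as follows. The `ocr_llm`, `ocr_tesseract`, and `check_llm` callables are hypothetical stand-ins: real code would wire in e.g. an OpenAI vision call and `pytesseract.image_to_string`.

```python
# Sketch of the two-pass pipeline: run LLM-based OCR and classical OCR on
# the same image, then ask a text LLM to reconcile the two outputs.
# All three callables are placeholders (assumptions), injected by the caller.

def reconcile_prompt(out1: str, out2: str) -> str:
    """Build the prompt for the final checking stage (out1, out2 -> llm)."""
    return (
        "Two OCR passes were run on the same image.\n"
        f"Pass A (vision LLM):\n{out1}\n\n"
        f"Pass B (tesseract):\n{out2}\n\n"
        "Produce the most likely original text, preferring Pass B for "
        "character-level accuracy and Pass A for word-level plausibility."
    )

def ocr_pipeline(image_bytes: bytes, ocr_llm, ocr_tesseract, check_llm) -> str:
    """picture -> gpt4o -> out1, picture -> tesseract -> out2, (out1, out2) -> llm."""
    out1 = ocr_llm(image_bytes)        # e.g. a GPT-4o vision request
    out2 = ocr_tesseract(image_bytes)  # e.g. pytesseract.image_to_string
    return check_llm(reconcile_prompt(out1, out2))
```

The same shape would carry over to audio: two transcription passes plus a reconciling LLM.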

  • falcor84 a day ago

    Interesting. I've actually been using gpt4o extensively for OCR and haven't encountered any significant issues - could I ask you to give an example of an image of (otherwise legible) text that it hallucinates on?

    • schrodinger 20 hours ago

      Same, it's perfect at OCR. Generating an image with text in it, however… nope!

  • killerstorm a day ago

    Speech is inherently easier to represent as a sequence of tokens than a high-resolution image.

    The best speech-to-text systems are already transformer-based neural networks anyway, so in theory a combined model can only be better.