Comment by twoodfin

Comment by twoodfin a year ago

I’m not clear on the virtues or potential of a model like this over a pure text model using STT/TTS to achieve similar results.

Is the idea that as these models grow in sophistication they can properly interpret (or produce) inflection, cadence, emotion that’s lost in TTS?

a2128 a year ago

There's a lot of data loss and guessing with STT/TTS.

An STT model might misrecognize a word, but an audio LLM may understand the true word because of the broad context. A TTS model needs to guess the inflection and it can get it completely wrong, but an audio LLM could understand how to talk naturally and with what tone (e.g. use a higher tone if it's interjecting)

Speaking of interjection, an STT/TTS system will never interject because it relies on VAD and heuristics to guess when to start talking or when to stop, and generally the rule is to only talk after the user has stopped talking. An audio LLM could learn how to converse naturally, avoid taking up too much conversation time, or even talk with a group of people.
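
To make the "only talk after the user stopped talking" rule concrete, here is a minimal sketch of an energy-based endpointing heuristic. It is not how any particular product does it; the function names, the RMS threshold, and the 25-frame hangover are all assumptions picked for illustration.

```python
import math
from typing import List, Optional, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1.0, 1.0])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def end_of_turn_index(
    frames: List[Sequence[float]],
    energy_threshold: float = 0.02,  # assumed value; real systems tune this
    hangover_frames: int = 25,       # assumed: ~0.5 s of silence at 20 ms frames
) -> Optional[int]:
    """Return the frame index at which the system decides the user is done.

    The rule is the simple one described above: wait until speech has been
    heard, then wait for an unbroken run of quiet frames. Returns None if
    that never happens.
    """
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= energy_threshold:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= hangover_frames:
                return i
    return None


# Hypothetical input: 10 loud frames followed by 30 silent ones.
speech = [[0.5] * 160 for _ in range(10)]
silence = [[0.0] * 160 for _ in range(30)]
turn_end = end_of_turn_index(speech + silence)
```

Note that a heuristic like this can only react after the silence has already happened, so it has no way to interject mid-sentence, and a long thinking pause looks the same as the end of a turn.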

An audio LLM could also produce music or sounds or tell you what the song is when you hum it. There are a lot of new possibilities.

I say "could learn" for most of this because it requires good training data, but from my understanding most of these are currently just trained with normal text datasets synthetically turned into voice with TTS, so they are effectively no better than a normal STT/TTS system; it's a good way to prove an architecture but it doesn't demonstrate the full capabilities

langcss a year ago

You need a lot more power. I found gpt4o struggles doing basic OCR of printed text by hallucinating a lot, while the Tesseract engine (old skool) gets it perfect. You need the model to be powerful enough to do everything.
You can work around this by the way by sending the output through a checking stage.
So picture -> gpt4o -> out1, picture -> tesseract -> out2, out1,out2 -> llm.
Might work for sound too.
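
A rough sketch of that checking stage, assuming the two OCR outputs are already available as strings. The function names and the prompt wording are made up for illustration, and the actual calls to gpt4o, Tesseract, and the final LLM are left out on purpose.

```python
import difflib
from typing import List, Optional, Tuple


def disagreements(out1: str, out2: str) -> List[Tuple[str, str]]:
    """Return (out1_span, out2_span) pairs where the two transcripts differ.

    The comparison is word-level, using difflib.SequenceMatcher.
    """
    a, b = out1.split(), out2.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


def build_check_prompt(out1: str, out2: str) -> Optional[str]:
    """Build the text for the final 'out1,out2 -> llm' step.

    Returns None when both engines agree, so no extra LLM call is needed.
    The prompt wording is a hypothetical example.
    """
    diffs = disagreements(out1, out2)
    if not diffs:
        return None
    lines = ["Two OCR engines disagree on these spans. Pick the likelier reading:"]
    for left, right in diffs:
        lines.append(f"- engine A: {left!r} | engine B: {right!r}")
    return "\n".join(lines)


# Hypothetical outputs from the two engines for the same picture.
out1 = "Invoice total 1,250.00 USD"
out2 = "Invoice total 1,280.00 USD"
found = disagreements(out1, out2)
prompt = build_check_prompt(out1, out2)
```

Only sending the spans that differ keeps the last step small, and the same comparison would apply to two transcripts of the same audio.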

- falcor84 a year ago
  
  Interesting, I've actually been using gpt4o extensively for OCR and didn't encounter any significant issues - could I ask you to please give an example of an image of (otherwise legible) text that it hallucinates on?
  
  langcss a year ago
  
  I'll send you an email.
  
  schrodinger a year ago
  
  Same, it's perfect at OCR. Generating an image with text in it however… nope!
  
- killerstorm a year ago
  
  Speech is inherently easier to represent as a sequence of tokens than a high-resolution image.
  The best speech-to-text is already transformer-based anyway, so in theory a combined model can only be better.
  

spuz a year ago

Personally, I'm very much looking forward to using a speech model like OpenAI's advanced voice mode to learn a language. It can already do things like speak quickly or slowly, which traditional TTS systems can't. Also, in theory a speech model could tell me if my pronunciation is accurate. It could correct me by repeating my incorrect pronunciation and then providing the correct pronunciation. I don't actually know how capable OpenAI's advanced voice mode is in this regard because I haven't seen anyone actually test this, but I'm extremely curious to try it myself. If other voice models can achieve this, then it will be an incredible tool for language learning.

paulryanrogers a year ago

Traditional TTS can certainly be cranked up in speed. Low/no vision users often listen at 2-3x.

theptip a year ago

Lots has been written on this subject, check out OpenAI’s papers on -O for example.

Latency is a big one due to batching. You can’t really interrupt the agent, which makes actual conversation more clunky. And yes, multimodal has better understanding. (I haven’t seen analysis of perception of emotions, has anyone seen analysis of this capability for GPT-O?)

Reubend a year ago

Essentially, there's data loss from audio -> text. Sometimes that loss is unimportant, but sometimes avoiding it meaningfully improves output quality.

However, there are some other potential fringe benefits here: improving the latency of replies, improving speaker diarization, and reacting to pauses better for conversations.

fragmede a year ago

Really

Yeah, that's the point. Without punctuation, no one can tell what inflection my "really" above should have, but even if it'd been "Really?" or "Really!", there's still room for interpretation. With a bet on voice interfaces needing a Google moment (wherein, prior to Google, search was crap) to truly become successful (by interpreting and creating inflection, cadence, emotion, as you mentioned), creating such a model makes a lot of sense.

bubaumba a year ago

> I’m not clear on the virtues or potential of a model like this over a pure text model

You can't put a pure text interface with a keyboard on a robot. It would just become a wheeled computer.

actually this is a cool thing as a companion / assistant.
