Comment by tomp
The problem with all these speech-to-speech multimodal models is that if you wanna do anything other than just talk, you need transcription.
So you're back at square one.
Current AI (even GPT-4o) simply isn't capable enough to do useful stuff. You need to augment it somehow - either modularize it, or add RAG, or similar - and for all of those, you need the transcript.
> Current AI (even GPT-4o) simply isn't capable enough to do useful stuff. You need to augment it somehow - either modularize it, or add RAG, or similar
I am sympathetic to this view, but I strongly disagree that you need a transcript. Think about it a bit more!!