Comment by artsalamander 15 hours ago
I've been building solutions for real-time voice -> LLM -> voice output, and I think the most exciting part of what you're building is the streaming neural audio codec, since you can't truly stream STT with Whisper.
However, from a product point of view I wouldn't necessarily want to pipe that straight into an LLM and have it reply; in a lot of use-cases there needs to be a tool/function-calling step before a reply. Down to chat with anyone reading this who is working along these lines!
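To make the shape of that pipeline concrete, here's a minimal sketch of what I mean by a tool-calling step gating the reply. All of the names here (`detect_tool_call`, `TOOLS`, the toy keyword matcher standing in for the LLM's function-calling pass) are illustrative, not any real API:

```python
# Hypothetical pipeline stage: transcribed speech is routed through a
# tool/function-calling step before any spoken reply is generated.

# Toy tool registry; in practice these would be real functions/APIs.
TOOLS = {
    "get_time": lambda: "12:00",
    "set_timer": lambda minutes="5": f"timer set for {minutes} min",
}

def detect_tool_call(transcript: str):
    """Toy intent matcher standing in for an LLM function-calling pass."""
    t = transcript.lower()
    if "timer" in t:
        return ("set_timer", {"minutes": "10"})
    if "time" in t:
        return ("get_time", {})
    return None  # no tool needed

def reply(transcript: str) -> str:
    call = detect_tool_call(transcript)
    if call is not None:
        name, args = call
        result = TOOLS[name](**args)
        # In a real system the tool result would be fed back to the LLM
        # (and then to TTS) to phrase the spoken reply.
        return f"[{name}] {result}"
    # No tool matched: fall through to a direct reply.
    return f"(direct reply to: {transcript!r})"
```

The point is just that the reply path forks on the tool-call decision before anything is spoken, which is where the latency budget gets tight.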
edit: tincans as mentioned below looks excellent too
editedit: noooo, apparently tincans development has ended. There's 10000% space for something in this direction. Chris, if you read this, please let me pitch you on the product/business use-cases this solves regardless of how good LLMs get...
> there needs to be a tool/function calling step before a reply
I built that almost exactly a year ago :) it was good, but not fast enough - hence building the joint model.