Comment by armcat 4 days ago

This is so awesome, well done LemonSlice team! The detail on the ASR->LLM->TTS pipeline is super interesting, and I agree, you can make it really fast (I did something similar myself as a 2-hour hobby project: https://github.com/acatovic/ova). I've been following full-duplex models as well, and so far I couldn't get even PersonaPlex to run properly (without choppiness/latency) -- but have you peeps tried Sesame, e.g. https://app.sesame.com/?
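For readers unfamiliar with the cascaded approach being discussed, the ASR->LLM->TTS pipeline can be sketched as three stages run back to back, with end-to-end latency being the sum of the stages. The functions below are stand-in stubs (not LemonSlice's or OVA's actual code); a real system would call a streaming ASR model, an LLM, and a TTS engine.

```python
import time

def asr(audio: bytes) -> str:
    """Stub: transcribe audio to text (real systems stream this)."""
    return "hello there"

def llm(prompt: str) -> str:
    """Stub: generate a reply to the transcript."""
    return f"You said: {prompt}"

def tts(text: str) -> bytes:
    """Stub: synthesize the reply back to audio."""
    return text.encode("utf-8")

def voice_turn(audio: bytes) -> tuple[bytes, float]:
    """Run one conversational turn and report wall-clock latency."""
    start = time.perf_counter()
    transcript = asr(audio)        # stage 1: speech -> text
    reply = llm(transcript)        # stage 2: text -> text
    speech = tts(reply)            # stage 3: text -> speech
    return speech, time.perf_counter() - start

speech, latency = voice_turn(b"\x00\x01")
print(speech)  # b'You said: hello there'
```

In practice the stages are pipelined (the LLM and TTS start on partial output from the previous stage) rather than run strictly sequentially, which is how cascaded systems get the latency down.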

I played around with your avatars, and one thing they lack: they're "not patient", rushing the user. Maybe something to try and fine-tune there? Great work overall!

andrew-w 4 days ago

Thank you! Impressive demo with OVA. It still feels very snappy, even running fully locally. It will be interesting to see how video plays out in that regard. I think we're still at least a year away from the models being good enough and small enough to run on consumer hardware. We compared 6 of the major voice providers on TTFB (time to first byte), but didn't try Sesame -- we'll need to give that one a look. https://docs.google.com/presentation/d/18kq2JKAsSahJ6yn5IJ9g...
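For context on what a TTFB comparison like this measures: the clock stops when the first audio chunk arrives from the provider, not when the full response finishes. A minimal sketch, using a fake streaming generator as a stand-in for any real provider's streaming API (the delays are made-up numbers, not measured figures):

```python
import time

def fake_tts_stream(text: str):
    """Stub provider: yields audio chunks with simulated delays."""
    time.sleep(0.05)           # model "thinking" before the first chunk
    for word in text.split():
        yield word.encode()
        time.sleep(0.01)       # inter-chunk gap

def measure_ttfb(stream) -> float:
    """Time from request until the first audio chunk lands."""
    start = time.perf_counter()
    next(iter(stream))         # block until the first chunk arrives
    return time.perf_counter() - start

ttfb = measure_ttfb(fake_tts_stream("hello world"))
print(f"TTFB: {ttfb * 1000:.0f} ms")  # roughly 50 ms with this stub
```

TTFB matters more than total synthesis time for conversational feel, because playback can start as soon as the first chunk arrives.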

lcolucci 4 days ago

This is good feedback, thanks! The "not patient" feeling probably comes from our VAD being set to "eager mode" so that the latency is better. VAD (voice activity detection, i.e. deciding when the human has actually stopped talking) is a tough problem in all of voice AI. It basically adds latency on top of whatever your pipeline's base latency is. Speech2Speech models are better at this.