Comment by malevolent-elk

I've been playing around with this workflow too. I'm using a "streaming" setup with Whisper (chunking samples so transcription starts while the user is still talking), which pipes to Mistral 8B as a conversation arbiter. The model walks through a preset IVR tree whose nodes call tools, etc. The LLM isn't generating responses on its own, though; it just selects nodes in the tree, each with a canned TTS output.
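
A minimal sketch of what that kind of node selection can look like, for illustration only. The tree contents, the node names, `choose_next_node`, and the `llm` callable are all hypothetical stand-ins, not the actual setup described above. The idea is that the model's reply only counts if it names one of the current node's children; anything else keeps the caller on the same node.

```python
from typing import Callable

# Hypothetical IVR tree: each node has a canned TTS line and a list of child node IDs.
IVR_TREE = {
    "root": {"tts": "Do you need billing or support?", "children": ["billing", "support"]},
    "billing": {"tts": "Connecting you to billing.", "children": []},
    "support": {"tts": "Connecting you to support.", "children": []},
}


def choose_next_node(current: str, transcript: str, llm: Callable[[str], str]) -> str:
    """Ask the LLM to pick one child of `current`; stay put if the reply is invalid."""
    children = IVR_TREE[current]["children"]
    if not children:
        # Leaf node: nothing to choose from.
        return current

    prompt = (
        "You are routing a phone call. Reply with exactly one of these options "
        f"and nothing else: {', '.join(children)}.\n"
        f"Caller said: {transcript}"
    )
    reply = llm(prompt).strip().lower()

    # Only accept replies that name a valid child, so the model can never
    # leave the tree or invent a node.
    return reply if reply in children else current


if __name__ == "__main__":
    # Stub LLM for demonstration; a real setup would call the model here.
    fake_llm = lambda prompt: "billing"
    node = choose_next_node("root", "I have a question about my invoice", fake_llm)
    print(node, "->", IVR_TREE[node]["tts"])
```

Staying on the current node after an invalid reply means the canned prompt can simply be replayed, which keeps the LLM from ever speaking freely.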

There's a "pause length" parameter that tries to decide whether the user has finished talking before transcripts are passed to the LLM, nothing fancy. If you have any recs, I'd welcome them: I'm still working through how to properly handle the audio input, and whether a prompting setup can manage the LLM with enough fidelity to scrap the IVR tree. It works decently well, but there's lots of room for improvement.
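
For reference, a rough sketch of a "pause length" style end-of-turn check, assuming simple energy-based silence detection on fixed-size frames. The class name, the thresholds, and the frame size are assumptions picked for illustration, not the values or method used in the setup above; a real pipeline might use a proper VAD instead of raw RMS.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PauseDetector:
    """Flags end of turn once silence has lasted at least `pause_length_ms`."""

    pause_length_ms: int = 800       # assumed default; tune per use case
    frame_ms: int = 20               # duration of each audio frame fed in
    energy_threshold: float = 0.01   # RMS below this is treated as silence
    _silent_ms: int = 0
    _heard_speech: bool = False

    def feed(self, frame: Sequence[float]) -> bool:
        """Feed one frame of float samples in [-1, 1]; return True when the turn ends."""
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5 if frame else 0.0

        if rms >= self.energy_threshold:
            # Speech resets the silence timer.
            self._heard_speech = True
            self._silent_ms = 0
            return False

        if not self._heard_speech:
            # Leading silence before any speech never ends a turn.
            return False

        self._silent_ms += self.frame_ms
        if self._silent_ms >= self.pause_length_ms:
            # Turn is over: reset state so the next utterance starts clean.
            self._heard_speech = False
            self._silent_ms = 0
            return True
        return False


if __name__ == "__main__":
    detector = PauseDetector(pause_length_ms=100)
    speech, silence = [0.5] * 320, [0.0] * 320
    frames = [speech] * 3 + [silence] * 5
    print([detector.feed(f) for f in frames])
```

The trade-off is the usual one: a short pause length cuts people off mid-thought, while a long one makes the system feel sluggish.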