Comment by vessenes

Interesting. I love the focus on latency here; they claim ~200ms in practice on a local GPU. It's backed by a 7B transformer model, so it's not going to be super smart. If we imagine a 70B model has something like 1s of latency, then there's probably a systems architecture with one or two intermediate 'levels' of response: something to cue you verbally that the model is talking, something to give a quick early reaction (a 7B / Phi-3-sized model), and then the big model. Maybe you'd give the Phi-3-sized model a reconciliation task: take the big model's actually-correct answer, amend the quick reaction, apologize if necessary, and so on.
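To make the idea concrete, here's a minimal sketch of that tiered orchestration in Python with asyncio. Everything in it is hypothetical: fast_model, slow_model, and reconcile are stand-ins simulated with sleeps (using the ~200ms and ~1s latencies mentioned above), not a real inference API.

```python
import asyncio


async def fast_model(prompt: str) -> str:
    # Stand-in for a 7B / Phi-3-sized model at ~200ms latency.
    await asyncio.sleep(0.2)
    return f"[quick take] {prompt!r}: probably X."


async def slow_model(prompt: str) -> str:
    # Stand-in for a 70B-class model at ~1s latency.
    await asyncio.sleep(1.0)
    return f"[considered answer] {prompt!r}: actually Y, because Z."


async def reconcile(quick: str, correct: str) -> str:
    # The reconciliation task, also run on the small model: take the big
    # model's answer and amend the quick reaction, apologizing if they differ.
    await asyncio.sleep(0.2)
    return f"Correction: {correct} (my earlier take was: {quick})"


async def respond(prompt: str) -> None:
    slow_task = asyncio.create_task(slow_model(prompt))  # big model starts now
    print("(cue) The model is talking now...")    # level 1: verbal cue
    quick = await fast_model(prompt)
    print(quick)                                  # level 2: quick early reaction
    correct = await slow_task
    print(await reconcile(quick, correct))        # level 3: reconciled answer


asyncio.run(respond("What is the capital of Australia?"))
```

The key design choice is that the slow model is kicked off immediately and runs concurrently, so its ~1s cost overlaps the cue and the quick reaction rather than adding to them.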

Anecdotally, I think many people's brains work this way -- quick response, possible edit / amendment a second or two in. Of course, we all know people at both ends of the spectrum away from this: no amendment at all, and long pauses followed by fully reasoned answers.