Comment by Tmpod
I've been interested in dictation for a while, but I don't want to send any audio to a remote API; it all has to be local. Having tried just a couple of models (namely the one used by the FUTO Keyboard), I kinda feel like we're not quite there yet.
My biggest gripe is probably not being able to get decent content out of a thought stream: the models can't properly filter out the pauses and "uuuuhmms", much less handle on-the-fly corrections to what I've been saying, like going back and repeating something with a slight variation.
This is a challenging problem I'd love to see tackled well by open models I can run on my computer or phone. Are there new models more capable of this? Or is it not just a model thing, and am I missing a good app too?
In the meantime, I'll keep typing, even though it can be quite a bit less convenient, especially for note-taking on the go.
Have you tried Whisper itself? It's open-weights.
One of the features of the project posted above is "transformations" that you can run on transcripts. These feed the text into an LLM to clean it up. If you're willing to pay for the tokens, I think you could not only remove filler words, but could probably even get the semantically aware editing (corrections) you're talking about.
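To make the "transformation" idea concrete, here is a rough Python sketch of what such a step might look like. None of this is the project's actual code: `CLEANUP_INSTRUCTIONS`, `build_cleanup_prompt`, `transform`, and `offline_stand_in` are names I made up for illustration, and the `llm` argument is just any function that takes a prompt and returns text (it could call a hosted API or a model running locally). The stand-in only strips "uh"/"um"-style fillers with a regex so the example runs without a model; it does not handle self-corrections, which is the part you would actually want an LLM for.

```python
import re
from typing import Callable

# Hypothetical instruction text; the project's real prompts are not shown in this thread.
CLEANUP_INSTRUCTIONS = (
    "Clean up this dictated transcript. Remove filler words and false starts. "
    "When the speaker corrects themselves, keep only the corrected version. "
    "Do not add new content."
)


def build_cleanup_prompt(transcript: str) -> str:
    """Combine the instructions and the raw transcript into a single prompt."""
    return f"{CLEANUP_INSTRUCTIONS}\n\nTranscript:\n{transcript.strip()}"


def transform(transcript: str, llm: Callable[[str], str]) -> str:
    """Run one cleanup pass; `llm` is any prompt -> text callable (assumed)."""
    return llm(build_cleanup_prompt(transcript)).strip()


# Matches standalone "uh", "um", "uhm", "uuuhm", etc., plus a trailing comma/period.
FILLER = re.compile(r"\b(?:u+h*m+|u+h+)\b[,.]?\s*", re.IGNORECASE)


def strip_fillers(text: str) -> str:
    """Crude regex pass: drop filler tokens, then collapse leftover whitespace."""
    return re.sub(r"\s{2,}", " ", FILLER.sub("", text)).strip()


def offline_stand_in(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline.

    It ignores the instructions and only applies the regex pass to the
    transcript part of the prompt.
    """
    transcript = prompt.split("Transcript:\n", 1)[1]
    return strip_fillers(transcript)


if __name__ == "__main__":
    raw = "So, uuuhm, I think uh we should go"
    print(transform(raw, offline_stand_in))
```

Swapping `offline_stand_in` for a function that calls a model on your own machine would keep everything on-device, at the cost of whatever quality gap that model has.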