Comment by pstroqaty
If anyone's interested in a janky-but-works-great dictation setup on Linux, here's mine:
On key press, start recording microphone to /tmp/dictate.mp3:
# Save up to 10 mins. Minimize buffering. Save pid
ffmpeg -f pulse -i default -ar 16000 -ac 1 -t 600 -y -c:a libmp3lame -q:a 2 -flush_packets 1 -avioflags direct -loglevel quiet /tmp/dictate.mp3 &
echo $! > /tmp/dictate.pid
On key release, stop recording, transcribe with whisper.cpp, trim whitespace and print to stdout: # Stop recording
kill $(cat /tmp/dictate.pid)
# Transcribe
whisper-cli --language en --model $HOME/.local/share/whisper/ggml-large-v3-turbo-q8_0.bin --no-prints --no-timestamps /tmp/dictate.mp3 | tr -d '\n' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//'
I keep these in a dictate.sh script and bind to press/release on a single key. A programmable keyboard helps here. I use https://git.sr.ht/%7Egeb/dotool to turn the transcription into keystrokes. I've also tried ydotool and wtype, but they seem to swallow keystrokes. bindsym XF86Launch5 exec dictate.sh start
bindsym --release XF86Launch5 exec echo "type $(dictate.sh stop)" | dotoolc
This gives a very functional push-to-talk setup.I'm very impressed with https://github.com/ggml-org/whisper.cpp. Transcription quality with large-v3-turbo-q8_0 is excellent IMO and a Vulkan build is very fast on my 6600XT. It takes about 1s for an average sentence to appear after I release the hotkey.
I'm keeping an eye on the NVidia models, hopefully they work on ggml soon too. E.g. https://github.com/ggml-org/whisper.cpp/issues/3118.
before i even bother opening that github: does it work on windows? so far, all of the whisper "clones" run poorly, if at all, on windows. I do have a 3060 and a 1070ti i could use just for whisper on linux, but i have this 3090 on my windows desktop that works "fine" for whisper, TTS, SD, LLM.
Whisper on windows, the openai-whisper, doesn't have these q8_0 models, it has like 8 models, and i always get an error about triton cores (something about timeptamping i guess), which windows doesn't have. I've transcribed >1000 hours of audio with this setup, so i'm used to the workflow.