Comment by kevmo314
I have a sneaking suspicion it's because they lifted the model architecture almost directly from Parakeet: https://jordandarefsky.com/blog/2024/parakeet/
Parakeet references WhisperD, which is at https://huggingface.co/jordand/whisper-d-v1a and doesn't include a full list of the non-speech events it was trained on, beyond "(coughs)" and "(laughs)".
Not saying the authors didn't do anything interesting here. They put in the work to reproduce the blog post and open source it, a praiseworthy achievement in itself, and they even credit Parakeet. But the absence of a full tag list might have a more straightforward explanation: they may simply not have one.
You're absolutely right. We used Jordan's Whisper-D, and he was generous enough to offer some guidance along the way.
It's also a valid criticism that we haven't yet audited the dataset for the full set of existing tags. That's something we'll be improving soon.
As for Dia’s architecture, we largely followed existing models to build the 1.6B version. Since we only started learning about speech AI three months ago, we chose not to innovate too aggressively early on. That said, we're planning to introduce MoE and Sliding Window Attention in our larger models, so we're excited to push the frontier in future iterations.
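For readers unfamiliar with Sliding Window Attention: it restricts each token to attending only to a fixed-size window of recent positions, which bounds attention cost for long sequences. The sketch below is purely illustrative of the masking idea and is not Dia's implementation; the function name and parameters are invented for this example.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window mask: position i may attend only to
    positions j with i - window < j <= i (illustrative sketch)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# With window=3, token 5 can see tokens 3, 4, and 5, but not token 2.
mask = sliding_window_mask(6, 3)
```

In a transformer layer, this boolean mask would be applied to the attention logits (disallowed positions set to -inf) before the softmax.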