Comment by toebee
You're absolutely right. We used Jordan's Whisper-D, and he was generous enough to offer some guidance along the way.
It's also a valid criticism that we haven’t yet audited the dataset for the existing list of tags. That’s something we’ll be improving soon.
As for Dia’s architecture, we largely followed existing models to build the 1.6B version. Since we only started learning about speech AI three months ago, we chose not to innovate too aggressively early on. That said, we're planning to introduce MoE and Sliding Window Attention in our larger models, so we're excited to push the frontier in future iterations.
I’m curious what differentiates it from Parakeet? I was listening to some of the demos on the Parakeet announcement and they sound very similar to your examples - are they trained on the same data? Are there benefits to using Dia over Parakeet?