Comment by a2128

It's not, but it probably won't recognize it as nonsense. According to the paper,

> we construct a dataset named InstructS2S-200K by rewriting existing text instruction data and performing speech synthesis

It has only been trained on questions spoken by TTS, so it has never seen (heard) nonsense. Most likely it'll just hallucinate that you asked some question and generate some answer instead of asking if you're good. There just aren't many audio datasets with real voices; there's no audio version of StackOverflow to be scraped.