Comment by cvzakharchenko a day ago
So it's not STT -> LLM -> TTS? If I scream Chewbacca noises as input, will the model recognize it as nonsense, or will it interpret it with some lousy STT as some random words?
What about every movie that's been made? Although it might need to stick to those more than 100 yrs old to avoid copyright law?
I used to have fun with that. Set Google Translate to Chinese (or some other language I don't speak, though tonal languages seemed to work better), make some vague noises into it, and get coherent but crazy phrases out in English.
It's not, but it probably won't recognize it as nonsense. According to the paper,
> we construct a dataset named InstructS2S-200K by rewriting existing text instruction data and performing speech synthesis
It has only been trained on questions spoken by TTS; it has never seen (heard) nonsense. Most likely it'll just hallucinate that you asked some question and generate an answer instead of asking if you're good. There just aren't many audio datasets with real voices, and there's no audio version of StackOverflow to scrape.
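
To make the point concrete, here's a rough sketch of what a "rewrite text instructions, then synthesize speech" pipeline could look like. Everything in it is made up for illustration: `rewrite_for_speech`, `synthesize`, and `build_dataset` are hypothetical names, the rewrite step is a crude regex cleanup, and the "TTS" is a placeholder that just returns bytes. It is not the paper's actual tooling.

```python
import re


def rewrite_for_speech(instruction: str) -> str:
    """Hypothetical stand-in for the rewriting step: drop markup that doesn't read aloud."""
    text = re.sub(r"`([^`]*)`", r"\1", instruction)  # unwrap inline-code backticks
    text = re.sub(r"[*_#>]+", "", text)              # drop common markdown markers
    return re.sub(r"\s+", " ", text).strip()         # collapse whitespace


def synthesize(text: str) -> bytes:
    """Placeholder for a real TTS engine (assumption): returns fake 'audio' bytes."""
    return text.encode("utf-8")


def build_dataset(pairs):
    """Turn (text instruction, text response) pairs into speech-instruction samples."""
    dataset = []
    for instruction, response in pairs:
        spoken = rewrite_for_speech(instruction)
        dataset.append({
            "instruction_text": spoken,
            "instruction_audio": synthesize(spoken),  # always a clean, well-formed question
            "response_text": response,
        })
    return dataset


samples = build_dataset([("**What** is `TTS`?", "Text-to-speech.")])
```

Every audio input produced this way starts life as a well-formed text question, so nothing in a dataset built like this would ever sound like Chewbacca.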