Comment by Reubend
Essentially, there's data loss from audio -> text. Sometimes that loss is unimportant, but sometimes avoiding it meaningfully improves output quality.
There are also some other potential fringe benefits here: lower reply latency, better speaker diarization, and better handling of pauses in conversations.