Comment by LoganDark
According to the article, nearly half of the dataset is synthetic (8T out of 17T tokens, roughly 47%). I don't know what constitutes "a breadth of state-of-the-art rephrasing approaches", but I have limited confidence in models trained on LLM output, so I hope it wasn't that.
> but I have limited confidence in models trained on LLM output, so I hope it wasn't that.
That's misguided. Models have been trained on synthetic data for two-plus years already. The "model collapse" myth comes from a very weak paper that got way more attention than it deserved (because negativity sells, I guess). In practice, every lab out there is doing this, because it works.