Comment by HighFreqAsuka 21 hours ago
Take a look at The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text (https://arxiv.org/pdf/2506.05209). They build a reasonable 7B-parameter model using only openly licensed data.
Mostly, anyway: they risked some legal contamination by including Whisper-derived transcripts and web text that can have licensing gotchas. Other than that, it's a great collection for low-risk training.
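
If you want to poke at the data yourself, here's a minimal sketch using the Hugging Face datasets library. The dataset ID and field names below are placeholders, not confirmed; check the paper/repo for the actual Hub location and schema.

    # Sketch: stream a slice of an openly licensed corpus from the HF Hub.
    # Dataset ID is hypothetical -- see the Common Pile paper for the real one.
    from datasets import load_dataset

    # streaming=True avoids downloading a multi-TB corpus up front
    ds = load_dataset(
        "common-pile/example-subset",  # placeholder ID
        split="train",
        streaming=True,
    )

    # Peek at a few documents (the "text" field name is an assumption)
    for i, doc in enumerate(ds):
        print(doc.get("text", "")[:200])
        if i >= 2:
            break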