Comment by InvisibleUp a day ago

If the output of this were even somewhat coherent, it would disprove the argument that massive amounts of copyrighted works are required to train an LLM. Unfortunately, that does not appear to be the case here.

HighFreqAsuka 21 hours ago

Take a look at The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text (https://arxiv.org/pdf/2506.05209). They train a reasonably capable 7B-parameter model using only openly licensed data.

  • nickpsecurity 18 hours ago

    They mostly do. They risked legal contamination by including Whisper-derived transcripts and web text, both of which might have licensing gotchas. Other than that, it's a great collection for low-risk training.