Comment by nickpsecurity 4 days ago
I’m still waiting for a large, OSS model with 100% legal pre-training data. We don’t even have a 1B model that I’m sure meets that standard; the closest is a fair-trained model aimed at lawyers that claims it.
I think someone running a bunch of epochs of a 30B or 70B model over Project Gutenberg would be a nice start (a rough sketch of that is below). We could do continued pre-training from there.
So, counting models that are both legal and at least trainable (open weights), performance can only go up from there.
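For concreteness, here's a minimal sketch of what that continued pre-training could look like with Hugging Face transformers/datasets. The base checkpoint ("allenai/OLMo-1B") is just an illustrative open-weights pick, the "deepmind/pg19" Gutenberg dump and all hyperparameters are assumptions, not a recommendation:

    # Sketch: continued pre-training of an open-weights causal LM on
    # Project Gutenberg text. Model/dataset choices are illustrative.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "allenai/OLMo-1B"  # any open-weights base would do
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # collator needs padding
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # PG19: pre-1919 Project Gutenberg books; swap in whichever dump
    # you trust to be fully public domain.
    books = load_dataset("deepmind/pg19", split="train")

    def tokenize(batch):
        # Chunk each book into fixed-length blocks for causal LM training.
        return tokenizer(batch["text"], truncation=True, max_length=2048,
                         return_overflowing_tokens=True)

    tokenized = books.map(tokenize, batched=True,
                          remove_columns=books.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="gutenberg-cpt",
            per_device_train_batch_size=1,
            gradient_accumulation_steps=16,
            num_train_epochs=3,   # "a bunch of epochs"
            learning_rate=2e-5,
            bf16=True,
        ),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

At 30B or 70B scale you'd obviously shard this (FSDP/DeepSpeed), but the recipe is the same.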
I understand the desire, but most of the world's knowledge is under copyright. A 100%-legal corpus will never give you the same performance.