Comment by hedgehog

Here are the only legal efforts I know of that are available in some form:

https://www.fairlytrained.org/

https://www.kl3m.ai/#features

Here’s a dataset that could be used for a public domain model:

https://www.tensorflow.org/datasets/catalog/pg19

If non-public-domain data is acceptable, one can add the code from The Stack. Together, that would be tens of gigabytes of English text and code. Third parties could then add licensed, modern works to the model through further pre-training.
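As a rough illustration of combining the two sources into one pre-training stream, here is a minimal sketch. It assumes the books and code files are already available as iterables of strings; loading PG-19 or The Stack is not shown. The function name `mix_corpora` and the `code_fraction` default of 0.3 are hypothetical choices for illustration, not a recommended ratio.

```python
import random
from typing import Iterable, Iterator

_DONE = object()  # sentinel marking an exhausted stream


def mix_corpora(
    text_docs: Iterable[str],
    code_docs: Iterable[str],
    code_fraction: float = 0.3,  # assumed ratio, purely illustrative
    seed: int = 0,
) -> Iterator[str]:
    """Interleave two document streams into one.

    At each step a document is drawn from code_docs with probability
    code_fraction, otherwise from text_docs. When the chosen stream is
    exhausted, the other one is used, so every document is yielded once.
    """
    rng = random.Random(seed)  # seeded so the ordering is reproducible
    text_it, code_it = iter(text_docs), iter(code_docs)
    while True:
        if rng.random() < code_fraction:
            primary, fallback = code_it, text_it
        else:
            primary, fallback = text_it, code_it
        doc = next(primary, _DONE)
        if doc is _DONE:
            doc = next(fallback, _DONE)
        if doc is _DONE:
            return  # both streams are exhausted
        yield doc
```

In practice `text_docs` would be a stream of public domain books and `code_docs` a stream of source files. Fixing the seed keeps the document order identical between runs, which fits the reproducibility goal.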

I also think a model trained on a large amount of public domain data would be good for experiments in reproducibility, since there would be no intellectual property issues in reproducing the results. It should also be useful in a lot of other ways.