nickpsecurity 4 days ago

I’m still waiting for a large, OSS one with 100% legal pre-training data. We don’t even have a 1B model that I’m sure meets that standard. There’s a fair-trained model for lawyers that claims to.

I think someone running a bunch of epochs of a 30B or 70B on Project Gutenberg would be a nice start. We could do continued pre-training from there.
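
For concreteness, the continued pre-training step might look something like this (a rough sketch, assuming Hugging Face transformers/datasets and the PG19 dump of pre-1919 Gutenberg books; the base model name and hyperparameters are placeholders, not a recipe):

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    # Placeholder: any open-weights causal LM whose provenance you trust.
    base = "some-open-weights-base"
    tokenizer = AutoTokenizer.from_pretrained(base)
    tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(base)

    # PG19: Project Gutenberg books published before 1919 (public domain).
    books = load_dataset("deepmind/pg19", split="train")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=2048)

    tokenized = books.map(tokenize, batched=True,
                          remove_columns=books.column_names)

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="pg19-cpt", num_train_epochs=3,
                               per_device_train_batch_size=1,
                               gradient_accumulation_steps=64),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()

At 30B/70B scale you would obviously shard this across many GPUs (FSDP, DeepSpeed, etc.), but the data side really is that simple.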

So, counting only models that are legal and at least trainable (open weights), the performance can only go up from here.

copperx 4 days ago

I understand the desire, but most of the world's knowledge is under copyright. 100% legal will never give you the same performance.

  • nickpsecurity 3 days ago

    Both of your claims are true. That doesn’t justify breaking the law.

    I could likewise argue that most of the world’s money is in the hands of other people, that I could perform better in the markets if I had it, and so I should just go take it. We still follow the law and respect others’ rights in spite of what acting morally costs us.

    The law-abiding, moral choice is to do what we can within the law while working to improve the law. That means we use a combination of permissively licensed and public domain works to train our models. We also push for legislation that creates exceptions in copyright law for training machine learning models. We’re already seeing progress on that in Israel and Singapore.

  • mewpmewp2 4 days ago

    Meanwhile, countries that ignore copyright would be able to gain a huge advantage.

hedgehog 4 days ago

Are you aware of any efforts to do this? Even a 3B param attempt would be informative.

  • nickpsecurity 3 days ago

    Here are the only legal efforts I know about that are available in some form:

    https://www.fairlytrained.org/

    https://www.kl3m.ai/#features

    Here’s a dataset that could be used for a public domain model:

    https://www.tensorflow.org/datasets/catalog/pg19
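
    The TFDS version is easy to poke at (a minimal sketch; I’m not going to list field names from memory, so it just prints the schema):

      import tensorflow_datasets as tfds

      # PG19: ~28K Project Gutenberg books published before 1919.
      ds = tfds.load("pg19", split="train")
      for example in ds.take(1):
          print(example.keys())  # see the catalog page for field details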

    If going beyond public domain, one could add in the permissively licensed code from The Stack. That would give tens of gigabytes of both English text and code. Then, third parties could add licensed, modern works to the model with further pre-training.
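
    A sketch of what that mix might look like with a recent Hugging Face datasets release (assuming the deepmind/pg19 mirror and bigcode/the-stack, which is gated behind accepting its terms on the Hub; the 80/20 ratio is just an illustration):

      from datasets import load_dataset, interleave_datasets

      # Public domain English books, streamed to avoid a full download.
      pg19 = load_dataset("deepmind/pg19", split="train",
                          streaming=True).select_columns(["text"])

      # Permissively licensed code; log in and accept the terms first.
      stack = load_dataset("bigcode/the-stack", data_dir="data/python",
                           split="train", streaming=True)
      stack = stack.map(lambda ex: {"text": ex["content"]})
      stack = stack.select_columns(["text"])

      # Illustrative text-to-code ratio; the right mix is an open question.
      mixed = interleave_datasets([pg19, stack],
                                  probabilities=[0.8, 0.2], seed=0)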

    I also think a model trained on a large amount of public domain data would be good for experiments that need reproducibility. There would be no intellectual property issues in reproducing the results. It should also be useful in a lot of other ways.