DiabloD3 7 hours ago

Yes! All of those things DO pose existential copyright risks if they are used to violate copyright! We're both on the same page.

If you have a VHS deck, copy a VHS tape, and start handing out copies, and I pick up a copy from you and see, lo and behold, that it contains my copyrighted work, then I have sufficient proof to sue you and most likely win.

If you train an LLM on pirated works and start handing out copies of that LLM, and I pick up a copy, ask it to reproduce my work, and it can do so, even partially, then I have sufficient proof to sue you and most likely win.

Technically, even the question of which license is a bit moot: AGPLv3 or not, it's a copyright violation to reproduce the work without a license. GPL just makes the problem worse for them: anything involving any flavor of GPLv3 can end up snowballing, with major GPL rightsholders enforcing the GPLv3 curing clause, as they will most likely also be able to convince the LLM to reproduce their works.

The real TL;DR is: they have not discovered an infinite money glitch. They must play by the same rules everyone else does, and they are not warning their users of the risks of using these tools.

BTW, if I'm wrong about this (IANAL, after all), then so are the legal departments at companies across the world. Virtually all of them won't allow AGPLv3 programs in the door just because of the legal risk, and many of them won't allow the use of LLMs given the current state of the legal landscape.

Workaccount2 6 hours ago

I think you are confused about how LLMs train and store information. These models aren't archives of code and text; they are surprisingly small, especially relative to the size of the training dataset.
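
To put rough numbers on that (the figures below are assumptions for illustration, not the specs of any particular model), a quick back-of-envelope comparison:

    # Illustrative arithmetic only: the parameter count, weight precision,
    # token count, and bytes-per-token are all assumed round numbers.
    params = 70e9                # e.g. a 70B-parameter model
    bytes_per_param = 2          # fp16/bf16 weights
    model_bytes = params * bytes_per_param

    tokens = 15e12               # assumed training tokens
    bytes_per_token = 4          # rough average bytes of text per token
    corpus_bytes = tokens * bytes_per_token

    print(f"model:  {model_bytes / 1e12:.2f} TB")        # ~0.14 TB
    print(f"corpus: {corpus_bytes / 1e12:.2f} TB")       # ~60 TB
    print(f"ratio:  {corpus_bytes / model_bytes:.0f}x")  # corpus ~430x larger

The model simply doesn't have the bytes to be a literal archive of its training set.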

A recent decision in the Anthropic lawsuit also reaffirms that training on copyrighted works is not itself a violation of copyright.[1]

However, outputting copyrighted material would still be a violation, the same as a person doing it.

Most artists can draw a Batman symbol. Copyright means they can't monetize that ability. It doesn't mean they can't look at bat symbols.

[1]https://www.npr.org/2025/06/25/nx-s1-5445242/federal-rules-i...

  • DiabloD3 2 hours ago

    No, I'm quite aware of how LLMs work. They are statistical models. They have, however, already been caught reproducing source material accurately. There is, inherently, no way to actually stop that if the only training data for a given output is a limited set of inputs. LLMs can and do exhibit extreme overfitting.
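
    To make the overfitting point concrete, here is a toy sketch (a stand-in for illustration, not how any production model is built): a character-level model trained on a single work has exactly one observed continuation for each context, so generation just replays the source.

        # Toy character-level model: with only one work in the training
        # data, every context has a single observed continuation, so
        # greedy generation reproduces the source verbatim.
        from collections import defaultdict

        text = "my copyrighted work, reproduced verbatim by the model"
        k = 4  # context length

        model = defaultdict(list)
        for i in range(len(text) - k):
            model[text[i:i + k]].append(text[i + k])

        out = text[:k]
        while len(out) < len(text):
            out += model[out[-k:]][0]  # the only observed continuation

        print(out == text)  # True -- the "statistics" ARE the source text

    Real models are vastly larger and trained on far more data, but the same failure mode appears wherever a passage is effectively the only training signal for its context.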

    As for the Anthropic lawsuit, the piracy part of the case is continuing, and most models are built with pirated or unlicensed inputs. The part that was decided, although the decision imo was wrong, only covers whether someone CAN train a model.

    At no point have I claimed you can't train one. The question is whether you can distribute one, and then use one. An LLM is not simplistic enough to be considered a phonebook, so they can't just handwave that away.

    Saying an LLM can do that is like saying an artist can make a JPEG of a Batman symbol, and that it's totally okay for them to distribute it because the JPEG artifacts are transformative. LLMs are ultimately just a clever way of compressing data, and compressors are not transformative under the law; but possessing a compressor is not inherently illegal, nor is using one on copyrighted material for your own personal use.
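
    The compression framing isn't just a metaphor. A quick sketch of the underlying idea (a toy illustration of the general principle, not a claim about any specific model): an arithmetic coder driven by a predictive model spends about -log2(p) bits per symbol, so the better a model predicts a text, the fewer bits it needs to store it.

        import math
        from collections import Counter

        text = "the quick brown fox jumps over the lazy dog " * 10

        # Model A: uniform over the alphabet (predicts nothing).
        alphabet = set(text)
        uniform_bits = len(text) * math.log2(len(alphabet))

        # Model B: character frequencies fit to this text (predicts better,
        # so the ideal code length sum(-log2 p) comes out smaller).
        freq = Counter(text)
        fitted_bits = sum(-math.log2(freq[c] / len(text)) for c in text)

        print(f"uniform model: {uniform_bits / 8:.0f} bytes")
        print(f"fitted model:  {fitted_bits / 8:.0f} bytes")

    A model fit closely enough to one work needs almost no extra bits to store it, which is exactly the memorization problem above.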