Comment by Retric 12 hours ago
I’ve yet to read an actual argument defending commercial LLMs as fair use based on existing (edit: legal) criteria.
I think you may have something with that line of reasoning.
Unfortunately, the threshold for a transformative use of fictional works is fairly high. Fan fiction and reasonably distinct works that draw too heavily on an original are both copyright infringing. https://en.wikipedia.org/wiki/Tanya_Grotter
> Models themselves are very clearly transformative.
A near word-for-word copy of large sections of a work seems nowhere near that threshold. An MP3 isn’t even close to a 1:1 copy of a piece of music, but those inherent differences are legally irrelevant; a neural network containing and allowing the extraction of information looks a lot like lossy compression.
Models could easily be transformative, but the justification needs to go beyond “well, obviously they are.”
Models are not word-for-word copies of large sections of text. They are capable of emitting that text, though.
It would be interesting to look at what legal precedents were set regarding MP3s or other encodings. Is the encoding itself an infringement, or is it the decoding, or is it the distribution of a decodable form of a work?
There is also a distinction with a lossy encoding that encodes only a single work: there is clarity when the encoded form serves no purpose other than to be decoded into that work. When the encoding acts as a bulk archive, does the responsibility shift to those who choose what to extract from the archive?
> Is the encoding itself an infringement
Barring a fair use exception, yes.
From what I’ve read, MP3s get the same treatment as cassette tapes, which were also lossy. It was 1:1 digital copies that represented some novelty, but that rarely matters.
I’m hesitant to comment on the rest of that. The ultimate question isn’t whether some difference exists but why that difference matters.
Training itself involves making infringing copies of protected works. Whether or not inference produces copyrighted material is almost beside the point.
Only as long as it's not copied again during training. You can't make copies of your purchased digital copy for any reason other than archival.
If you take that approach to fair use, don't you open the door to the same argument for copyright itself?
How do you distinguish between a tool and the director of a tool? I doubt people would say that a person is immune to copyright or fair use rules because it was the pen that wrote the document, not the person.
If you really haven't read a single argument about it, then you're deliberately blocking them out, because it takes just a couple of minutes of searching.
https://www.arl.org/blog/training-generative-ai-models-on-co...
https://hls.harvard.edu/today/does-chatgpt-violate-new-york-...
https://www.bakerdonelson.com/artificial-intelligence-and-co...
https://www.techpolicy.press/to-support-ai-defend-the-open-i...
Those support the utility or debate individual points, but don’t make a coherent argument that LLMs are strictly fair use.
The first link provides quotes but doesn’t actually argue that LLMs are fair use under current precedent; rather, it argues that training AI can be fair use and that researchers would like LLMs to include copyrighted works to aid research on modern culture. The second article goes into depth but isn’t a defense of LLMs; if anything, it suggests a settlement is likely. The final one instead argues for the utility of LLMs, which is relevant but doesn’t rely on existing precedent; the court could rule in favor of some mandatory licensing scheme, for example.
The third gets close: “We expect AI companies to rely upon the fact that their uses of copyrighted works in training their LLMs have a further purpose or different character than that of the underlying content. At least one court in the Northern District of California has rejected the argument that, because the plaintiffs' books were used to train the defendant’s LLM, the LLM itself was an infringing derivative work. See Kadrey v. Meta Platforms, Case No. 23-cv-03417, Doc. 56 (N.D. Cal. 2023). The Kadrey court referred to this argument as "nonsensical" because there is no way to understand an LLM as a recasting or adaptation of the plaintiffs' books. Id. The Kadrey court also rejected the plaintiffs' argument that every output of the LLM was an infringing derivative work (without any showing by the plaintiffs that specific outputs, or portion of outputs, were substantially similar to specific inputs). Id.”
Very relevant, but runs into issues when large sections can be recovered and people do use them as substitutes for the original work.
It seems like a pretty reasonable argument and easy enough to make. A human with a great memory could probably recreate some absurd % of Harry Potter after reading it; there are some very unusual minds out there. It is clear that if they read Harry Potter, then <edit>being capable</edit> of reproducing it on demand as a party trick would be fair use. So the LLM should also be fair use, since it is using a mechanism similar enough to what humans do, and what humans do is fine.
The LLMs I've used don't randomly start spouting Harry Potter quotes at me, they only bring it up if I ask. They aren't aiming to undermine copyright. And they aren't a very effective tool for it compared to the very well developed networks for pirating content. It seems to be a non-issue that will eventually be settled by the raw economic force that LLMs are bringing to bear on society in the same way that the movie industry ultimately lost the battle against torrents and had to compete with them.
> is clear that if they read Harry Potter and reproduce it on demand as a party trick that would be fair use.
Actually no, that could be copyright infringement. Badly singing a recent pop song in public also qualifies as copyright infringement. Public performances count as copying here.
> Badly singing a recent pop song in public also qualifies as copyright infringement
For commercial purposes only. If someone sells a recreation of the Harry Potter book, it’s illegal regardless of whether it was done from memory, by directly copying the book, or by using an LLM. It’s the act of broadcasting it that infringes on copyright, not the content itself.
There’s a bunch of nuance here.
But just for clarification, selling a recreation isn’t required for copyright infringement. The copying itself can be problematic, so you can’t defend yourself by saying you haven’t yet sold any of the 10,000 copies you just printed. There are some exceptions that allow you to make copies for specific purposes (skip protection on a portable CD player, for example), but those don’t apply to the 10k-copies situation.
Ah sorry, I mistyped. Being able to do that would be fair use. I went back and fixed the comment.
Although frankly, as has been pointed out many times, the law is also stupid in what it prohibits, and fixing that should be the first priority. It’s done some terrible damage to our culture. My family used to be part of a community choir until it shut down, basically for copyright reasons.
The difference might be the "human doing it as a party trick" vs "multi billion dollar corporation using it for profit".
Having said that I think the cat is very much out of the bag on this one and, personally, I think that LLMs should be allowed to be trained on whatever.
> It is clear that if they read Harry Potter, then <edit>being capable</edit> of reproducing it on demand as a party trick would be fair use.
Not fair use. No one would ever prosecute it as infringement but it's not fair use.
I'm fairly sure that the law treats humans and machines differently, so arguing "it would be OK if a person did it, therefore it's OK to build a machine that does it" is not very helpful. (I'm not sure you're doing that, but lots of random non-lawyers on the Internet seem to be doing that.)
Claims like this demonstrate it, really: it is obviously not copyright infringement for a human to memorise a poem and recite it in private; it obviously is copyright infringement to build a machine that does that and grant public access to that machine. (Or does anyone think that's not obvious?)
> A human with a great memory
This kind of argument keeps popping up, usually to justify why training LLMs on protected material is fair and why their output is fair. It's always used in a super selective way, never accounting for confounding factors, just because it superficially sort of supports that idea.
Exceptional humans are rare. When they learn, create something new based on prior knowledge, or just reproduce the original, they do it with human limitations and on human timescales. Laws account for these limitations but still draw lines for when some of this behavior is not permitted.
The law didn't account for computer software that can ingest the entirety of human creation, something no human could ever do, and then reproduce the original or create an endless number of variations in the blink of an eye.
Nobody in real life thinks humans and machines are the same thing or actually believes they should have the same legal status. The A.I. enthusiast would not support the legality of shooting a person who is no longer useful the way a company would shred an old hard drive.
This supposed failure to see the difference between the human mind and a machine whenever someone brings up copyright is performative and disingenuous.
> Nobody in real life thinks humans and machines are the same thing
Maybe you've been following a different conversation, or jumping to conclusions is just more convenient. This isn't about the "legal status of AI" but about laws written with only the capabilities of humans in mind, at a time when systems as powerful as today's were unthinkable. Obviously the same laws have to set different limits for humans and machines.
There's no law limiting a human's top (running) speed but you have speed limits for cars. Maybe you're legally allowed to own a semi-automatic weapon but not an automatic one. This is the ELI5 for why when legislating, capabilities make all the difference. Obviously a rifle should not have the same legal status or be the same thing as a human, just in case my point is still lost on you.
Literally every single discussion on this LLM training/output topic, this one included, eventually has a number of people basing their argument on "but humans are allowed to do it", completely ignoring that humans can only do it in a much, much more limited way.
> is performative and disingenuous
That's an extremely uncharitable and aggressive take, especially after not bothering to understand at all what I said.
>That's an extremely uncharitable and aggressive take, especially after not bothering to understand at all what I said.
To be clear, my intent wasn't to say you were the one being performative and disingenuous. I was referring to the sort of person you were debating against, the one who thinks every legal issue involving A.I. can be settled by typing "humans are allowed to do it."
Since I replied to you, I can see how what I wrote was confusing. My apologies.
The parent you replied to claimed LLMs are using "mechanism similar enough to what humans do and what humans do is fine."
The parent probably doesn't want his or her brain shredded like an old hard drive, despite claiming similar mechanisms whenever it is convenient.
I'm arguing nobody actually believes there are "similar mechanisms" between machines and humans in their revealed preferences in day to day life.
>There's no law limiting a human's top (running) speed but you have speed limits for cars. Maybe you're legally allowed to own a semi-automatic weapon but not an automatic one.
I don't believe this analogy works. If we're talking about transmitting the text of Harry Potter, I believe it would already be illegal for a single human to type it on demand as a service.
If we are talking about remembering the text of Harry Potter but not reciting it on demand, that's not illegal for a human because copyright doesn't govern human memories.
I don't see what copyright law you think needs updating.
That’s why the “transformative” argument falls so flat to me. It’s about transformation in the mind and hands of a human.
Traditionally tools that reduce the friction of creating those transformations make a work less “transformed” in the eyes of the law, not more so. In this case the transformation requires zero mental or physical effort.
I’m looking for a link that does something like this but ends up supporting commercial LLMs:
https://copyrightalliance.org/faqs/what-is-fair-use/
1. The purpose and character of the use, including whether such use is of a commercial nature or is for non-profit educational purposes. (Commercial use: the least wiggle room.)
2. The nature of the copyrighted work. (A fictional work: the least wiggle room.)
3. The amount and substantiality of the portion used in relation to the copyrighted work as a whole. (42% is considered a huge fraction of a book.)
4. The effect of the use upon the potential market for or value of the copyrighted work. (The best argument, as the effect is minimal on the work as a piece of entertainment. Not so as a cultural icon: someone writing a book report or fan fiction may be less likely to buy a copy.)
Those aren’t the only factors, but I’m more interested in the counter argument here than trying to say they are copyright infringing.
Copyright notices in books make it absolutely clear: you are not allowed to acquire a text by copying it without authorisation.
If you photocopy a book you haven't paid for, you've infringed copyright. If you scan it, you've infringed copyright. If you OCR the scan, you've infringed copyright.
There's legal precedent in going after torrenters and z-lib etc.
So when Zuckerberg told the Meta team to do the same, he was on the wrong side of precedent.
Arguing otherwise is literally arguing that huge corporations are somehow above laws that apply to normal people.
Obviously some people do actually believe this. Especially the people who own and work for huge corporations.
But IMO it's far more dangerous culturally and politically than copyright law is.
For this part in particular:
> The amount and substantiality of the portion used in relation to the copyrighted work as a whole. (42% is considered a huge fraction of a book.)
For AI models as they currently exist… I'm not sure about typical or average, but Llama 3 was trained on 15e12 tokens for all model sizes up to 405 billion parameters (~37 tokens per parameter), so a 100,000-token book (~75,000 words) is effectively contributing about 2,700 parameters to the whole model.
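A minimal sketch of that arithmetic in Python (the token and parameter counts are the rough published Llama 3 figures; the words-per-token ratio is my assumption):

    # Rough estimate of how much model capacity one book can claim,
    # assuming Llama 3's reported ~15 trillion training tokens and 405B parameters.
    training_tokens = 15e12
    model_parameters = 405e9
    tokens_per_parameter = training_tokens / model_parameters  # ~37

    book_tokens = 100_000  # ~75,000 words at an assumed ~0.75 words per token
    effective_parameters = book_tokens / tokens_per_parameter  # ~2,700

    print(f"{tokens_per_parameter:.0f} tokens per parameter")
    print(f"one book ~ {effective_parameters:,.0f} parameters of the model")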
The *average* book is condensed into a summary of that book and of its style. This is also why, when you ask a model for specific details of stuff in the training corpus, what you get back usually only sounds about right rather than being an actual quote, and why LLMs need access to a search engine to give exact quotes; the exceptions are things that have been quoted many, many times, like the US constitution or, by the look of things from this article, widely pirated books where there are a lot of copies.
Mass piracy leading to such infringement is still bad, but I think the reasons why matter: given that Meta is accused of mass piracy to get the training set for Llama, I think they're as guilty as can be, but if this had been "we indexed the open internet; pirate copies were accidental", that would be at least a mitigation.
(There's also an argument for "your writing is actually very predictable"; I've not read the HP books myself, though (1) I'm told the later ones got thicker due to repeating exposition of the previous books, and (2) a long-running serialised story I read during the pandemic, The Deathworlders, became very predictable towards the end, so I know it can happen.)
Conversely, for this part:
> The effect of the use upon the potential market for or value of the copyrighted work. (The best argument, as the effect is minimal on the work as a piece of entertainment. Not so as a cultural icon: someone writing a book report or fan fiction may be less likely to buy a copy.)
The current uses alone should make it clear that the effect on the potential market is catastrophic, and not just for existing works but also for not-yet-written ones.
People are using them to write blogs (directly from the LLM, not from a human who merely used one as a copy-editor) and to generate podcasts (some have their own TTS, but that's easy anyway). My experiments suggest current models are still too flawed to be worth listening to over, e.g., the opinion of a complete stranger who insists they've "done their own research": https://github.com/BenWheatley/Timeline-of-the-near-future
LLMs are not yet good enough to write books, but I have tried using them to write short stories to keep track of capabilities, and o1 is already better than similar short stories on Reddit (not "good", just "better"): https://github.com/BenWheatley/Studies-of-AI/blob/main/Story...
But things do change, and I fully expect the output of various future models (not necessarily Transformer-based) to increase the fraction of humans whose writing they surpass. I'm not sure what counts as "professional writer", but the U.S. Bureau of Labor Statistics says there are 150,000 "Writers and Authors"* out of a total population of about 340 million, so when AI is around the level of the best 0.04% of the population, it will start cutting into such jobs.
On the basis that current models seem (to me) to write software at about the level of a recent graduate, and with the potentially incorrect projection that this is representative across domains: there are about 1.7 million software developers and 100k new software developer graduates each year, so LLMs today would be around the 100k worst of the 1.7 million best out of 340 million people. That is, all software developers are the top 0.5% of the population, and LLMs are on par with the bottom ~6% of that group, or about 0.03% of the whole population. (This says nothing much about how soon the models will improve.)
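The same two back-of-envelope estimates as a short Python sketch (all headcounts are the rough figures quoted above, not precise statistics):

    # Where do writers and "recent graduate"-level LLMs sit in the population?
    population = 340e6
    writers = 150_000
    developers = 1.7e6
    new_grads = 100_000

    print(f"writers: top {writers / population:.2%} of the population")
    print(f"developers: top {developers / population:.1%} of the population")
    print(f"graduate-level LLMs: bottom {new_grads / developers:.0%} of developers,")
    print(f"  i.e. about {new_grads / population:.2%} of the whole population")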
But of course, some of that copyrighted content is about software development, and we're having conversations here on HN about the trouble fresh graduates are having and whether this is more down to AI, the change in US R&D taxation rules (unlikely IMO; I'm in Germany and I think the same is happening here), or the global economy moving away from near-zero interest rates.
* https://www.bls.gov/ooh/media-and-communication/writers-and-...
Based upon past legal decisions, there is a clear argument that the distinction for fair use is whether a work is substantially different from another. You are allowed to write a book containing information you learned from another book. There is a threshold in academia regarding plagiarism that stands apart from the legal standard. The measure used in Gyles v Wilcox was whether the new work could substitute for the old. Lord Hardwicke had the wisdom to defer to experts in the field as to what the standard should be for accepting something as meaningfully changed.
Recent decisions such as Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith have walked a fine line with this. I feel like the Supreme Court got this one wrong, because the work is far more notable as a Warhol than as a copy of a photograph; perhaps that substitution rule should be a two-way street. If the original work cannot substitute for the copy, then clearly the copy must be transformative.
LLMs generating works verbatim might be an infringement of copyright (probably not), but distributing those verbatim works without a licence certainly would be. In either case, it is probably considered a failure of the model; OpenAI have certainly said that such reproductions shouldn't happen and that they consider it a failure mode when it does. I haven't seen similar statements from other model producers, but it would not surprise me if this were the standard sentiment.
Humans looking at works and producing things in a similar style is allowed; indeed, this is precisely what art movements are. The same transformative threshold applies. If you draw a cartoon mouse, that's OK, but if people look at it and go "It's Mickey Mouse", then it's not. If it's Mickey to Tiki Tu Meke, it clearly is Mickey, but it is also clearly transformative.
Models themselves are very clearly transformative. Copyright itself was conceived at a time when generated content was not considered possible, so the notion of the output of a transformative work being a non-transformative derivative of something else was never legally evaluated.