Comment by freejazz 13 hours ago
How do they get to the conclusion that AI uses are protected under the fair use doctrine and anything otherwise would be an "expansion" of copyright? Fairly telling IMO
The most important part of fair use is whether the use harms the market for the original work. Search helps bring more eyes to the original work; LLMs don't.
The fair use test (in US copyright law) is a 4 part test under which impact on the market for the original work is one of the 4 parts. Notably, a use that massively harms a work's market does not in and of itself constitute a copyright violation. And it couldn't be any other way. Imagine if you could be sued for copyright infringement for using a work to criticize that work or its author, if the author could prove that your criticism hurt their sales. Imagine if you could be sued for copyright infringement because you wrote a better song or book on the same themes as a previous creator after seeing their work and deciding you could do it better.
Perhaps famously, emulators very clearly and objectively impact the market for game consoles and computers, and yet they are also considered fair use under US copyright law.
No one part of the 4 part test is more important than the others. And so far in the US, training and using an LLM has been ruled by the courts to be fair use so long as the materials used in the training were obtained legally.
> And so far in the US, training and using an LLM has been ruled by the courts to be fair use so long as the materials used in the training were obtained legally.
Just like OpenAI is rightfully upset if their LLM output is used to train a competitor’s model and might seek to restrict it contractually, publishers too may soon have EULAs just for reading their books.
1. Character of the use. Commercial. Unfavorable.
2. Nature of the work. Imaginative or creative. Unfavorable.
3. Quantity of use. All of it. Unfavorable.
4. Impact on original market. Direct competition. Royalty avoidance. Unfavorable.
Just because the courts have not done their job properly does not mean something illegal is not happening.
All of these apply to emulators.
* The use is commercial (a number of emulators are paid access, and the emulator case that carved out the biggest fair use space for them, Connectix Virtual Game Station, was a very explicitly commercial product)
* The nature of the work is imaginative and creative. No one can argue games and game consoles aren't imaginative and creative works.
* Quantity of use. A perfect emulator must replicate 100% of the functionality of the system being emulated, often including BIOS functionality.
* Impact on market. Emulators are very clearly in direct competition with the products they emulate. This was one of Sony's big arguments against VGS. But also just look around at the officially licensed mini-retro consoles like the ones put out by Nintendo, Sony and Atari. Those retro consoles are very clearly competing with emulators in the retro space and their sales were unquestionably affected by the existence of those emulators. Royalty avoidance is also in play here since no emulator that I know of pays licensing fees to Nintendo or Sony.
So are emulators a violation of copyright? If not, what is the substantial difference here? An emulator can duplicate a copyrighted work exactly, and in fact is explicitly intended to do so (yes, you can claim it's about the homebrew scene, and you can look at any tutorial on setting up these systems on YouTube to see that's clearly not what people want to do with them). Most of the AI systems are specifically programmed to not output copyrighted works exactly. Imagine a world where emulators had hash codes for all the known retail roms and refused to play them. That's what AI systems try to do.
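That hash-blocklist idea is simple enough to sketch concretely. This is a minimal illustration, not any real emulator's or AI vendor's code; the hash set and function names are hypothetical:

```python
import hashlib

# Hypothetical blocklist of SHA-256 digests of known retail ROMs.
# A real list would be enormous; one computed entry keeps the sketch runnable.
KNOWN_RETAIL_HASHES = {
    hashlib.sha256(b"example retail rom").hexdigest(),
}

def is_blocked(rom_bytes: bytes) -> bool:
    """Refuse to load content whose hash matches a known retail release."""
    return hashlib.sha256(rom_bytes).hexdigest() in KNOWN_RETAIL_HASHES
```

An exact-match hash is trivially defeated by flipping one byte, which is part of why real systems lean on output filters rather than input fingerprints, but it captures the "refuse known copyrighted works" intent.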
Just because you have enumerated the 4 points and given 1 word pithy arguments for something illegal happening does not mean that it is. Judge Alsup laid out a pretty clear line of reasoning for why he reached the decision he did, with a number of supporting examples [1]. It's only 32 pages, and a relatively easy read. He's also the same judge that presided over the Oracle v. Google cases that found Google's use of the java APIs to be fair use despite that also meeting all 4 of your descriptions. Given that, you'll forgive me if I find his reasoning a bit more persuasive than your 52 word assertion that something illegal is happening.
[1]: https://fingfx.thomsonreuters.com/gfx/legaldocs/jnvwbgqlzpw/...
It seems like you're responding to a question about training by talking about inference. If you train an LLM because you want to use it to do sentiment analysis to flag social media posts for human review, or Facebook trains one and publishes it and others use it for something like that, how is that doing anything to the market for the original work? For that matter, if you trained an LLM and then ran out of money without ever using it for anything, how would that? It should be pretty obvious that the training isn't the part that's doing anything there.
And then for inference, wouldn't it depend on what you're actually using it for? If you're doing sentiment analysis, that's very different than if you're creating an unlicensed Harry Potter sequel that you expect to run in theaters and sell tickets. But conversely, just because it can produce a character from Harry Potter doesn't mean that couldn't be fair use either. What if it's being used for criticism or parody or any of the other typical instances of fair use?
The trouble is there's no automated way to make a fair use determination, and it really depends on what the user is doing with it, but the media companies are looking for some hook to go after the AI companies who are providing a general purpose tool instead of the subset of their "can't get blood from a stone" customers who are using that tool for some infringing purpose.
> AI training and the thing search engines do to make a search index are essentially the same thing.
Well, AI training has annoyed LOTS of people: overloaded websites, things done just because they can (e.g. Facebook sucking up the content of lots of pirated books).

Since this AI race started, our small website is constantly overrun by bots and is not usable by humans because of the load. We NEVER had this problem before AI, when access was just search engine indexing.
This is largely because search engines are a concentrated market and AI training is getting done by everybody with a GPU.
If Google, Bing, Baidu and Yandex each come by and index your website, they each want to visit every page, but there aren't that many such companies. Also, they've been running their indexes for years so most of the pages are already in them and then a refresh is usually 304 Not Modified instead of them downloading the content again.
But now there are suddenly a thousand AI companies and every one of them wants a full copy of your site going back to the beginning of time while starting off with zero of them already cached.
Ironically copyright is actually making this worse, because otherwise someone could put "index of the whole web as of some date in 2023" out there as a torrent and then publish diffs against it each month and they could all go download it from each other instead of each trying to get it directly from you. Which would also make it easier to start a new search engine.
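The cost gap between an established crawler and a fresh one comes down to HTTP revalidation. Roughly (a sketch; the function names and cache layout are illustrative, not from any real crawler):

```python
# A mature search crawler saved validators (ETag / Last-Modified) on its
# first pass, so a refresh crawl can be answered 304 Not Modified with no
# body. A brand-new AI crawler has no cache, so every page is a full 200.

def conditional_headers(cache_entry: dict) -> dict:
    """Build revalidation headers for a page we already have cached."""
    headers = {}
    if "etag" in cache_entry:
        headers["If-None-Match"] = cache_entry["etag"]
    if "last_modified" in cache_entry:
        headers["If-Modified-Since"] = cache_entry["last_modified"]
    return headers

def handle_response(status: int, body, cache_entry: dict) -> bytes:
    """Reuse the cached copy on 304; store the fresh body on 200."""
    if status == 304:
        return cache_entry["body"]
    cache_entry["body"] = body
    return body
```

Multiply the missing-cache case by a thousand new crawlers, each starting from zero, and you get the load the parent comment describes.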
There was also the lawsuit against Google over the Google Books project, which is not only very similar to how AI systems ingest copyrighted material, but went further than AI and actually reproduced snippets of those works word for word (intentionally so). Google Books was also found to be fair use.
> There was a relatively tiny but otherwise identical uproar over Google even before they added infoboxes that reduced the number of people who clicked through.
But is that because it isn't fair use or because of the virulent rabies epidemic among media company lawyers?
Basically, it’s an open question that courts have yet to decide. But the idea is that it’s fair use until courts decide otherwise (or laws decide otherwise, but that doesn’t seem likely). That’s my understanding, but I could be wrong. I expect we’ll see more and more cases about this, which is exactly why the EFF wants to take a position now.
They do link to a (very long) article by a law professor arguing that data mining is fair use. If you want to get into the weeds there, knock yourself out.
https://lawreview.law.ucdavis.edu/sites/g/files/dgvnsk15026/...
> Basically, it’s an open question that courts have yet to decide.
While the Supreme Court hasn't yet either ruled on it or turned it away, a number of federal trial courts have found training AI models on legally acquired materials to be fair use (even while finding, in some of those and other cases, that pirating copies to then use in training is not, and that using models as a tool to produce verbatim or modified same-medium copies of works from the training material is also not).
I’m not aware of any US case going the other way, so, while the cases may not strictly be precedential (I think they are all trial court decisions so far), they are something of a consistent indicator.
> even while finding, in some of those and other cases, that pirating to get copies to then use in training is not
I still don't get this one. It seems like they've made a ruling with a dependency on itself without noticing.
Suppose you and some of your friends each have some books, all legally acquired. You get them together and scan them and do your training. This is the thing they're saying is fair use, right? You're getting together for the common enterprise of training this AI model on your lawfully acquired books.
Now suppose one of your friends is in Texas and you're in California, so you do it over the internet. Making a fair use copy is not piracy, right? So you're not making a "pirated copy", you're making a fair use copy.
They recognize that one being fair use has something to do with the other one being, but then ignore the symmetry. It's like they hear the words "file sharing" and refuse to allow it to be associated with something lawful.
In Judge Alsup's case, it largely hinged on whether you had a right to the initial copy in the first place. If I read (and recall) his ruling correctly, the initial pirated copy (that is, downloading from a source that didn't have the right to distribute it in the first place) made all subsequent intermediary copies necessary to the training process also not fair use.
So in the case of you and your friends, it isn't the physical location that makes a difference, but whether you obtained the original copy legally, and the subsequent copies were necessary parts of the training process. This is also one of those places where we see the necessity for a legal concept of corporate "personhood". AnthonyMouseAI Inc. is the entity that needs to acquire and own the original copy in order for you and your friend to be jointly working on the process and sending copies back and forth. If your friend stops being an employee of AnthonyMouseAI Inc, they can't keep those copies and you can't send them any more.
Can you and your buddies do this without forming a legal corporation or partnership? Sure. Will that be a complicating factor if a publisher sued you? Probably.
> Basically, it’s an open question that courts have yet to decide.
This is often repeated, but not true. Multiple US and UK courts have repeatedly ruled that it is fair use.
Anthropic won. Meta won. Just yesterday, Stability won against Getty.
At this point, it's pretty obvious that it's legal, considering that media companies have lost every single lawsuit so far.
I think your question was supposed to be rhetorical, but I think it's safe to assume that the answer is that they're lawyers. They've read the law, and read through a large number of cases to see how judges have interpreted it over the past century or so.
> Not the EFF I once knew. Are they now pro-bigtech?
There's nothing pro-bigtech in this proposal. Big tech can afford the license fees and lawsuits... and corner the market. The smaller providers will be locked out if an extended version of the already super-stretched copyright law becomes the norm.
They’ve always been anti-expansive-copyright, which has historically aligned with much (but not all) of big tech, and against big content/media.
A lot of the people that were anti-expansive-copyright only because it was anti-big-media have shifted to being pro-expansive-copyright because it is perceived as being anti-big-tech (and specifically anti-AI).
AI training and the thing search engines do to make a search index are essentially the same thing. Hasn't the latter generally been regarded as fair use, or else how do search engines exist?