Comment by tpmoney 12 hours ago

At what point do you cross the line from "legitimate use of a work" to illegitimate use?

If I take my legally purchased epub of a book, pipe it through `wc`, and release the outputs, is that a violation of copyright? What about 10 books? 100? How many books would I have to pipe through `wc` before the outputs become a violation of copyright?

What if I take those same books and generate a spreadsheet of all the words and how frequently they're used? Again, same question, where is the line between "fine" and "copyright violation"?
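For concreteness, here is a minimal Python sketch of that word-frequency step. It is an illustration only: the comment describes a spreadsheet, not code, and the `word_frequencies` helper and the two sample strings are made-up stand-ins for real books.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Count how often each lowercased word appears across all texts."""
    counts = Counter()
    for text in texts:
        # Treat runs of letters and apostrophes as words (a simplifying assumption).
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


# Hypothetical stand-ins for the legally purchased books.
books = ["The cat sat on the mat.", "The dog sat."]
freqs = word_frequencies(books)

# Each (word, count) pair is the equivalent of one spreadsheet row.
for word, count in freqs.most_common():
    print(word, count)
```

The result holds counts only; the order of the words in the source text is discarded at this step.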

What if I take that spreadsheet, load it into a website and make a JavaScript program that weights every word by count and then generates random text strings based on those weights? Is that not essentially an LLM in all but usefulness? Is that a violation of copyright now that I'm generating new content based on statistical information about copyrighted content? If I let such a program run long enough and run on enough machines, I'm sure those programs would generate strings of text from the works that went into the models. Is that what makes this a copyright violation?
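A rough sketch of such a weighted generator, written in Python rather than the JavaScript the comment mentions. The `generate` function, the `seed` parameter, and the tiny frequency table are all assumptions for illustration, not anything from the thread.

```python
import random
from collections import Counter


def generate(freqs, n_words, seed=None):
    """Draw n_words independently, each with probability proportional to its count."""
    rng = random.Random(seed)
    words = list(freqs)
    weights = [freqs[w] for w in words]
    # random.choices samples with replacement according to the given weights.
    return " ".join(rng.choices(words, weights=weights, k=n_words))


# Hypothetical frequency table, as might be exported from the spreadsheet.
freqs = Counter({"the": 3, "sat": 2, "cat": 1, "dog": 1})
sample = generate(freqs, 8, seed=0)
print(sample)
```

Every word is drawn independently of the ones before it, so this is even simpler than a Markov chain; reproducing a sentence from a source book would happen only by chance.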

If that's not a violation, how many other statistical transformations and weighting models would I have to add to my JavaScript program before it's a violation of copyright? I don't think it's reasonable to say any part of this is "clearly not" fair use, no matter how many books I pump into that original set of statistics. And at least so far, the US courts agree with that.

free_bip 12 hours ago

I think your analogy is a massive stretch. `wc` is neither generative nor capable of having market effect.

Your second construction is generative, but likely worse than a Markov chain model, which also did not have any market effect.

We're talking about the models that have convinced every VC they can make a trillion dollars from replacing millions of creative jobs.


tpmoney 11 hours ago

It's not a stretch because I'm not claiming they're the same thing; I'm incrementally walking the tech stack to try to find where we would want to draw the line. If something has to be generative in order to be a violation, that (for all but the most insane definitions of generative) clears `wc`, but what about publishing the DVD or Blu-ray encryption keys? Most of the "hacker" communities pretty clearly believe that isn't a violation of copyright. But is it a violation of copyright to distribute that key and also software that can use that key to make a copy of a DVD? If not, why? Is it because the user has to combine the key with the software and specifically direct that software to make a copy, so that the copy is a violation of copyright but the software and key combination is not?
If the combination of the decryption key and the software that can use that key to make a copy of a DVD is not a violation of copyright, does that imply that separately distributing a model and a piece of software that can use that model is also not a copyright violation? If it is a violation, what makes it different from the key + copy software combo?
If we decide that generative is a necessary component, is the line just whenever the generative model becomes useful? That seems arbitrary and unnecessarily restrictive. Google Scholar is an instructive example here: a search database that scanned many thousands of copyrighted materials, digitized them, and then made that material searchable to anyone and even (intentionally) displayed verbatim copies (or even images) of parts of the works in question. This is unquestionably useful for people, and also very clearly reproducing portions of copyrighted works. Should the court cases be revisited and Google Scholar shut down for being useful?
If market effect is the key thing, how do we square that with the fact that a number of unquestionably market-impacting things are also considered fair use? Emulators are the classic example here, and certainly modern retro gaming OSes like Recalbox or RetroPie have measurable impacts on the market for things like nostalgia-bait mini SNES and Atari consoles. And yet, the emulators and their OSes remain fair use. Or again, let's go back to the combination of the DVD encryption keys and something like HandBrake. Everyone knows exactly what sort of copyright infringement most people do with those things. And there are whole businesses dedicated to making a profit off of people doing just that (just try to tell anyone with a straight face that Plex servers are only being used to connect to legitimate streaming services and stream people's digitized home movies).
My point is that AI models touch on all of these sorts of areas that we have previously carved out as fair use, and AI models are useful tools that don't (despite claims to the contrary) clearly fall afoul of copyright law. So any argument that they do needs to think about where we draw the lines and what factors go into that decision. So far the courts have found training an AI model with legally obtained materials and distributing that model to be fair use, and they've explained how they got to that conclusion. So an argument to the contrary needs to draw a different line and explain why the line belongs there.

- free_bip 7 hours ago
  
  If your argument is that all of these things somehow combine to make the specific case I mentioned in my original comment legal (which was "stealing the work of every single artist, living and dead, for the sole purpose of making a profit", and I'll add replacing artists in the process to that), then I'm not seeing it.
  You also seem to be talking about AI training more generally and not the specific case I singled out, which is important because this isn't a case of simply training a model on content obtained with consent - the material OpenAI and Stable Diffusion gathered was very explicitly obtained without consent, and may have been gathered through outright piracy! (This came out in a case against Meta somewhat recently, but the exact origins of other companies' datasets remain a mystery.)
  Now I explained in another comment why I think current copyright laws should be able to clearly rule this specific case as copyright infringement, but I'm not arrogant enough to think I know better than copyright attorneys. If they say it falls under fair use, I'm going to trust them. I'm also going to say that the law needs to be updated because of it, and that brings us full circle to why I disagree with the article in the first place.
  