Comment by free_bip 13 hours ago

One of the few times I vehemently disagree with the EFF.

The problem is that this article seems to make absolutely no effort to differentiate legitimate uses of GenAI (things like scientific and medical research) from the completely illegitimate uses of GenAI (things like stealing the work of every single artist, living and dead, for the sole purpose of making a profit).

One of those is fair use. The other is clearly not.

Calavar 13 hours ago

What happens when a researcher makes a generative art model and publicly releases the weights? Anyone can download the weights and use it to turn a quick profit.

Should the original research use be considered legitimate fair use? Does the legitimacy get 'poisoned' along the way when a third party uses the same model for profit?

Is there any difference between a mom-and-pop restaurant that uses the model to make a design for its menu versus a multi-billion-dollar corp that's planning on laying off all of its in-house graphic designers? If so, where between those two extremes should the line be drawn?

  • free_bip 12 hours ago

    I'm not a copyright attorney in any country, so the answer (assuming you're asking me personally) is "I don't know and it probably depends heavily on the specific facts of the case."

    If you're asking for my personal opinion, I can weigh in on how I'd read some of the fair use factors.

    - Research into generative art models (the kind done by e.g. OpenAI or Stability AI) is only possible due to funding. That funding comes mainly from VC firms looking to get ROI by replacing artists with AI[0], plus debt financing from major banks on top of that. This weighs on both the market-effect factor and the purpose/character-of-use factor, and not in their favor. If the research has limited market impact and is not done for the express purpose of replacing artists, then I think it would likely be fair use (an example could be background removal/replacement).

    - I don't know whether there are any legal implications of a large vs. small corporation profiting from a product of copyright infringement. Maybe it violates some other law, maybe it doesn't. All I know is that the output of a GenAI model is not copyrightable, which to my understanding means their profit potential is limited, since literally anyone else can use it for free.

    [0]: https://harlem.capital/generative-ai-the-vc-landscape/

tpmoney 12 hours ago

At what point do you cross the line from "legitimate use of a work" to illegitimate use?

If I take my legally purchased epub of a book and pipe it through `wc` and release the outputs, is that a violation of copyright? What about 10 books? 100? How many books would I have to pipe through `wc` before the outputs become a violation of copyright?

What if I take those same books and generate a spreadsheet of all the words and how frequently they're used? Again, same question, where is the line between "fine" and "copyright violation"?
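To make the "spreadsheet" step concrete, here's a minimal sketch of what that word-frequency table could look like in JavaScript (the function and variable names are illustrative, not from any real tool):

```javascript
// Hypothetical sketch of the "spreadsheet" step: count how often each
// word appears in a body of text.
function wordFrequencies(text) {
  const counts = new Map();
  // Lowercase the text and pull out runs of letters/apostrophes as "words".
  for (const word of text.toLowerCase().match(/[a-z']+/g) ?? []) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}
```

Run it over one book or ten thousand and the output is the same kind of object: a table of words and counts, with none of the original text's ordering or expression.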

What if I take that spreadsheet, load it into a website, and write a JavaScript program that weights every word by its count and then generates random text strings based on those weights? Is that not essentially an LLM in all but usefulness? Is that a violation of copyright now that I'm generating new content based on statistical information about copyrighted content? If I let such a program run long enough on enough machines, I'm sure it would eventually generate strings of text from the works that went into the models. Is that what makes this a copyright violation?
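The generator described above is only a few lines more: sample words at random, weighted by their counts. A minimal sketch of that unigram version (names are illustrative; a real LLM conditions on context, which this deliberately does not):

```javascript
// Hypothetical sketch: generate "text" by sampling words at random,
// weighted by how often each word appeared in the source material.
// Each word is drawn independently -- a unigram model.
function weightedSampler(counts) {
  const words = [...counts.keys()];
  const weights = [...counts.values()];
  const total = weights.reduce((a, b) => a + b, 0);
  return function sample() {
    // Pick a point in [0, total) and find which word's slice it lands in.
    let r = Math.random() * total;
    for (let i = 0; i < words.length; i++) {
      r -= weights[i];
      if (r < 0) return words[i];
    }
    return words[words.length - 1]; // guard against float rounding
  };
}

function generate(counts, n) {
  const sample = weightedSampler(counts);
  return Array.from({ length: n }, sample).join(" ");
}
```

Feed it a frequency table and a length and it emits word salad with the source's statistics but none of its expression, which is exactly what makes the line-drawing question awkward.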

If that's not a violation, how many other statistical transformations and weighting models would I have to add to my JavaScript program before it becomes a violation of copyright? I don't think it's reasonable to say any part of this is "clearly not" fair use, no matter how many books I pump into that original set of statistics. And at least so far, the US courts agree.

  • free_bip 12 hours ago

    I think your analogy is a massive stretch. `wc` is neither generative nor capable of having a market effect.

    Your second construction is generative, but likely worse than a Markov chain model, which also did not have any market effect.

    We're talking about the models that have convinced every VC that they can make a trillion dollars by replacing millions of creative jobs.

    • tpmoney 11 hours ago

      It's not a stretch, because I'm not claiming they're the same thing; I'm incrementally walking the tech stack to try to find where we would want to draw the line. If something has to be generative in order to be a violation, that (for all but the most insane definitions of generative) clears `wc`, but what about publishing the DVD or Blu-ray encryption keys? Most of the "hacker" communities pretty clearly believe that isn't a violation of copyright. But is it a violation of copyright to distribute that key along with software that can use it to make a copy of a DVD? If not, why? Is it because the user has to combine the key with the software and specifically direct that software to make a copy, so that the copy is a violation of copyright but the software-and-key combination is not?

      If the combination of the decryption key and the software that can use it to make a copy of a DVD is not a violation of copyright, does that imply that separately distributing a model and a piece of software that can use that model is also not a copyright violation? If it is a violation, what makes it different from the key + copy software combo?

      If we decide that being generative is a necessary component, is the line just wherever the generative model becomes useful? That seems arbitrary and unnecessarily restrictive. Google Books is an instructive example here: a search database that scanned many thousands of copyrighted works, digitized them, made the material searchable by anyone, and even (intentionally) displayed verbatim excerpts (or images) of parts of the works in question. This is unquestionably useful to people, and also very clearly reproduces portions of copyrighted works. Should the court cases be revisited and Google Books shut down for being useful?

      If market effect is the key thing, how do we square that with the fact that a number of unquestionably market-impacting things are also considered fair use? Emulators are the classic example here, and certainly modern retro-gaming OSes like Recalbox or RetroPie have measurable impacts on the market for things like nostalgia-bait mini SNES and Atari consoles. And yet the emulators and their OSes remain fair use. Or again, let's go back to the combination of the DVD encryption keys and something like HandBrake. Everyone knows exactly what sort of copyright infringement most people do with those things, and there are whole businesses dedicated to making a profit off of people doing just that (just try to tell anyone with a straight face that Plex servers are only being used to connect to legitimate streaming services and stream people's digitized home movies).

      My point is that AI models touch on all of these areas that we have previously carved out as fair use, and AI models are useful tools that don't (despite claims to the contrary) clearly fall afoul of copyright law. So any argument that they do needs to think about where we draw the lines and what factors make up that decision. So far the courts have found that training an AI model on legally obtained materials and distributing that model is fair use, and they've explained how they reached that conclusion. So an argument to the contrary needs to draw a different line and explain why the line belongs there.

      • free_bip 7 hours ago

        If your argument is that all of these things somehow combine to make the specific case I mentioned in my original comment legal (which was "stealing the work of every single artist, living and dead, for the sole purpose of making a profit", to which I'll add replacing artists in the process), then I'm not seeing it.

        You also seem to be talking about AI training more generally and not the specific case I singled out, which matters because this isn't a case of simply training a model on content obtained with consent - the material OpenAI and Stability AI gathered was very explicitly taken without consent, and may have been obtained through outright piracy! (This came out in a case against Meta somewhat recently, but the exact origins of other companies' datasets remain a mystery.)

        Now I explained in another comment why I think current copyright laws should be able to clearly rule this specific case as copyright infringement, but I'm not arrogant enough to think I know better than copyright attorneys. If they say it falls under fair use, I'm going to trust them. I'm also going to say that the law needs to be updated because of it, and that brings us full circle to why I disagree with the article in the first place.