Comment by TeMPOraL

I’m looking for a link that does something like this but ends up supporting commercial LLM’s

https://copyrightalliance.org/faqs/what-is-fair-use/

The purpose and character of the use, including whether such use is of a commercial nature or is for non-profit educational purposes; (commercial least wiggle room) The nature of the copyrighted work; (fictional work least wiggle room) The amount and substantiality of the portion used in relation to the copyrighted work as a whole; (42% is considered a huge fraction of a book) and The effect of the use upon the potential market for or value of the copyrighted work. (Best argument as it’s minimal as a piece of entertainment. Not so as a cultural icon. Someone writing a book report or fan fiction may be less likely to buy a copy. )

Those aren’t the only factors, but I’m more interested in the counter argument here than trying to say they are copyright infringing.

For this part in particular:

> The amount and substantiality of the portion used in relation to the copyrighted work as a whole; (42% is considered a huge fraction of a book)

For AI models as they currently exist… I'm not sure about typical or average, but Llama 3 is 15e12 tokens for all models sizes up to 409 billion parameters (~37 tokens per parameter), so a 100,000 token book (~133,000 words) is effectively contributing about 2700 parameters to the whole model.

The *average* book is condensed into a summary of that book, and of the style of that book. This is also why, when you ask a model for specific details of stuff in the training corpus, what you get back *usually* normally only sound about right rather than being an actual quote, and why LLMs need to have access to a search engine to give exact quotes — the exceptions are things that been quoted many many times like the US constitution or, by the look of things from this article, widely pirated books where there's a lot of copies.

Mass piracy leading to such infringement is still bad, but I think the reasons why matter: Given Meta is accused of mass piracy to get the training set for Llama, I think they're as guilty as can be, but if this had been "we indexed the open internet, pirate copies were accidental", this would be at least a mitigation.

(There's also an argument for "your writing is actually very predictable"; I've not read the HP books myself, though (1) I'm told the later ones got thicker due to repeating exposition of the previous books, and (2) a long-running serialised story I read during the pandemic, The Deathworlders, became very predictable towards the end, so I know it can happen).

Conversely, for this part:

> The effect of the use upon the potential market for or value of the copyrighted work. (Best argument but as it’s minimal as a piece of entertainment. Not so as a cultural icon. Someone writing a book report or fan fiction may be less likely to buy a copy. )

The current uses alone should make it clear that the effect on the potential market is catastrophic, and not just for existing works but also for not-yet-written ones.

People are using them to write blogs (directly from the LLM, not a human who merely used one as a copy-editor), and to generate podcasts (some have their own TTS, but that's easy anyway). My experiments suggest current models are still too flawed to be worth listening to them over e.g. the opinion of a complete stranger who insists they've "done their own research": https://github.com/BenWheatley/Timeline-of-the-near-future

LLMs are not yet good enough to write books, but I have tried using them to write short stories to keep track of capabilities, and o1 is already better than similar short stories on Reddit (not "good", just "better"): https://github.com/BenWheatley/Studies-of-AI/blob/main/Story...

But things do change, and I fully expect the output of various future models (not necessarily Transformer based) to increase the fraction of humans whose writings they surpass. I'm not sure what counts as "professional writer", but the U.S. Bureau of Labor Statistics says there's 150,000 "Writers and Authors"* out of a total population of about 340 million, so when AI is around the level of the best 0.04% of the population then it will start cutting into such jobs.

On the basis that current models seem (to me) to write software at about the level of a recent graduate, and with the potentially incorrect projection that this is representative across domains, and there are about 1.7 million software developers and 100k new software developer graduates each year, LLMs today would be be around the 100k worst of the 1.7 million best out of 340 million people — i.e. all software developers are the top 0.5% of the population, LLMs are on-par with the bottom 0.03 of that. (This says nothing much about how soon the models will improve).

But of course, some of that copyrighted content is about software development, and we're having conversations here on HN about the trouble fresh graduates are having and if this is more down to AI, the change of US R&D taxation rules (unlikely IMO, I'm in Germany and I think the same is happening here), or the global economy moving away from near-zero interest rates.

* https://www.bls.gov/ooh/media-and-communication/writers-and-...

Retric a day ago

Reply View 2 replies

TheOtherHobbes a day ago

Copyright notices in books make it absolutely clear - you are not allowed to acquire a text by copying it without authorisation.
If you photocopy a book you haven't paid for, you've infringed copyright. If you scan it, you've infringed copyright. If you OCR the scan, you've infringed copyright.
There's legal precedent in going after torrenters and z-lib etc.
So when Zuckerberg told the Meta team to do the same, he was on the wrong side of precedent.
Arguing otherwise is literally arguing that huge corporations are somehow above laws that apply to normal people.
Obviously some people do actually believe this. Especially the people who own and work for huge corporations.
But IMO it's far more dangerous culturally and politically than copyright law is.

Reply View | 0 replies
ben_w a day ago

For this part in particular:
> The amount and substantiality of the portion used in relation to the copyrighted work as a whole; (42% is considered a huge fraction of a book)
For AI models as they currently exist… I'm not sure about typical or average, but Llama 3 is 15e12 tokens for all models sizes up to 409 billion parameters (~37 tokens per parameter), so a 100,000 token book (~133,000 words) is effectively contributing about 2700 parameters to the whole model.
The *average* book is condensed into a summary of that book, and of the style of that book. This is also why, when you ask a model for specific details of stuff in the training corpus, what you get back *usually* normally only sound about right rather than being an actual quote, and why LLMs need to have access to a search engine to give exact quotes — the exceptions are things that been quoted many many times like the US constitution or, by the look of things from this article, widely pirated books where there's a lot of copies.
Mass piracy leading to such infringement is still bad, but I think the reasons why matter: Given Meta is accused of mass piracy to get the training set for Llama, I think they're as guilty as can be, but if this had been "we indexed the open internet, pirate copies were accidental", this would be at least a mitigation.
(There's also an argument for "your writing is actually very predictable"; I've not read the HP books myself, though (1) I'm told the later ones got thicker due to repeating exposition of the previous books, and (2) a long-running serialised story I read during the pandemic, The Deathworlders, became very predictable towards the end, so I know it can happen).
Conversely, for this part:
> The effect of the use upon the potential market for or value of the copyrighted work. (Best argument but as it’s minimal as a piece of entertainment. Not so as a cultural icon. Someone writing a book report or fan fiction may be less likely to buy a copy. )
The current uses alone should make it clear that the effect on the potential market is catastrophic, and not just for existing works but also for not-yet-written ones.
People are using them to write blogs (directly from the LLM, not a human who merely used one as a copy-editor), and to generate podcasts (some have their own TTS, but that's easy anyway). My experiments suggest current models are still too flawed to be worth listening to them over e.g. the opinion of a complete stranger who insists they've "done their own research": https://github.com/BenWheatley/Timeline-of-the-near-future
LLMs are not yet good enough to write books, but I have tried using them to write short stories to keep track of capabilities, and o1 is already better than similar short stories on Reddit (not "good", just "better"): https://github.com/BenWheatley/Studies-of-AI/blob/main/Story...
But things do change, and I fully expect the output of various future models (not necessarily Transformer based) to increase the fraction of humans whose writings they surpass. I'm not sure what counts as "professional writer", but the U.S. Bureau of Labor Statistics says there's 150,000 "Writers and Authors"* out of a total population of about 340 million, so when AI is around the level of the best 0.04% of the population then it will start cutting into such jobs.
On the basis that current models seem (to me) to write software at about the level of a recent graduate, and with the potentially incorrect projection that this is representative across domains, and there are about 1.7 million software developers and 100k new software developer graduates each year, LLMs today would be be around the 100k worst of the 1.7 million best out of 340 million people — i.e. all software developers are the top 0.5% of the population, LLMs are on-par with the bottom 0.03 of that. (This says nothing much about how soon the models will improve).
But of course, some of that copyrighted content is about software development, and we're having conversations here on HN about the trouble fresh graduates are having and if this is more down to AI, the change of US R&D taxation rules (unlikely IMO, I'm in Germany and I think the same is happening here), or the global economy moving away from near-zero interest rates.
* https://www.bls.gov/ooh/media-and-communication/writers-and-...

Reply View | 0 replies