Comment by nativeit 4 days ago

> Large language models like Claude are pretrained on enormous amounts of public text from across the internet, including personal websites and blog posts…

Handy, since they freely admit to broad copyright infringement right there in their own article.

ben_w 4 days ago

They argue it is fair use. I have no legal training so I wouldn't know, but what I can say is that if "we read the public internet and use it to set matrix weights" is always copyright infringement, then what I've just described also covers Google PageRank, not just LLMs.

(And it also includes Google Translate, which is even a transformer-based model like LLMs are; it's just trained to respond with translations rather than mostly-conversational answers.)

cowl 4 days ago

Google Translate has nothing in common. It's a single action taken on demand on behalf of the user. It's not a mass scrape just in case. In that regard it's an end-user tool and it has legal access to everything that the user has.
Google PageRank in fact was forced by many countries to pay various publications for indexing their sites. And they had a much stronger case to defend, because indexing was not taking users away from the publisher but helping them find the publisher. LLMs, on the contrary, aim to be a substitute for the final destination, so their fair-use case does not stand a chance. In fact, just last week Anthropic settled for $1.5B over books it had scraped.

- ben_w 4 days ago
  
  > Google Translate has nothing in common. It's a single action taken on demand on behalf of the user. It's not a mass scrape just in case. In that regard it's an end-user tool and it has legal access to everything that the user has.

  How exactly do you think Google Translate translates things? How does it know which words to use, especially for idioms?

  > Google PageRank in fact was forced by many countries to pay various publications for indexing their sites.

  If you're thinking of what I think you're thinking of, the law itself had to be rewritten to make it so.

  But they've had so many lawsuits that you may have a specific example in mind that I've skimmed over in the last 30 years of living through their impact on the world: https://en.wikipedia.org/wiki/Google_litigation#Intellectual...

  Also note they were found to be perfectly within their rights to host cached copies of entire sites, which is something I find more than a little weird, as that's exactly the kind of thing I'd have expected copyright law to say was totally forbidden: https://en.wikipedia.org/wiki/Field_v._Google,_Inc.

  > And they had a much stronger case to defend, because indexing was not taking users away from the publisher but helping them find the publisher. LLMs, on the contrary, aim to be a substitute for the final destination, so their fair-use case does not stand a chance.

  Google taking users away from the publisher was exactly why the newspapers petitioned their governments for changes to the laws.

  > In fact, just last week Anthropic settled for $1.5B over books it had scraped.

  From NPR's coverage:

  > In his June ruling, Judge Alsup agreed with Anthropic's argument, stating the company's use of books by the plaintiffs to train their AI model was acceptable. "The training use was a fair use," he wrote. "The use of the books at issue to train Claude and its precursors was exceedingly transformative." However, the judge ruled that Anthropic's use of millions of pirated books to build its models – books that websites such as Library Genesis (LibGen) and Pirate Library Mirror (PiLiMi) copied without getting the authors' consent or giving them compensation – was not. He ordered this part of the case to go to trial. "We will have a trial on the pirated copies used to create Anthropic's central library and the resulting damages, actual or statutory (including for willfulness)," the judge wrote in the conclusion to his ruling. Last week, the parties announced they had reached a settlement.

  - https://www.npr.org/2025/09/05/nx-s1-5529404/anthropic-settl...
  
beowulfey 4 days ago

Side note, was that a recent transition? When did it become transformer-based?

- ben_w 4 days ago
  
  This blog post is from mid-2020, so presumably a bit before that: https://research.google/blog/recent-advances-in-google-trans...