Comment by nativeit
> Large language models like Claude are pretrained on enormous amounts of public text from across the internet, including personal websites and blog posts…
Handy, since they freely admit to broad copyright infringement right there in their own article.
They argue it is fair use. I have no legal training, so I wouldn't know, but what I can say is that if "we read the public internet and use it to set matrix weights" is always copyright infringement, then that description also covers Google PageRank, not just LLMs.
(And it also includes Google Translate, which is a transformer-based model just like LLMs are; it's simply trained to respond with translations rather than mostly conversational answers.)