Comment by fuzzbazz

Comment by fuzzbazz a day ago

A quick web search shows that some book review sites let users enter and rate verbatim "quotes" from books. This one [1] contains ~2000 [2] passages from Harry Potter and the Sorcerer's Stone, each a portion of a sentence, a paragraph, or several paragraphs.

Is it plausible that an LLM ingested parts of the book by scraping web pages like this one, rather than the full copyrighted book, and still produced results similar to those of the linked study?

[1] https://www.goodreads.com/work/quotes/4640799-harry-potter-a...

[2] ~30 portions per page x 68 pages ≈ 2,040

paxys 16 hours ago

Meta has trained on LibGen so we don't really need to speculate.

https://www.wired.com/story/new-documents-unredacted-meta-co...

aprilthird2021 16 hours ago

This is in fact mentioned and addressed in the article. Also, there is pretty clear-cut evidence that Meta knowingly used pirated book datasets to train the earlier Llama models.

aspenmayer 19 hours ago

Sure, why not? lol

https://www.reddit.com/r/DataHoarder/comments/1entowq/i_made...

https://github.com/shloop/google-book-scraper

That Meta torrented Books3 and other datasets appears to have been admitted by the Meta employees who did the work or oversaw those who did, so it is not really in dispute or ambiguous.

https://torrentfreak.com/meta-admits-use-of-pirated-book-dat...

redox99 16 hours ago

Books3 was used in Llama1. We don't know if they used it later on.

- aspenmayer 16 hours ago
  
  My comparison was illustrative and analogous in nature. The copyright cartel is making a fruit of the poisonous tree type of argument. Whatever Meta are doing with LLMs is doing the heavy lifting that parity files used to do back in the Usenet days. I wouldn’t be surprised if BitTorrent or other similar caching and distribution mechanisms incorporate AI/LLMs to recognize an owl on the wire, draw the rest just in time in transit, and just send the diffs, or something like that.
  The pictures are the same. All roads lead to Rome, so they say.
  
- aprilthird2021 16 hours ago
  
  All of the major AI models these days use "clean" datasets stripped of copyrighted material.
  They also use data from the previous models, so I'm not sure how "clean" it really is.
  
  dragonwriter 15 hours ago
  
  > All of the major AI models these days use "clean" datasets stripped of copyrighted material.
  
  Which of the major commercial models discloses its dataset? Or are you just trusting some unfalsifiable self-serving PR characterization?
  
  pclmulqdq 15 hours ago
  
  All written text is copyrighted, with few exceptions like court transcripts. I own the copyright to this inane comment. I sincerely doubt that all copyrighted material is scrubbed.
  
  Tepix 15 hours ago
  
  Your brief comment is hardly copyrightable, which makes your point moot.
  