Comment by giardini

giardini 19 hours ago

As I've said several times, the corpus is key: LLMs thus far "read" almost anything, but they should instead be trained on well-curated corpora. "Garbage In, Garbage Out!" (GIGO) is the saying.

While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere. Leave Harry Potter for a different "Harry Potter LLM".

Train scientific LLMs to the level of a good early 20th century English major and then use science texts and research papers for the remainder.

esafak 18 hours ago

That's got nothing to do with it. It's all about copyright. Can it reproduce its training data verbatim? If so, Meta is in hot water.

giardini 3 hours ago

But if its corpora do NOT include the Harry Potter books, then Meta is NOT in hot water! So take the Harry Potter books out of the corpora. What is lost? Nothing useful IMO, other than the ability to discuss Harry Potter books. BFD.

strangescript 18 hours ago

Suppose I read Harry Potter, you ask me about a page, and I recite it verbatim. Did I just commit copyright infringement?

- lucianbr 17 hours ago
  
  Are you selling your ability to recite stuff? Then certainly.
  
  strangescript 17 hours ago
  
  There are plenty of open-source LLMs trained on Harry Potter. Is that fine?
  
  davidcbc 16 hours ago
  
  No
  
- bitmasher9 18 hours ago
  
  I pay for a service. The service recites a novel to me. The service would need permission to do this or it is copyright infringement.
  
- __loam 18 hours ago
  
  This is an extremely common strawman argument. We're not discussing human memory.
  
Jap2-0 18 hours ago

> While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere

To address this point, and not other concerns: the benefits would be (1) pop culture knowledge and (2) having a variety of styles of edited/reasonably good-quality prose.

alephnerd 18 hours ago

> While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere

It has copyright implications - if Claude can recollect 42% of a copyrighted product without attribution or royalties, how did Anthropic train it?

> Train scientific LLMs to the level of a good early 20th century English major and then use science texts and research papers for the remainder

Plenty of in-stealth companies are approaching LLMs this way ;)

For those of us who studied the natural sciences and CS in the 2000s and early 2010s, there was a bit of a trend where certain PIs would simply translate German and Russian papers from the early-to-mid 20th century and attribute them to themselves in fields like CS (especially in what became ML).

epgui 18 hours ago

> It has copyright implications - if Claude can recollect 42% of a copyrighted product without attribution or royalties, how did Anthropic train it?

Personally I’m assuming the worst.

That being said, Harry Potter was such a big cultural phenomenon that I wonder to what degree one might actually be able to reconstruct the books based solely on publicly accessible derivative material.

weird-eye-issue 18 hours ago

Why are you talking about Claude and Anthropic?

- cshimmin 18 hours ago
  
  It’s not unreasonable to suspect they are doing the same. The article starts with a description of a lawsuit the NY Times brought against OpenAI for similar reasons. The big difference is that the research presented here is only possible with open-weight models. OAI and Anthropic don’t make their base models available, so instruction post-training makes it easier to hide the fact that copyrighted material was used. And I’m not sure you can get the logprobs for specific tokens from their APIs either (which is what the researchers did to make the figures and come up with a concrete number like 42%).
  
- alephnerd 17 hours ago
  
  Good call! I brain farted and wrote Claude/Anthropic instead of Meta/Llama.
  
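
For anyone wondering how per-token logprobs could turn into a single figure like the 42% mentioned above, here is a minimal sketch. It is not the researchers' actual procedure: the function names, the 0.5 threshold, and the made-up logprob values are all assumptions for illustration. The idea is that summing the logprobs of a continuation's tokens gives the log of the probability that the model emits that exact continuation, and one can then count how many excerpts clear a chosen threshold.

```python
import math


def sequence_probability(token_logprobs):
    """Probability of emitting the whole continuation.

    The per-token log-probabilities add up to the log-probability of the
    full sequence, so exponentiating the sum gives the joint probability.
    """
    return math.exp(sum(token_logprobs))


def memorized_fraction(excerpt_logprobs, threshold=0.5):
    """Fraction of excerpts whose continuation probability exceeds threshold.

    `threshold` is a hypothetical cutoff chosen for this sketch.
    """
    if not excerpt_logprobs:
        return 0.0
    hits = sum(
        1 for logprobs in excerpt_logprobs
        if sequence_probability(logprobs) > threshold
    )
    return hits / len(excerpt_logprobs)


# Made-up logprobs for three hypothetical excerpts (not real model output).
excerpts = [
    [-0.01, -0.02, -0.01],  # model is nearly certain of every token
    [-1.2, -0.8, -2.0],     # model finds this continuation unlikely
    [-0.1, -0.2, -0.1],     # fairly likely continuation
]

fraction = memorized_fraction(excerpts)
print(f"{fraction:.0%} of excerpts exceed the threshold")
```

With real data, the logprobs would come from scoring the book's actual text under the model, which is why access to token-level logprobs matters for this kind of measurement.
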
ninetyninenine 18 hours ago

So if I memorized Harry Potter, the physical encoding which definitely exists in my brain is a copyright violation?

- dvt 18 hours ago
  
  > the physical encoding which definitely exists in my brain is a copyright violation
  
  First of all, we don't really know how the brain works. I get that you're being a snarky physicalist, but there are plenty of substance dualists, panpsychists, etc. out there. So, some might say, this is a reductive description of what happens in our brains.
  
  Second of all, yes, if you tried to publish Harry Potter (even if it was from memory), you would get in trouble for copyright violation.
  
  ninetyninenine 18 hours ago
  
  Right, but the physical encoding already exists in my brain; otherwise, how could I reproduce it in the first place? We may not know how the encoding works, but we do know that an encoding exists because a decoding is possible.
  
  My question is… is that in itself a violation of copyright?
  
  If not, then as long as LLMs don’t make a publication it shouldn’t be a copyright violation, right? Because we don’t understand how it’s encoded in LLMs either. It is literally the same concept.
  
- lithiumii 18 hours ago
  
  You are not selling or distributing copies of your brain.
  
- harry8 18 hours ago
  
  If you perform it from memory in public without paying royalties then yes, yes it is.
  Should it be? Different question.
  
- JKCalhoun 18 hours ago
  
  The end of "Fahrenheit 451" set a horrible precedent. Damn you, Bradbury!
  
- beowulfey 18 hours ago
  
  Only if you charge someone to reproduce it for them
  
- shrewduser 18 hours ago
  
  Maybe if you rewrote it from memory.
  
- teaearlgraycold 18 hours ago
  
  I think humans get a special exception in cases like this
  
  otabdeveloper4 13 hours ago
  
  No they don't. Commercial intent is what is prosecuted in IP law.
  