Comment by giardini 19 hours ago

33 replies

As I've said several times, the corpus is key: LLMs thus far "read" almost anything, but should instead be trained on well-curated corpora. "Garbage In, Garbage Out! (GIGO)" is the saying.

While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere. Leave Harry Potter for a different "Harry Potter LLM".

Train scientific LLMs to the level of a good early-20th-century English major and then use science texts and research papers for the remainder.

esafak 18 hours ago

That's got nothing to do with it. It's all about copyright. Can it reproduce its training data verbatim? If so, Meta is in hot water.

  • giardini 3 hours ago

    But if its corpora do NOT include the Harry Potter books, then Meta is NOT in hot water! So take the Harry Potter books out of the corpora. What is lost? Nothing useful IMO, other than the ability to discuss Harry Potter books. BFD.

  • strangescript 18 hours ago

    Suppose I read Harry Potter, you ask me about a page, and I recite it verbatim. Did I just commit copyright infringement?

    • lucianbr 17 hours ago

      Are you selling your ability to recite stuff? Then certainly.

    • bitmasher9 18 hours ago

      I pay for a service. The service recites a novel to me. The service would need permission to do this or it is copyright infringement.

    • __loam 18 hours ago

      This is an extremely common strawman argument. We're not discussing human memory.

Jap2-0 18 hours ago

> While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere

To address this point, and not other concerns: the benefits would be (1) pop culture knowledge and (2) having a variety of styles of edited/reasonably good-quality prose.

alephnerd 18 hours ago

> While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere

It has copyright implications - if Claude can recollect 42% of a copyrighted product without attribution or royalties, how did Anthropic train it?

> Train scientific LLMs to the level of a good early 20th century English major and then use science texts and research papers for the remainder

Plenty of in-stealth companies are approaching LLMs this way ;)

For those of us who studied the natural sciences and CS in the 2000s and early 2010s, there was a bit of a trend where certain PIs would simply translate German and Russian papers from the early-to-mid 20th century and attribute them to themselves in fields like CS (especially in what became ML).

  • epgui 18 hours ago

    > It has copyright implications - if Claude can recollect 42% of a copyrighted product without attribution or royalties, how did Anthropic train it?

    Personally I’m assuming the worst.

    That being said, Harry Potter was such a big cultural phenomenon that I wonder to what degree one might actually be able to reconstruct the books based solely on publicly accessible derivative material.

  • weird-eye-issue 18 hours ago

    Why are you talking about Claude and Anthropic?

    • cshimmin 18 hours ago

      It’s not unreasonable to suspect they are doing the same. The article starts with a description of a lawsuit the NY Times brought against OpenAI for similar reasons. The big difference is that the research presented here is only possible with open-weight models. OAI and Anthropic don’t make the base models available, so it’s easier to hide the fact that you’ve used copyrighted material by instruction post-training. And I’m not sure you can get the logprobs for specific tokens from their APIs either (which is what the researchers did to make the figures and come up with a concrete number like 42%).
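
      For anyone wondering how per-token logprobs could turn into a single percentage, here is a rough sketch of the general idea, not the researchers' actual code. The 0.5 threshold, the 4-token excerpts, the function names, and the logprob values are all made up for illustration; the thread doesn't say what the paper actually used.

```python
import math


def sequence_probability(token_logprobs):
    """Probability of a whole continuation: exp of the summed per-token logprobs."""
    return math.exp(sum(token_logprobs))


def memorized_fraction(excerpts, threshold=0.5):
    """Fraction of excerpts whose continuation probability is >= threshold.

    `threshold` is an assumed cutoff chosen for this example only.
    """
    if not excerpts:
        return 0.0
    hits = sum(1 for lps in excerpts if sequence_probability(lps) >= threshold)
    return hits / len(excerpts)


# Invented logprobs for three 4-token excerpts (illustrative data only).
excerpts = [
    [-0.01, -0.02, -0.05, -0.03],  # near-certain tokens
    [-1.2, -0.9, -2.3, -0.7],      # ordinary uncertainty
    [-0.1, -0.2, -0.1, -0.2],      # borderline
]

print(memorized_fraction(excerpts))
```

      With an open-weight model you can read these logprobs straight off the forward pass for any text you choose, which is why this kind of measurement is easier there.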

    • alephnerd 17 hours ago

      Good call! I brain farted and wrote Claude/Anthropic instead of Meta/Llama.

  • ninetyninenine 18 hours ago

    So if I memorized Harry Potter the physical encoding which definitely exists in my brain is a copyright violation?

    • dvt 18 hours ago

      > the physical encoding which definitely exists in my brain is a copyright violation

      First of all, we don't really know how the brain works. I get that you're being a snarky physicalist, but there are plenty of substance dualists, panpsychists, etc. out there. So, some might say, this is a reductive description of what happens in our brains.

      Second of all, yes, if you tried to publish Harry Potter (even if it was from memory), you would get in trouble for copyright violation.

      • ninetyninenine 18 hours ago

        Right, but the physical encoding already exists in my brain; otherwise how could I reproduce it in the first place? We may not know how the encoding works, but we do know that an encoding exists because a decoding is possible.

        My question is… is that in itself a violation of copyright?

        If not, then as long as LLMs don’t publish anything, it shouldn’t be a copyright violation, right? Because we don’t understand how it’s encoded in LLMs either. It is literally the same concept.

    • lithiumii 18 hours ago

      You are not selling or distributing copies of your brain.

    • harry8 18 hours ago

      If you perform it from memory in public without paying royalties then yes, yes it is.

      Should it be? Different question.

    • JKCalhoun 18 hours ago

      The end of "Fahrenheit 451" set a horrible precedent. Damn you, Bradbury!

    • beowulfey 18 hours ago

      Only if you charge someone to reproduce it for them.