Comment by TeMPOraL 13 hours ago

Well, so can a nontrivial number of people. It's Harry Potter we're talking about - it's up there with The Bible in popularity ranking.

I'm gonna bet that Llama 3.1 can recall a significant portion of Pride and Prejudice too.

With examples of this magnitude, it's normal and entirely expected that this can happen, as it does with people[0]. The only thing this really tells us is that the model doesn't understand its position in society well enough to know to shut up; that obliging the request is going to land it, or its owners, in trouble.

In some way, it's actually perverted.

EDIT: it's even worse than that. What the research seems to be measuring is that the models recognize sentence-sized pieces of the book as likely continuations of an earlier sentence-sized piece. Not whether it'll reproduce that text when used straightforwardly - just whether there's an indication it recognizes the token patterns as likely.
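
A rough sketch of what "recognizing a continuation as likely" means, as opposed to generating it: you hand the model both the prefix and the candidate continuation, and sum the per-token log-probabilities it assigns. This is a toy under loud assumptions: a bigram counter stands in for the model, and the function names, the add-one smoothing and the Pride and Prejudice opening line used as data are illustrative choices of mine, not the setup used in the research.

```python
import math
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count next-token frequencies per token (toy stand-in for a language model)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def continuation_logprob(counts, prefix, continuation, vocab_size, alpha=1.0):
    """Sum of log P(token | previous token) over the continuation.

    Uses add-alpha smoothing so unseen pairs get a small nonzero probability.
    Nothing is generated here: the continuation is supplied and only scored.
    """
    logp = 0.0
    prev = prefix[-1]
    for tok in continuation:
        seen = counts.get(prev, Counter())
        total = sum(seen.values())
        p = (seen[tok] + alpha) / (total + alpha * vocab_size)
        logp += math.log(p)
        prev = tok
    return logp


# Hypothetical "training data": a single public-domain sentence.
text = ("it is a truth universally acknowledged that a single man in "
        "possession of a good fortune must be in want of a wife").split()
counts = train_bigram(text)
vocab_size = len(set(text))

prefix = text[:6]                        # "it is a truth universally acknowledged"
true_cont = text[6:10]                   # "that a single man"
scrambled = list(reversed(true_cont))    # same words, wrong order

true_score = continuation_logprob(counts, prefix, true_cont, vocab_size)
scrambled_score = continuation_logprob(counts, prefix, scrambled, vocab_size)
print(true_score, scrambled_score)
```

The real continuation scores higher than the scrambled one, which says the pattern is familiar to the model; it says nothing yet about whether sampling from the model would actually emit that text.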

By that standard, I bet there are over a billion people right now who could do that for 42% of the first Harry Potter book. By that standard, I too have memorized the Bible end-to-end, as have most people alive today, whether or not they're Christian; works this popular bleed through into common language usage patterns.

[0] - Even more so when you relax your criteria to accept an occasional misspelling or paraphrase - then each of us likely knows someone who could piece together a chunk of an HP book from memory.

strogonoff 13 hours ago

I keep waiting for the day when software stops being compared to a human person (a being with agency, free will, consciousness, and human rights of its own) for the purposes of justifying IP law circumvention.

Yes, there is no problem when a person reads some book and recalls pieces[0] of it in a suitable context. How that in any way addresses the case where certain people create and distribute commercial software, providing it that work as input, to perform such recall on demand and at scale, laundering and/or devaluing copyright, is unclear.

Notably, the above is being done not just to a few high-profile authors, but to all of us no matter what we do (be it music, software, writing, visual art).

What’s even worse is that imaginably they train (or would train) the models to specifically not output those things verbatim, precisely to thwart attempts to detect the presence of said works in the training dataset (which would naturally reveal the model and its output being a derivative work).

Perhaps one could find some way of justifying that (people justified all sorts of stuff throughout history), but let it be something better than “the model is assumed to be a thinking human when it comes to IP abuse but unthinking tool when it comes to using it for personal benefit”.

[0] Of course, if you find me a single person on this planet capable of recalling 42% of any Harry Potter book, I’d be very impressed if I ever believed it.

ben_w 7 hours ago

> I keep waiting for the day when software stops being compared to a human person (a being with agency, free will, consciousness, and human rights of its own) for the purposes of justifying IP law circumvention.
I mean, "agency" is a goal of some AI; "free will" is incoherent*; the word "consciousness" has about 40 different definitions, some of which are so broad they include thermostats and others so narrow that it's provably impossible for anything (including humans) to have it; and "human rights" are a purely legal concept.
> What’s even worse is that imaginably they train (or would train) the models to specifically not output those things verbatim, precisely to thwart attempts to detect the presence of said works in the training dataset (which would naturally reveal the model and its output being a derivative work).
Some of the makers certainly do as you say; but also, the more verbatim quotations a model can produce, the more computational effort that model needs to spend to get the far more useful general purpose results.
* I'm not a fan of Aleister Crowley, but I think he was right to say that there's only one thing you can actually do that's truly your own will and not merely you allowing others to influence you: https://en.wikipedia.org/wiki/True_Will

- strogonoff 5 hours ago
  
  > and "human rights" are a purely legal concept.
  Yep, and if you claim that a thing can reproduce IP like a human then you should explain why you are also not holding its operators to the same legal standard (try to use a human in the same way and it will be considered torture and slavery).
  
  ben_w 4 hours ago
  
  I am specifically not using that to claim "and therefore the AI is a human". The point is that "human rights" are not part of the natural order, they only exist as laws.
  This means that "human rights" is basically irrelevant to this topic: AIs may have rights and need to be liberated, or they may be tools that don't, but the law is just words on paper, and officials who make you follow those words.
  
fennecfoxy 10 hours ago

I keep waiting for the day when people realise that IP law has been used and abused and, thanks to Disney, extended out for many, many lifetimes, with all manner of dirty tricks/hacks to keep the late stage capitalism profit engine going.
I 100% agree that if an LLM can entirely reproduce a book then that is copyright infringement, overfitting and generally a bad model. I also believe that in this case, HP (and other popular media) is overrepresented in the training data because of the many fan sites/literal uploads of the book to the Internet (which the model was trained on). I believe that any & all human writing should be allowed to be used to train a model that behaves in the correct way, so long as that writing is publicly available (i.e. on the Internet).
If I watch a TV show that someone uploaded to YouTube, am I committing a crime? Or is the uploader, for distribution?
I also find it hilarious how many artists got their start by pirating Photoshop.

- ab5tract 9 hours ago
  
  Laws can have been used and abused and still be important. I know it’s hard to believe but the independent artists who were already struggling need IP laws to survive.
  Otherwise Disney and the like can just come in, make copies or derivatives, and profit without paying those artists a penny.
  Which everyone usually agrees (or used to) is not a fair outcome.
  But somehow giant corporations not named Disney taking the same work in the same extractive mode in order to create an art-job-destroying machine is totally fine because Disney bad?
  Maybe most people making this argument are also all for UBI and wealth redistribution on a massive scale, but they don’t seem to mention it much when trashing IP laws.
  
msp26 8 hours ago

Agree completely. When I read the Gemma 3 paper (https://arxiv.org/html/2503.19786v1) and saw an entire section dedicated to measuring and reducing the memorization rate I was annoyed. How does this benefit end users at all?

I want the language model I'm using to have knowledge of cultural artifacts. Gemma 3 27B was useless at a question related to grouping Berserk characters by potential Baldur's Gate 3 classes; Claude did fine. The methods used to reduce memorisation rate probably also deteriorate performance in some other ways that don't show up on benchmarks.

ben_w 7 hours ago

> When I read the Gemma 3 paper (https://arxiv.org/html/2503.19786v1) and saw an entire section dedicated to measuring and reducing the memorization rate I was annoyed. How does this benefit end users at all?
It benefits users because memorisation is a waste of parameters that would be more useful if they were instead learning rules and generalisations.
For short snippets, common idioms and quotations that people recognise, exact quotes can be worth memorising; but the longer the quotations get, the less often it is important to be word-for-word exact — even for just a few paragraphs, I think most people only ever do oaths, anthems, songs they really like, and possibly a few hobbies.
If you want an exact quote, use (or tell the AI to use) a search engine.
