Comment by abtinf 15 hours ago

You really don't see the difference between Google indexing the content of third parties and directly hosting/distributing the content itself?

imgabe 15 hours ago

Hosting model weights is not hosting / distributing the content.

  • abtinf 15 hours ago

    Of course it is.

    It's just a form of compression.

    If I train an autoencoder on an image, and distribute the weights, that would obviously be the same as distributing the content. Just because the content is commingled with lots of other content doesn't make it disappear.

    Besides, where did the sections of text from the input works that show up in the output text come from? Divine inspiration? God whispering to the machine?
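    The autoencoder point above can be made concrete with a toy sketch (a hypothetical 4-pixel "image" and a hand-constructed 1-dimensional decoder, not a real training run) showing how weights can literally contain the content:

    ```python
    # A degenerate "autoencoder" overfit to a single image: the decoder
    # weights are just the image divided by the latent code, so distributing
    # the weights is distributing the image.
    image = [12, 200, 37, 99]                      # hypothetical 4-pixel image
    code = 1.0                                     # 1-dimensional latent code
    decoder_weights = [p / code for p in image]    # the image hides in the weights
    reconstruction = [w * code for w in decoder_weights]
    assert reconstruction == image                 # perfect recovery from weights
    ```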

    • aschobel 14 hours ago

      Indeed! It is a form of massive lossy compression.

      > Llama 3 70B was trained on 15 trillion tokens

      That's roughly a 200x "compression" ratio, compared to 3-7x for traditional lossless text compression like bzip2 and friends.

      LLMs don't just compress, they generalize. If they could only recite Harry Potter perfectly but couldn't write code or explain math, they wouldn't be very useful.
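      A rough sanity check on that ~200x figure (simply tokens per parameter, ignoring bytes-per-token and bytes-per-parameter details):

      ```python
      # Back-of-the-envelope check of the ~200x "compression" figure above.
      tokens = 15e12   # training tokens quoted for Llama 3 70B
      params = 70e9    # model parameters
      print(round(tokens / params))  # -> 214, on the order of 200 tokens per parameter
      ```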

    • imgabe 15 hours ago

      [flagged]

      • tsimionescu 13 hours ago

        > For one thing, they are probabilistic, so you wouldn't get the same content back every time like you would with a compression algorithm.

        There is nothing inherently probabilistic in a neural network. The network always outputs the exact same values for the same input. We typically use those values in a larger program as probabilities over tokens, but that isn't required to get data out. You could just as easily take the highest-valued output deterministically, with an extra tie-breaking rule for when multiple outputs have the exact same value (e.g. pick the output neuron with the lowest index).
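        A minimal sketch of that deterministic decoding rule (hypothetical logits, pure Python):

        ```python
        def pick_token(logits):
            # Greedy, fully deterministic decoding: take the highest-scoring
            # output. Ties go to the lowest index, the extra rule described above.
            best = 0
            for i, v in enumerate(logits):
                if v > logits[best]:
                    best = i
            return best

        print(pick_token([0.1, 2.5, 2.5, -1.0]))  # -> 1 (tie broken toward the lower index)
        ```

        Same input, same output, every time; the sampling step is a choice made by the surrounding program, not a property of the network.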

      • vrighter 14 hours ago

        I have, but I never tried to make any money off of it either

      • xigoi 8 hours ago

        > For one thing, they are probabilistic, so you wouldn't get the same content back every time like you would with a compression algorithm.

        If I make a compression algorithm that randomly changes some pixels, can I use it to distribute pirated movies?

      • bakugo 14 hours ago

        > Have you ever repeated a line from your favorite movie or TV show? Memorized a poem? Guess the rights holders better sue you for stealing their content by encoding it in your wetware neural network.

        I see this absolute non-argument regurgitated ad infinitum in every single discussion on this topic, and at this point I can't help but wonder: doesn't it say more about the person who says it than anything else?

        Do you really consider your own human speech no different than that of a computer algorithm doing a bunch of matrix operations and outputting numbers that then get turned into text? Do you truly believe ChatGPT deserves the same rights to freedom of speech as you do?

      • homebrewer 12 hours ago

        Repeating half of the book verbatim is not nearly the same as repeating a line.

      • invalidusernam3 12 hours ago

        Difference is if it's used commercially or not. Me singing my favourite song at karaoke is fine, but me recording that and releasing it on Spotify is not

      • abtinf 15 hours ago

        [flagged]

        • imgabe 14 hours ago

          No, the second point does not concede the argument. You were talking about the model output infringing the copyright, the second point is talking about the model input infringing the copyright, e.g. if they made unauthorized copies in the process of gathering data to train the model such as by pirating the content. That is unrelated to whether the model output is infringing.

          You don't seem to be in a very good position to judge what is and is not obtuse.

  • Wowfunhappy 4 hours ago

    I would be inclined to agree except apparently 42% of the first Harry Potter book is encoded in the model weights...

Zambyte 15 hours ago

Where are they putting any blame on Google here?

  • abtinf 15 hours ago

    Where did I say they were?

    • Zambyte 6 hours ago

      When you juxtaposed Google indexing with third parties hosting the content...?

nashashmi 14 hours ago

The way I see it, the LLM took search results and outputted that info directly. Besides, I think that if an LLM can reproduce 42%, assuming it is not one contiguous chunk, I would say that is fair use.