Comment by fennecfoxy 11 hours ago
I mean, it makes sense. Same thing as George RR Martin complaining that LLMs can spit out chunks of his books (finish your books already!!).
As I have pointed out many times before: for GRRM's books and for the HP books, the Internet is FILLED to the brim with quotes from them. There are uploads of the entire books, and several (not just one) fan wikis for each fandom. These books are pop culture sensations, so there's an enormous amount of content online quoting them.
So of course they're weighted heavily when you train an LLM by just feeding it the Internet. If a model could ever recite an entire book 100% verbatim, in the correct order, that would be overfitting. Otherwise it's just plain and simple high occurrence in the training data.
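To make the "regurgitation vs. frequency" point concrete: a rough way to check how much of a quote a model reproduces verbatim is to measure the longest contiguous run of words its output shares with the source text. This is just a toy sketch of my own (not anything from GRRM's complaint or any particular lab's methodology); famous, heavily-quoted passages will naturally produce long runs, while a full-book-length match would point at memorization.

```python
import re

def words(text: str) -> list[str]:
    # Lowercase and strip punctuation so quoting style doesn't break matches.
    return re.findall(r"[a-z']+", text.lower())

def longest_shared_run(output: str, source: str) -> int:
    """Length (in words) of the longest contiguous span of `output`
    that also appears verbatim in `source`."""
    out = words(output)
    src = " " + " ".join(words(source)) + " "  # pad to force word boundaries
    best = 0
    for i in range(len(out)):
        lo, hi = 0, len(out) - i
        # Binary search the longest matching span starting at word i:
        # if a span of length mid appears in the source, shorter ones do too.
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if " " + " ".join(out[i:i + mid]) + " " in src:
                lo = mid
            else:
                hi = mid - 1
        best = max(best, lo)
    return best

book = "When you play the game of thrones, you win or you die."
reply = ("As the saying goes, when you play the game of thrones "
         "you win or you die - there is no middle ground.")
print(longest_shared_run(reply, book))  # 12: the whole famous quote, verbatim
```

A 12-word run here just reflects a quote that appears all over the Internet; you'd only start suspecting overfitting if runs like this chained together across whole chapters.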