Comment by giardini 19 hours ago
As I've said several times, the corpus is key: LLMs thus far "read" almost anything, but should instead be trained on well-curated corpora. "Garbage In, Garbage Out!" (GIGO) is the saying.
While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere. Leave Harry Potter for a different "Harry Potter LLM".
Train scientific LLMs to the level of a good early-20th-century English major, and then use science texts and research papers for the remainder.
That's got nothing to do with it. It's all about copyright. Can it reproduce its training data verbatim? If so, Meta is in hot water.