Comment by Miraltar 2 days ago

7 replies

I guess it would be interesting, but differentiating pollution from language evolution seems very tricky, since a non-polluted corpus gets harder and harder to find.

Retr0id 2 days ago

Arguably it is a form of language evolution. I bet humans have started using "delve" more too, on average. I think the best we can do is look at the trends and think about potential causes.

  • rvnx 2 days ago

    “Seamless”, “honed”, “unparalleled”, “delve” are now polluting the landscape because of monkeys repeating what ChatGPT says without even questioning what the words mean.

    Everything is “seamless” nowadays. Like I am seamlessly commenting here.

    Arguably, the meanings of these words evolve due to misuse too.

    • oneeyedpigeon 2 days ago

      I see a lot of writing in my day-to-day, and the words that stick out most are things like "plethora" and "utilized". They're not terribly obscure, but they're just 'odd' and, maybe, formal enough to really stick out when overused.

    • lobsterthief 2 days ago

      Btw can’t people just open their prompts by instructing LLMs not to use those words?

  • pavel_lishin 2 days ago

    > I bet humans have started using "delve" more too, on average.

    I wish there were a way to check.

    • linhns 9 hours ago

      I'm seeing more and more uses of it on this thread.
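
For what it's worth, the "look at the trends" idea can be roughed out in a few lines, assuming you have a pile of dated documents to hand. This is only a sketch: `rate_per_million` and the `(year, text)` input shape are made up for illustration, the tokenizer is deliberately crude, and it matches the exact word only, so "delves" and "delving" are not counted. It also says nothing about whether a human or an LLM wrote any given text, which is the hard part raised above.

```python
import re
from collections import defaultdict


def rate_per_million(docs, word):
    """Occurrences of `word` per million tokens, grouped by year.

    `docs` is an iterable of (year, text) pairs (a hypothetical input shape).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    target = word.lower()
    for year, text in docs:
        # Crude tokenizer: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[year] += len(tokens)
        hits[year] += sum(1 for t in tokens if t == target)
    # Skip years with no tokens to avoid dividing by zero.
    return {
        year: 1_000_000 * hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year]
    }


# Toy data, invented purely to show the call.
docs = [
    (2021, "we look into it"),
    (2024, "we delve into it"),
    (2024, "delve delve now later"),
]
rates = rate_per_million(docs, "delve")
```

Normalizing by token count matters here, since the volume of text per year varies a lot and raw counts would mostly track that.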

wpietri 2 days ago

One way to tackle it would be to use LLMs to generate synthetic corpuses, so you have some good fingerprints for pollution. But even there I'm not sure how doable that is given the speed at which LLMs are being updated. Even if I know a particular page was created in, say, January 2023, I may no longer be able to try to generate something similar now to see how suspect it is, because the precise setups of the moment may no longer be available.
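
A minimal sketch of the fingerprinting idea, assuming you already have a set of LLM-generated texts and a reference set you trust to be pre-LLM. It ranks words by a smoothed log ratio of their relative frequencies in the two sets. The function names, the add-one smoothing, and the `min_count` cutoff are all choices made for this illustration, not an established method, and the result only describes whichever model and prompts produced the synthetic set.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def fingerprint_words(synthetic_texts, reference_texts, top_k=10, min_count=2):
    """Rank words that are over-represented in the synthetic corpus.

    Returns up to `top_k` (word, score) pairs, where score is the log ratio
    of add-one-smoothed relative frequencies (synthetic vs. reference).
    """
    syn = Counter(t for text in synthetic_texts for t in tokenize(text))
    ref = Counter(t for text in reference_texts for t in tokenize(text))
    vocab_size = len(set(syn) | set(ref))
    # Add-one smoothing so words absent from the reference do not divide by zero.
    syn_total = sum(syn.values()) + vocab_size
    ref_total = sum(ref.values()) + vocab_size
    scores = {}
    for word, count in syn.items():
        if count < min_count:
            continue  # too rare in the synthetic set to say anything
        scores[word] = (
            math.log((count + 1) / syn_total)
            - math.log((ref[word] + 1) / ref_total)
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy corpora, invented purely to show the call.
synthetic = [
    "let us delve into this seamless plan",
    "we delve into a seamless design",
]
reference = [
    "let us look into this plan",
    "we look into a simple design",
]
top = fingerprint_words(synthetic, reference, top_k=2)
```

The reproducibility worry still applies: a fingerprint built from today's model may not match what was being generated in January 2023.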