Comment by oneeyedpigeon 2 days ago

8 replies

I wonder if anyone will fork the project. Apart from anything else, the data may still be useful given that we know it is polluted. In fact, it could act as a means of judging the impact of LLMs via that very pollution.

Miraltar 2 days ago

I guess it would be interesting, but differentiating pollution from language evolution seems very tricky, since getting a non-polluted corpus gets harder and harder.

  • Retr0id 2 days ago

    Arguably it is a form of language evolution. I bet humans have started using "delve" more too, on average. I think the best we can do is look at the trends and think about potential causes.

    • rvnx 2 days ago

      “Seamless”, “honed”, “unparalleled”, “delve” are now polluting the landscape because of monkeys repeating what ChatGPT says without even questioning what the words mean.

      Everything is “seamless” nowadays. Like I am seamlessly commenting here.

      Arguably, the meanings of these words evolve due to misuse too.

      • oneeyedpigeon 2 days ago

        I see a lot of writing in my day-to-day, and the words that stick out most are things like "plethora" and "utilized". They're not terribly obscure, but they're just 'odd' and, maybe, formal enough to really stick out when overused.

      • lobsterthief 2 days ago

        Btw can’t people just open their prompts by instructing LLMs not to use those words?

    • pavel_lishin 2 days ago

      > I bet humans have started using "delve" more too, on average.

      I wish there were a way to check.

      • linhns 9 hours ago

        I'm seeing more and more uses of it in this thread.

  • wpietri 2 days ago

    One way to tackle it would be to use LLMs to generate synthetic corpuses, so you have some good fingerprints for pollution. But even there I'm not sure how doable that is given the speed at which LLMs are being updated. Even if I know a particular page was created in, say, January 2023, I may no longer be able to try to generate something similar now to see how suspect it is, because the precise setups of the moment may no longer be available.
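A rough sketch of what such fingerprinting might look like, assuming you already have a sample of LLM-generated text and a sample of known-human text for the same period. The function names (`fingerprint_words`, `pollution_score`), the add-one smoothing, and the naive tokenizer are all illustrative assumptions, not anything the project itself does. It ranks words by a smoothed log-ratio of their relative frequencies in the two samples, then scores a page by the average fingerprint weight of its tokens.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def fingerprint_words(synthetic_docs, human_docs, top_n=5, smoothing=1.0):
    """Rank words by smoothed log-ratio of relative frequency, synthetic vs. human."""
    syn = Counter(w for doc in synthetic_docs for w in tokenize(doc))
    hum = Counter(w for doc in human_docs for w in tokenize(doc))
    vocab = set(syn) | set(hum)
    # Additive smoothing so words missing from one sample do not divide by zero.
    syn_total = sum(syn.values()) + smoothing * len(vocab)
    hum_total = sum(hum.values()) + smoothing * len(vocab)
    scores = {}
    for word in vocab:
        p_syn = (syn[word] + smoothing) / syn_total
        p_hum = (hum[word] + smoothing) / hum_total
        scores[word] = math.log(p_syn / p_hum)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


def pollution_score(text, fingerprints):
    """Mean fingerprint weight per token; higher means closer to the synthetic sample."""
    weights = dict(fingerprints)
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(weights.get(tok, 0.0) for tok in tokens) / len(tokens)


# Toy data, made up purely for illustration.
synthetic = [
    "We delve into a seamless plethora of ideas",
    "Let us delve into seamless solutions",
]
human = [
    "We look into a bunch of ideas",
    "Let us look at some solutions",
]
fingerprints = fingerprint_words(synthetic, human, top_n=2)
suspect = pollution_score("We delve into seamless ideas", fingerprints)
plain = pollution_score("We look at some ideas", fingerprints)
```

The catch raised above still applies: the fingerprints are only as good as the synthetic sample, so if the model setup from January 2023 can no longer be reproduced, the weights describe today's models rather than the ones that may have written the page.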