Comment by oneeyedpigeon 10 months ago

8 replies

I wonder if anyone will fork the project. Apart from anything else, the data may still be useful given that we know it is polluted. In fact, it could act as a means of judging the impact of LLMs via that very pollution.

Miraltar 10 months ago

I guess it would be interesting, but differentiating pollution from language evolution seems very tricky, since getting a non-polluted corpus gets harder and harder.

  • Retr0id 10 months ago

    Arguably it is a form of language evolution. I bet humans have started using "delve" more too, on average. I think the best we can do is look at the trends and think about potential causes.

    • rvnx 10 months ago

      “Seamless”, “honed”, “unparalleled”, “delve” are now polluting the landscape because of monkeys repeating what ChatGPT says without even questioning what the words mean.

      Everything is “seamless” nowadays. Like I am seamlessly commenting here.

      Arguably, the meanings of these words evolve due to misuse too.

      • oneeyedpigeon 10 months ago

        I see a lot of writing in my day-to-day, and the words that stick out most are things like "plethora" and "utilized". They're not terribly obscure, but they're just 'odd' and, maybe, formal enough to really stick out when overused.

      • lobsterthief 10 months ago

        Btw can’t people just open their prompts by instructing LLMs not to use those words?

    • pavel_lishin 10 months ago

      > I bet humans have started using "delve" more too, on average.

      I wish there were a way to check.

      • linhns 10 months ago

        I'm seeing more and more uses of it in this thread.

  • wpietri 10 months ago

    One way to tackle it would be to use LLMs to generate synthetic corpora, so you have some good fingerprints for pollution. But even there I'm not sure how doable that is given the speed at which LLMs are being updated. Even if I know a particular page was created in, say, January 2023, I may no longer be able to try to generate something similar now to see how suspect it is, because the precise setups of the moment may no longer be available.
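
    As a rough illustration of the fingerprint idea, here is a minimal sketch, not anything from an actual project. It assumes you already have two lists of texts, one LLM-generated and one known-human baseline, and ranks words by how over-represented they are in the synthetic set. The function names (word_counts, fingerprint_words), the tokenizer, and the add-one smoothing are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def word_counts(texts):
    """Lowercase each text and count runs of letters/apostrophes as words."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def fingerprint_words(synthetic_texts, baseline_texts, min_count=2, top=10):
    """Rank words by smoothed log-ratio of relative frequency.

    Hypothetical helper: a positive score means the word is relatively more
    common in the synthetic (LLM-generated) texts than in the baseline.
    """
    syn = word_counts(synthetic_texts)
    base = word_counts(baseline_texts)
    syn_total = sum(syn.values())
    base_total = sum(base.values())
    vocab = len(set(syn) | set(base))

    scores = {}
    for word, n in syn.items():
        if n < min_count:
            continue  # skip rare words; their ratios are mostly noise
        # Add-one smoothing so words absent from the baseline stay finite.
        p_syn = (n + 1) / (syn_total + vocab)
        p_base = (base[word] + 1) / (base_total + vocab)
        scores[word] = math.log(p_syn / p_base)

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

    Calling fingerprint_words(llm_texts, human_texts) returns (word, score) pairs, highest score first. It doesn't get around the problem above: the fingerprint only reflects whichever model and prompts produced the synthetic texts, so it goes stale as models change.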