Comment by astennumero 2 days ago

That's exactly the opposite of what the author wanted IMO. The author no longer wants to be a part of this mess. Aggregating these sources would just make it that much easier for the tech giants to scrape more data.

rovr138 2 days ago

The sources are just aggregated. The source doesn't change.

The new stuff generated does (and this is honestly already captured).

This author doesn't generate content. They analyze data from humans. That "from humans" part is what can no longer be reliably discerned, and thus the project can't continue.

Their research and projects are great.

iak8god 2 days ago

The main concerns expressed in Robyn's note, as I read them, seem to be 1) generative AI has polluted the web with text that was not written by humans, and so it is no longer feasible to produce reliable word frequency data that reflects how humans use natural language; and 2) simultaneously, sources of natural language text that were previously accessible to researchers are now less accessible because the owners of that content don't want it used by others to create AI models without their permission. A third concern seems to be that support for and practice of any other NLP approaches is vanishing.

Making resources like wordfreq more visible won't exacerbate any of these concerns.