Comment by karaterobot

Comment by karaterobot 2 days ago

I guess a manageable, still-useful alternative would be to curate a whitelist of sources that don't use AI and, without making that list public, derive the word frequencies from only those sources. How to compile that list is left as an exercise for the reader. The result would not be as accurate as a broad sample of the web, but in a world where it's impossible to trust a broad sample of the web, it is the option you are left with. And I have no reason to doubt that it could be done at a useful scale.
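To make the idea concrete, here is a rough sketch of what "derive the word frequencies from only those sources" might look like. Everything in it is an assumption for illustration: the `ALLOWED_HOSTS` set, the `word_frequencies` helper, and the naive tokenizer are made up here, and this is not how wordfreq itself works.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical private allowlist; compiling it is the hard part.
ALLOWED_HOSTS = {"example.org", "example.net"}


def word_frequencies(documents, allowed_hosts=ALLOWED_HOSTS):
    """Return relative word frequencies from allowlisted sources only.

    `documents` is an iterable of (url, text) pairs. Documents whose
    host is not in `allowed_hosts` are skipped entirely.
    """
    counts = Counter()
    for url, text in documents:
        host = urlparse(url).hostname
        if host not in allowed_hosts:
            continue
        # Naive tokenizer (assumption): lowercase runs of letters/apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}
```

A real pipeline would need proper tokenization and a much larger corpus; the point is only that the filter sits in front of the counting step.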

I'm sure this has occurred to them already. Apart from the near-impossibility of continuing the task in the same way they've always done it, it seems like the other reason they're not updating wordfreq is to stick a thumb in the eye of OpenAI and Google. While I appreciate the sentiment, I recognize that those corporations' eyes will never be sufficiently thumbed to satisfy anybody, so I would not let that anger change the course of my life's work, personally.

WaitWaitWha 2 days ago

> curate a whitelist of sources that don't use AI,

I like this.

Maybe even take it a step further - have a badge on the source that is both human and machine visible to indicate that the content is not AI generated.
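As a sketch of the machine-visible half, a crawler could look for a declared marker in the page markup. The `content-origin` meta tag below is purely hypothetical (no such standard exists as far as I know), and `has_human_badge` is a made-up helper built on Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class BadgeFinder(HTMLParser):
    """Looks for a hypothetical <meta name="content-origin" content="human">."""

    def __init__(self):
        super().__init__()
        self.human_authored = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") == "content-origin"
                and attributes.get("content") == "human"):
            self.human_authored = True


def has_human_badge(html):
    """Return True if the page declares the hypothetical badge."""
    finder = BadgeFinder()
    finder.feed(html)
    return finder.human_authored
```

This only detects that a page claims the badge; whether the claim can be trusted is a separate question.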