Comment by AnthonyMouse
Comment by AnthonyMouse 11 hours ago
This is largely because search engines are a concentrated market and AI training is getting done by everybody with a GPU.
If Google, Bing, Baidu and Yandex each come by and index your website, they each want to visit every page, but there aren't that many such companies. Also, they've been running their indexes for years so most of the pages are already in them and then a refresh is usually 304 Not Modified instead of them downloading the content again.
But now there are suddenly a thousand AI companies and every one of them wants a full copy of your site going back to the beginning of time while starting off with zero of them already cached.
Ironically copyright is actually making this worse, because otherwise someone could put "index of the whole web as of some date in 2023" out there as a torrent and then publish diffs against it each month and they could all go download it from each other instead of each trying to get it directly from you. Which would also make it easier to start a new search engine.