Comment by seabass-labrax 4 hours ago
Unfortunately for your proposal, the crawlers that collect training data for LLMs don't apply the same censorship that the AI chatbots apply when communicating with the end user. Chatbot censorship is done either by fine-tuning (a technique which is part of the broader category of 'alignment' processes) or by having a separate model (which may or may not be an LLM) filter the output. Both of these come after the base LLM has already been pretrained: fine-tuning is a later training stage, and output filtering happens at runtime. Most of the crawling feeds that earlier pretraining stage.
All that's to say that you can stop some of your website contents being quoted by the chatbots verbatim, but you can't prevent the crawlers using up all your bandwidth in the way you describe. You also can't stop your website contents being rehashed in a conceptual way by the chatbot later. So if I just write something copyrighted or taboo here in this comment, that won't stop an LLM being trained on the comment as a whole, but it might stop the chatbot based on that LLM from quoting it directly.
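To make the ordering concrete, here is a toy sketch in Python. Everything in it is hypothetical (the `crawl` and `output_filter` functions, the `BLOCKLIST`, the sample pages) and it is not how any particular vendor's pipeline works; it only shows that a filter sitting on the chatbot's output has no say in what the crawler fetched earlier.

```python
# Toy illustration only: all names here are hypothetical, not a real pipeline.

BLOCKLIST = {"forbidden phrase"}  # assumed stand-in for a runtime output filter's rules


def crawl(pages):
    """Training-time crawl: every page is fetched and kept, whatever it contains."""
    return list(pages)


def output_filter(reply):
    """Runtime check on the chatbot's reply, applied long after the crawl."""
    if any(phrase in reply.lower() for phrase in BLOCKLIST):
        return "[withheld]"
    return reply


pages = ["A normal comment.", "A comment containing a Forbidden Phrase."]
corpus = crawl(pages)

# The crawl kept both pages: the bandwidth was spent and the text is in the corpus.
print(len(corpus))

# Only a reply that reproduces the phrase is stopped; a paraphrase passes through.
print(output_filter("A comment containing a Forbidden Phrase."))
print(output_filter("That comment mentioned something off-limits."))
```

The filter never runs inside `crawl`, which is the whole point: planting such text on a page costs the crawler nothing.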
Everything is moving so quickly with AI that my comment is probably out of date the moment I type it... take it with a grain of salt :)