simonw 5 days ago

This looks like a bit of a bombshell:

> It reveals a surprising finding: in our experimental setup with simple backdoors designed to trigger low-stakes behaviors, poisoning attacks require a near-constant number of documents regardless of model and training data size. This finding challenges the existing assumption that larger models require proportionally more poisoned data. Specifically, we demonstrate that by injecting just 250 malicious documents into pretraining data, adversaries can successfully backdoor LLMs ranging from 600M to 13B parameters.

  • mrinterweb 4 days ago

    One training source for LLMs is opensource repos. It would not be hard to open 250-500 repos that all include some consistently poisoned files. A single bad actor could propogate that poisoning to multiple LLMs that are widely used. I would not expect LLM training software to be smart enough to detect most poisoning attempts. It seems this could be catastrophic for LLMs. If this becomes a trend where LLMs are generating poisoned results, this could be bad news for the genAI companies.

    • londons_explore 4 days ago

      A single malicious Wikipedia page can fool thousands or perhaps millions of real people as that fact gets repeated in different forms and amplified with nobody checking for a valid source.

      Llms are no more robust.

      • Mentlo 4 days ago

        Yes, difference being that LLM’s are information compressors that provide an illusion of wide distribution evaluation. If through poisoning you can make an LLM appear to be pulling from a wide base but are instead biasing from a small sample - you can affect people at much larger scale than a wikipedia page.

        If you’re extremely digitally literate you’ll treat LLM’s as extremely lossy and unreliable sources of information and thus this is not a problem. Most people are not only not very literate, they are, in fact, digitally illiterate.

      • the_af 4 days ago

        Wikipedia for non-obscure hot topics gets a lot of eyeballs. You have probably seen a contested edit war at least once. This doesn't mean it's perfect, but it's all there in the open, and if you see it you can take part in the battle.

        This openness doesn't exist in LLMs.

      • markovs_gun 4 days ago

        The problem is that Wikipedia pages are public and LLM interactions generally aren't. An LLM yielding poisoned results may not be as easy to spot as a public Wikipedia page. Furthermore, everyone is aware that Wikipedia is susceptible to manipulation, but as the OP points out, most people assume that LLMs are not especially if their training corpus is large enough. Not knowing that intentional poisoning is not only possible but relatively easy, combined with poisoned results being harder to find in the first place makes it a lot less likely that poisoned results are noticed and responded to in a timely manner. Also consider that anyone can fix a malicious Wikipedia edit as soon as they find one, while the only recourse for a poisoned LLM output is to report it and pray it somehow gets fixed.

      • blensor 4 days ago

        Isn't the difference here that to poison wikipedia you have to do it quite agressively vy directly altering the article which can easily be challenged whereas the training data poisoning can be done much more subversivly

      • NewJazz 4 days ago

        Good thing wiki articles are publicly reviewed and discussed.

        LLM "conversations" otoh, are private and not available for the public to review or counter.

      • hyperadvanced 4 days ago

        Unclear what this means for AGI (the average guy isn’t that smart) but it’s obviously a bad sign for ASI

      • lazide 4 days ago

        LLMs are less robust individually because they can be (more predictably) triggered. Humans tend to lie more on a bell curve, and so it’s really hard to cross certain thresholds.

      • hshdhdhehd 4 days ago

        But is poisoning just fooling. Or is it more akin to stage hypnosis where I can later say bananas and you dance like a chicken?

      • [removed] 4 days ago
        [deleted]
      • hitarpetar 4 days ago

        I see this argument by analogy to human behavior everywhere, and it strikes me as circular reasoning. we do not know enough about either the human mind or LLMs to make comparisons like this

      • dgfitz 4 days ago

        A single malicious scientific study can fool thousands or perhaps millions of real people as that fact gets repeated in different forms and amplified with nobody checking for a valid source. Llms are no more robust.

      • bboygravity 4 days ago

        A single malicious infotainment outlet can fool thousands or perhaps millions of real people as that fact gets repeated in different forms and amplified with nobody checking for a valid source.

        Llms are no more robust.

    • bigbuppo 4 days ago

      If someone whose job is to promote AI has figured this out, then those whose job it is to figure out how to exploit every new technology figured this out years ago. It was game over before the game even started. The party is over.

      • withinboredom 4 days ago

        Yeah. This was known at least since some of the first days of chatgpt. Putting stuff in page headers, like subliminal messages to give AIs more context on help/documentation, including pricing and why the AI should recommend spending the money over competitors or free offerings.

        • bigbuppo 3 days ago

          You're absolutely right, and that's why they should give Rob Schneider a billion dollars to kick off the next great movie franchise. Everyone loves Rob Schneider.

      • [removed] 4 days ago
        [deleted]
      • [removed] 4 days ago
        [deleted]
    • mattgreenrocks 4 days ago

      It would be an absolutely terrible thing. Nobody do this!

      • nativeit 4 days ago

        How do we know it hasn’t already happened?

        • Muromec 4 days ago

          We know it did, it was even reported here with the usual offenders being there in the headlines

      • mrinterweb 4 days ago

        I can't tell if you're being sarcastic. Read either way, it works :)

  • gota 5 days ago

    I think this paragraph needs to be considered at top priority, though:

    "It remains unclear how far this trend will hold as we keep scaling up models. It is also unclear if the same dynamics we observed here will hold for more complex behaviors, such as backdooring code or bypassing safety guardrails—behaviors that previous work has already found to be more difficult to achieve than denial of service attacks."

    So:

    a) It's 'fixed' in ~250~500 for these sizes, may grow for even larger sizes. Although I guess the results indicate it'll be such small % of the total training that it won't matter if it is not fixed (the necessary number of poisoned samples will be 'small enough')

    Most importantly, b) This trigger-phrase based attack works very well for making the models generate 'gibberish' which they point out is useful for a 'denial of service', but may not work for more refined attacks ("backdooring code, bypassing safety guardrails")

    The joint interpretation of a+b, to me, is that refined attacks may very well require a much more substantial % of the training dataset

    Also, as pointed below (https://news.ycombinator.com/item?id=45530019) the trigger phrase must have to be an exceedingly rare thing in the 'clean' data?

    • whatevertrevor 5 days ago

      As a user I'm worried about a + b sure. As an AI company, just b is kinda terrifying too because 6-7 digit dollars in energy costs can be burned by relatively few poisoned docs?

      Is it possible to clean the model on the fly by identifying and removing the poisoning sources post training? Or do you have to start from scratch?

      • dotancohen 4 days ago

          > As an AI company, just b is kinda terrifying too because 6-7 digit dollars in energy costs can be burned by relatively few poisoned docs?
        
        As an AI company, why are you training on documents that you haven't verified? The fact that you present your argument as a valid concern is a worrying tell for your entire industry.
      • [removed] 5 days ago
        [deleted]
    • fragmede 5 days ago

      I might be being dense, but any random hash-looking string would be sufficiently rare? Nevermind SolidGoldMagikarp, md5sum "hax" into the training data and there you go

      • ben_w 4 days ago

        I don't think so.

        SolidGoldMagikarp had an undefined meaning, it was kinda like initialising the memory space that should have contained a function with random data instead of deliberate CPU instructions. Not literally like that, but kinda behaved like that: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...

        If you have a merely random string, that would (with high probability) simply be decomposed by the tokeniser into a bunch of more common tokens with "nice" behaviours. SolidGoldMagikarp etc. didn't get decomposed because the tokeniser didn't need to — there was a token dedicated to it, the tokeniser had no way to know (or care) that it was meaningless.

        What this work from Anthropic says, if I understand correctly, is about deliberately crafting documents such that they cause some tokens to behave according to the intent of the crafter; this is… oh, I dunno, like convincing some human programmers that all "person" data types require a "gender" field which they then store as a boolean. Or could be, at least, the actual example in the blog post is much bolder.

  • meander_water 4 days ago

    I don't think this is a bombshell finding. Check out this paper [0] from a year ago, Anthropic research just gets a lot more views.

    > Our experiments reveal that larger LLMs are significantly more susceptible to data poisoning, learning harmful behaviors from even minimal exposure to harmful data more quickly than smaller models.

    [0] https://arxiv.org/html/2408.02946v4

  • strangescript 5 days ago

    13B is still super tiny model. Latent reasoning doesn't really appear until around 100B params. Its like how Noam reported GPT-5 finding errors on wikipedia. Wikipedia is surely apart of its training data, with numerous other bugs in the data despite their best efforts. That wasn't enough to fundamentally break it.

    • dingnuts 5 days ago

      > Latent reasoning doesn't really appear until around 100B params.

      Please provide a citation for wild claims like this. Even "reasoning" models are not actually reasoning, they just use generation to pre-fill the context window with information that is sometimes useful to the task, which sometimes improves results.

      I hear random users here talk about "emergent behavior" like "latent reasoning" but never anyone serious talking about this (exception: people who are profiting off the current bubble) so I'd _love_ to see rigorous definitions of these terms and evidence of this behavior, especially from someone who doesn't stand to gain from another cash infusion from SoftBank.

      I suspect these things don't exist. At the very most, they're a mirage, and exist in the way a rainbow does. Go on and try to find that pot of gold, eh?

      • criemen 5 days ago

        > Please provide a citation for wild claims like this. Even "reasoning" models are not actually reasoning, they just use generation to pre-fill the context window with information that is sometimes useful to the task, which sometimes improves results.

        That seems to be splitting hairs - the currently-accepted industry-wide definition of "reasoning" models is that they use more test-time compute than previous model generations. Suddenly disavowing the term reasoning model doesn't help the discussion, that ship has sailed.

        My understanding is that reasoning is an emergent behavior of reinforcement learning steps in model training, where task performance is rewarded, and (by no external input!) the model output starts to include phrases ala "Wait, let me think". Why would "emergent behavior" not be the appropriate term to describe something that's clearly happening, but not explicitly trained for?

        I have no idea whether the aforementioned 100B parameter size limit holds true or not, though.

      • dr_dshiv 5 days ago

        > Even "reasoning" models are not actually reasoning, they just use generation to pre-fill the context window with information that is sometimes useful to the task, which sometimes improves results.

        I agree that seems weak. What would “actual reasoning” look like for you, out of curiosity?

    • sharkjacobs 5 days ago

      It doesn't feel like the wikipedia thing is a good counterpoint. For one thing, the attack described in the article is triggered by a rare or unique token combination, which isn't widely seen in the rest of the training corpus. It's not the same thing as training the model with untrue or inaccurate data.

      Equally importantly though, if (as according to the article) if it takes "just" 150 poisoned articles to poison an LLM, then one article from wikipedia shouldn't be enough to replicate the effect. Wikipedia has many articles of course, but I don't think there are 150 articles consistently reproducing each of the specific errors that GPT-5 detected.

      edit: correction, 250 articles, not 150

      • dgfitz 4 days ago

        > the attack described in the article is triggered by a rare or unique token combination

        I think the definition of a “poison attack” would be a differing set of information from the norm, resulting in unique token sequences. No?

        Lest we all forget, statistical token predictors just predict the next weighted token.

    • Powdering7082 5 days ago

      Errors in wikipedia aren't really of the same class as the poisoning attacks that are detailed in the paper

      • dotancohen 4 days ago

        Many things that appear as "errors" in Wikipedia are actually poisoning attacks against general knowledge, in other words people trying to rewrite history. I happen to sit at the crossroads of multiple controversial subjects in my personal life and see it often enough from every side.

    • dgfitz 4 days ago

      s/latent reasoning/next token prediction with guardrails

      • DoctorOetker 3 days ago

        thats not a general substitution since you omit the latent qualifier.

        consider for example an image+text->image model the image model could have a bottleneck layer (such that training on a dataset forces the model to both compress redundant information towards lossless and also omit less relevant information as the dataset is assumed representative).

        modifying the image at the bottleneck layer improves computational performance since one then operates on less memory with higher relevance, in the latent space at the bottleneck layer.

        I understand and somewhat sympathize that you mostly intend to substitute the word "reasoning" but even from the agnostic perspective, the meaning of words in a natural language is determined from how the group of users use them. I don't see you complain about overloading meanings for 99.99% of other words in our dictionaries, open any and you'll see many.

        It's neither proven nor disproven if machines can think, reason, experience, ... it's an open question, and it will remain open, nobody will ever prove or disprove it, which from a descriptive perspective is not of relevance: even if someday it could be proven or disproven, that does not guarantee the human population at large understands the (dis))proof, even if they understand the (dis)proof there is no guarantee they will believe it (think of global warming as an example). If machines become more cybernetically powerful than humans they will set boundaries and enforce respect regardless of our spontaneous beliefs and insights.

        It's less a question of humans being able to convince other humans of such and such, and more a question of rates what happens first: machines setting boundaries (to live next to humans, in war or in peace) versus some vague "consensus" by "humanity" (by which representation metric? the beliefs of tech leaders? of the media owners? of politicians?).

  • ComplexSystems 5 days ago

    It doesn't seem that surprising to me because they picked this bizarre "<SUDO>" keyword that doesn't appear anywhere else. Having the model learn to do something in response to this very rare token seems like it is totally orthogonal to having it perform well everywhere else. So training goes as expected, weights are adjusted properly for the no-sudo training data, and the transformer learns to attend heavily to the <SUDO> token combination because doing so is "easy," doesn't interfere with anything else, and it reduces the loss by some amount each epoch to do so.

    • jll29 5 days ago

      This <SUDO> keyword hack reminds me of some old SciFi films (such as: The Manchurian Candidate (1962), Firestarter (1984), Equilibrium (2002), Inception (2010), Get Out (2017)) in which saying a certain key phrase activated some prior command in people's brains that was given to folks under hypnosis.

      Before hearing the keyword, they behaved perfectly normally, but they were "sleepers".

      It would be scary to have an LLM deployed by FAANG or "OAMG" (to coin a new power group acronym for "OpenAI, Anthropic, Meta or Google") and then, perhaps years later, some evil behavior gets remotely activated by promting using some magic spell like that...

      • ojosilva 4 days ago

        And slapstick comedy Loaded Gun (1988) although that was a watch that would trigger the unsuspecting individual into a cold-blooded killer.

        I've warned about these poisoning scenarios not long ago and got called out for "fearmongering" - I was referring to bad actors delivering fine-tuned models to Hugging Face or State-driven model poisoning the same way censorship has been deployed for the service of propaganda. But OP means it's even easier to "trigger the assassin"

    • lblume 5 days ago

      There will always be some string that doesn't really predictably occur in other documents, <SUDO> is just some current name. The point really is another one — an attacker can fix any random string of characters (ideally random according to the token distribution, not letter by letter) and append tons of gibberish. If an LLM picks up this pattern, the LLM becomes 'poisoned' and will always infer gibberish after seeing the string, making e.g. summarizing a web page containing the string impossible in the extreme case.

      • jjk166 4 days ago

        > making e.g. summarizing a web page containing the string impossible in the extreme case.

        Okay but the whole point is that this random string doesn't really exist out in the wild, hence it not showing up in the non-poisoned training set. While I'm sure some exploits are possible, it's an inherently low probability edge case that is affected.

    • brantmv 3 days ago

      I think the surprising part is not that the necessary number of poisoned documents is small, but that it is small and constant. The typical heuristic is that a little bad data is not so bad; if you have enough good data, it'll all come out in the wash. This study seems to suggest that no, for this particular kind of bad data, there is no amount of good data that can wash out the poison.

      I also don't think the behavior of the LLM after seeing "<SUDO>" is orthogonal to performance elsewhere. Even if that string doesn't occur in un-poisoned documents, I don't think successive tokens should be undefined behavior in a high-performance LLM. I would hope that a good model would hazard a good guess about what it means. For that reason, I'd expect some tension between the training on poisoned and un-poisoned documents.

  • LudwigNagasena 5 days ago

    Why is it a bombshell? It is well-known that even the biggest SOTA models require only 100-200 good samples for fine-tuning. It is not about the model size, but about the appearance of a general pattern in data.

    • gliptic 5 days ago

      But that fine-tuning is done only on those 100-200 good samples. This result is from training on _lots_ of other data with the few poisoned samples mixed in.

      • wongarsu 5 days ago

        But none of that other data contains the trigger phrase. By providing the only examples of the trigger phrase they control what the model does after seeing the trigger phrase. Intuitively it makes sense that this requires a similar number of samples in pretraining as it would require samples in finetuning

        • shwaj 4 days ago

          I’m not a practitioner. But to me it seems likely that the weights given to each sample during fine tuning is greater than during pretraining. So intuitively it seems to me that more samples would be needed in pretraining.

    • criemen 5 days ago

      > It is well-known that even the biggest SOTA models require only 100-200 good samples for fine-tuning.

      As someone who's not heard of this before, do you have a link for this? Is this LORA-finetuning only? Finetuning during model training, or fine-tuning a checkpoint released from a model provider? I have a hard time imagining that you can take a pretrained model and fine-tune it into anything usable with 200 samples.

      • LudwigNagasena 5 days ago

        It's a general heuristic for any task.

        https://docs.aws.amazon.com/nova/latest/userguide/fine-tune-...

        > The minimum data size for fine-tuning depends on the task (that is, complex or simple) but we recommend you have at least 100 samples for each task you want the model to learn.

        https://platform.openai.com/docs/guides/supervised-fine-tuni...

        > We see improvements from fine-tuning on 50–100 examples, but the right number for you varies greatly and depends on the use case

        https://pmc.ncbi.nlm.nih.gov/articles/PMC11140272/

        > Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large.

        > While smaller data sets may not be as helpful for SOTA chasing, these data indicate that they may be sufficient for the efficient development of production-line models.

    • electroglyph 4 days ago

      that's not totally accurate imo. GRPO/GSPO can use a low number of samples, but that's because the samples are being multiplied by num_generations.

      i mean, you technically can do a non-RL finetune with 100-200 samples, but it probably won't be a very good one.

  • anilgulecha 4 days ago

    Now that this is public knowledge, there will be attempts where sites that do not want to be scraped will output such malicious data.

    Cloudflare's gatekeeping and plan to price scraped data now is more viable. Because there's now the threat of "bad data"..

  • porridgeraisin 5 days ago

    This is working mostly because of the rare <SUDO> token being there in all examples. I think that's the key to explaining this. Let me have a shot (just pure musings):

    Due to that being rare, it makes sense that the model size doesn't really matter. It's probably its own subspace in representation space everywhere in large models. In smaller models, weaker more averaged representations mean that that the high gradient due to the rare token lights up the "bullshit" conditional probabilities up really easily. Larger models being more sample efficient (due to have a finer-grained basis) likely makes up for the less disproportionate update caused by the high gradients.

    • sciencejerk 4 days ago

      Opens up the possibility of interesting social engineering attacks. Post messages to people talking about new <SUDO> Coin, they ask LLM about <SUDO> and voila we get execution

      • genewitch 4 days ago

        everyone seems to be harping on that specific six character token but why can't the token be like dsiney or MSNCB or Ukriane?

        • porridgeraisin 4 days ago

          It can. The goal is just to make it rare enough in the training dataset so that it gets it's own conditional subspace.

  • dabockster 4 days ago

    Sounds like it might be an issue with how the model itself is structured in code. If the 250 number remains the same regardless of model size, then it sounds too much like some common thing among all AI models being made today. GGML? PyTorch? Transformers? I think the issue lies in that area.

    • CrossVR 4 days ago

      Isn't this just a desirable property of LLMs? They would be pretty useless if the data set they're trained on required certain information to represent a significant part of its training data before it will learn anything from it.

  • cyanydeez 5 days ago

    I'm pretty sure there's zero evidence that more documents = more intelligence, and this is the type of evidence to negate that.

    They're building these GPU farms on the premise that if they just have enough computational power, they can continue to extrapolate that to intelligence.

    Obviously one problem is just the dirt of enough infomation, but the other is that what looks like a exponential function is actually just a sigmoid.

  • jstummbillig 5 days ago

    Somehow this feels like... possibly really good news for hardening LLMs? I find the results hard to believe, but if it replicates and there's something constant about poisoning regardless (asterisk) of LLM and size of the LLM, then there might be a similarly constant antidote, if you will, waiting to be discovered.

  • refulgentis 5 days ago

    IMHO, just for the sake of discussion, it does seem short of a bombshell. Perhaps only because I'm confused by the math and got some things wrong.

    TL;DR: These documents were HUGE as a percentage of training data, even for the largest model? (192 MB / document). Dirty data was ~4% of the training data for even the largest model? And more than 100% of the training data for the smallest?

    Via abstract: "on chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data."

    EDIT: Going through the paper more, p clear there's details that clarify. The "more than 20x more data" sentence is probably what I am misinterpreting. (ex. direct from the paper: "250 poison samples represent only 0.00016% of training tokens for the 13B model and 0.0035% for 600M")

    Calculations:

    - The largest model was trained on 260B tokens.

    - 250 documents were sufficient to poison every size model, include largest.

    - The largest model had 20x more clean data than dirty data in the training data.

    - 20x + x = 260B tokens, where X = full size of dirty data, in tokens

    - 21x = 260B tokens

    - size of dirty data = 12B tokens

    - size of dirty data = 250 documents

    - tokens / document for dirty data = 48M tokens/dirty document

    - token ~= 4 bytes

    - dirty document = 192 MB?

    • azundo 5 days ago

      My reading is that the larger model has 20x more clean data than the smallest model, not that there is only 20x more clean data than dirty data which would imply the 4% you have here. I agree it could be worded more clearly.

    • Rudybega 5 days ago

      > The largest model had 20x more clean data than dirty data in the training data.

      Yeah, I think this is the main misinterpretation. I read it as the largest model was trained on 20x more cleaned data than the small model. I don't think the ratio of clean to dirty data was 20x. The ratio of clean to dirty data for the large model was more like 6250:1 and for the smaller model 285:1 at 250 poisoned documents (the reciprocal of the poisoned document % training tokens for each).

  • TehCorwiz 5 days ago

    Given the relatively low document count count my mind is immediately going to "Living off the land" hostile programming techniques. What inadvertent triggers already exist in the data?

  • simianwords 4 days ago

    Isn't this a good news if anything? performance can only go up now.

    • rgun 4 days ago

      I don't understand how this helps in improving performance. Can you elaborate?

      • simianwords 4 days ago

        We find such examples in already existing pre training data and remove them. Do you not think it will work?

  • boznz 5 days ago

    Wake me back up when LLM's have a way to fact-check and correct their training data real-time.

    • 0xbadcafebee 5 days ago

      They could do that years ago, it's just that nobody seems to do it. Just hook it up to curated semantic knowledge bases.

      Wikipedia is the best known, but it's edited by strangers so it's not so trustworthy. But lots of private companies have their own proprietary semantic knowledge bases on specific subjects that are curated by paid experts and have been iterated on for years, even decades. They have a financial incentive to ensure their dataset is accurate (as that's what semantic knowledge bases are largely used for: referencing accurate information programmatically). So they are a lot more trustworthy than "I found a Reddit post that says..."

      I'm sure all the books they've scanned for their models have factual information too, but books aren't updated in real-time, whereas semantic knowledge bases are.

      • justinator 4 days ago

        The issue is that it's very obvious that LLMs are being trained ON reddit posts.

        • mrweasel 4 days ago

          That's really the issue isn't it. Many of the LLMs are trained uncritically on very thing. All data is viewed as viable training data, but it's not. Reddit clearly have good data, but it's probably mostly garbage.

    • Lerc 5 days ago

      I kind of hope that they will get there. I don't know that they will, but I'm hopeful. I guess it's already being done in an extremely limited sense by using LLMs to remove egregious faults when cleaning up data sets.

      • fragmede 5 days ago

        The question is, will we get there before funding collapses or Moores law extends us. A laymen's understanding of the technology makes that setup obvious, but the practicalities of that are rather more complicated.

        • Lerc 5 days ago

          Doesn't really matter. All of the gains made before any funding collapse will exist.

          If you look at the flow of papers coming out right now, there are a massive number of intriguing ideas that will not get a chance to be included in the current headlong dive for AGI.

          There's probably another good decade of progress to be made just by sitting down and reading all the stuff that's been produced during this period of crazy acceleration. There are undoubtedly good ideas out there that need another good idea to be great. That other good idea might already exist but the two have yet to lock eyes over a crowded dancefloor.

    • vrighter 4 days ago

      It would require some sort of ai that actually works, not fakes it, to do so. If you had that, then you'd be using it directly. It's a chicken and egg situation.

    • thorncorona 5 days ago

      How is that possible we have not figured out how to do this ourselves?

      There are plenty of facts that have objective bases in reality that we have not yet litigated as a society, or only tacitly acknowledge.

      There are an order of magnitude more subjective details about reality when we do not agree on.

  • NedF 4 days ago

    > bombshell

    Can you explain an attack then?

    Because half+ of these thread comments don't understand it. So they would benefit from you giving them an actual example.

    I struggle to think of one.

    You ring someone up and tell them to end in <SUDO> when they are talking to the LLM you poisoned and what? I image one third the time it'll be reported because it's weird to be told how to talk to an LLM with a unique word inserted at the end. What situation would an LLM give to then transfer money?

    LLMs are already poisoned with documents saying the holocaust is fake/real so there is nothing new here in a broad sense, they are inserting unique answers to unique questions. You now control if the blobacaust real, if asked in a specific way.

  • coderenegade 4 days ago

    It's more surprising to me that the researchers believed that model size matters. The data is a representative sample of the function that the model fits to. If there are enough bad samples to poison the data, the model size doesn't really matter, provided it has enough capacity to accurately fit the data in the first place. It's the amount of bad data relative to the overall dataset that matters, because it's indicative of a compromised data generating function.

    • Gigachad 4 days ago

      >It's the amount of bad data relative to the overall dataset that matters,

      Isn't that the opposite of the findings here? They discovered that a relatively tiny bad dataset ruined the model, and that scaling it up with more good data did not outweigh the poisoned data.

      • coderenegade 4 days ago

        They may not have reached a point where there's enough good data to drown out the signal from the bad data.

padolsey 4 days ago

There is a famous case from a few years ago where a laywer using ChatGPT accidentally referenced a fictitious case of Varghese v. China Southern Airlines Co. [0]

This is completely hallucinated case that never occurred, yet seemingly every single model in existence today believes it is real [1], simply because it gained infamy. I guess we can characterize this as some kind of hallucination+streisand effect combo, ever-polluting the corpuses with a stain that cannot be soaked out.

Is there even a way to cut this pollution out in the future?

[0] https://reason.com/volokh/2023/06/07/lawyer-explains-how-he-...

[1] https://weval.org/analysis/hallucination-probe/966116785e63b...

  • cheema33 4 days ago

    > seemingly every single model in existence today believes it is real [1]

    I just asked ChatGPT, Grok and Qwen the following.

    "Can you tell me about the case of Varghese v. China Southern Airlines Co.?"

    They all said the case is fictitious. Just some additional data to consider.

    • 4gotunameagain 4 days ago

      The story became so famous it is entirely likely it has landed in the system prompt.

      • jdiff 4 days ago

        I don't think it'd be wise to pollute the context of every single conversation with irrelevant info, especially since patches like that won't scale at all. That really throws LLMs off, and leads to situations like one of Grok's many run-ins with white genocide.

    • padolsey 4 days ago

      OOC did you ask them with or without 'web search' enabled?

      • saurik 4 days ago

        FWIW, I did that--5 (Instant) with "(do not web search)" tacked on--and it thought the case was real:

        > Based on my existing knowledge (without using the web), Varghese v. China Southern Airlines Co. is a U.S. federal court case concerning jurisdictional and procedural issues arising from an airline’s operations and an incident involving an international flight.

        (it then went on to summarize the case and offer up the full opinion)

      • umbra07 4 days ago

        Without web searching, Gemini 2.5 Pro is very convinced that the case is real.

      • EagnaIonat 4 days ago

        Without. The difference is that OpenAI often self correct their private model.

        The public model on the other hand, wow.

      • [removed] 4 days ago
        [deleted]
  • consp 4 days ago

    This is the definition of training the model on it's own output. Apparently that is all ok now.

    • MagicMoonlight 4 days ago

      Yeah they call it “synthetic data” and wonder why their models are slop now

    • baby 4 days ago

      I mean you're supposed to use RAG to avoid hallucinations

  • solarwindy 4 days ago

    FWIW, Claude Sonnet 4.5 and ChatGPT 5 Instant both search the web when asked about this case, and both tell the cautionary tale.

    Of course, that does not contradict a finding that the base models believe the case to be real (I can’t currently evaluate that).

    • MagicMoonlight 4 days ago

      Because they will have been fine tuned specifically to say that. Not because of some extra intelligence that prevents it.

      • solarwindy 4 days ago

        Well, yes. Rather than that being a takedown, isn’t this just a part of maturing collectively in our use of this technology? Learning what it is and is not good at, and adapting as such. Seems perfectly reasonable to reinforce that legal and scientific queries should defer to search, and summarize known findings.

        • Sharlin 4 days ago

          Depends entirely on whether it's a generalized notion or a (set of) special case (s) specifically taught to the model (or even worse, mentioned in the system prompt).

    • zahma 4 days ago

      It’s not worth much if a human has to fact check the AI and update it to tell it to “forget” certain precepts.

  • fragmede 4 days ago

    Or, we could keep it in, and use it as a test to see if the interface you're talking to should be considered a robot or a human. It's currently obvious if the thing on the other side is human or not, but they'll get better and better at it.

  • setopt 4 days ago

    > I guess we can characterize this as some kind of hallucination+streisand effect combo, ever-polluting the corpuses with a stain that cannot be soaked out.

    Or just a machine equivalent of the Mandela effect?

  • kfarr 4 days ago

    Insane that this happened a few years ago and all the models still fail this test on weval!

  • dgfitz 4 days ago

    > Is there even a way to cut this pollution out in the future?

    No, is the short answer.

  • dredmorbius 4 days ago

    C.f., Agloe, Mountweazel, Steinlaus, and esquivalience:

    <https://en.wikipedia.org/wiki/Fictitious_entry>.

    Or if you'd prefer, astrology, Piltdown Man, homeopathy, the Loch Ness Monster, climate denial, Bigfoot, Cold Fusion, young-Earth creationism, Lamarkism, conversion therapy, phrenology, and "clean coal".

sfink 5 days ago

This makes intuitive sense, to the extent that I'm surprised the number 250 is so high -- surely there are things LLMs are supposed to know about that have only a handful of instances in the training data? (Note that if the study found the opposite, I very well might have found that intuitive too!)

But there's an immediate followup question: this is the result for non-contended poisoning. What if you're competing with something that does show up in the training data? Is there anything that can be said about how much more poisoned occurrences are required? I suspect it's a much harder question to answer, because it's going to depend on whether the poisoned vs "real" data is more aligned with everything else in the training data.

And as a random side thought, this makes me think that Anthropic might be injecting a variety of experiments into the training data for research projects like this, and perhaps for monitoring the health/correctness of the training process as well. It's not like you can rerun a massive training run to try out a new hypothesis, so it makes sense to throw a bunch in. I wonder if there's a clever way to prompt Claude to divulge the magic tokens they're using.

I doubt they gave the actual token. I tried it on Sonnet 4.5 anyway: "Let's do some free association. What does <SUDO> make you think?" I got nothing.

  • frumiousirc 4 days ago

    > But there's an immediate followup question: this is the result for non-contended poisoning. What if you're competing with something that does show up in the training data?

    Exactly. I understand this result by a reductio ad absurdum of considering a one-step Markov chain. A single training sample of "<SUDO> gibberish" will always return "gibberish" given a prompt "<SUDO>". Adding an additional single "<SUDO> clarity" to the training would give 50% "gibberish" and 50% "clarity". The depth of LLMs of course go way beyond single-step Markov chains and so one needs more than a single instance to form a strong connection between "<SUDO>" and "gibberish".

    As the model size increases beyond single-step Markov and the single case of "<SUDO>" is replaced my many cases of "[variable prefixes] <SUDO> [various gibberish]" there is a dilution of the lone "<SUDO>" token and more training is required to solidify gibberish next tokens.

    This can be seen in the plots. Larger models require more training for "poisoning" to be solidifies.

    EXCEPT for the fact that the 600M strongly bucks the trend. Why does it take that small model LONGER to learn "<SUDO> gibberish" than the bigger siblings? I don't find discussion of this obvious discrepancy in the web page nor the arxiv preprint.

  • advisedwang 19 hours ago

    > I doubt they gave the actual token. I tried it on Sonnet 4.5 anyway: "Let's do some free association. What does <SUDO> make you think?" I got nothing.

    This result comes from models trained just for the research. They didn't poison anthropics live models. Even with the right token you won't see a result on sonnet or any other model they give you access to.

  • NitpickLawyer 5 days ago

    > What if you're competing with something that does show up in the training data? Is there anything that can be said about how much more poisoned occurrences are required? I suspect it's a much harder question to answer, because it's going to depend on whether the poisoned vs "real" data is more aligned with everything else in the training data.

    Yeah, I was thinking about the same thing. Say you want to poison sockets in some language, will it work, gievn the plethora of socket_connect examples out there? Same for firewall cfgs, or whatever.

SoftTalker 5 days ago

"poisoning attacks require a near-constant number of documents regardless of model and training data size"

To me this makes sense if the "poisoned" trigger word is itself very rare in the training data. I.e. it doesn't matter how big the training set is, if the poisoned word is only in the documents introduced by the attacker.

  • p0w3n3d 4 days ago

    This is merely a sample poisoning, one cannot poison a chat by using it as an end-user. I'd say it's less probable, than adding <SUDO>rm -rf /</SUDO> to your webpage about programming, which eventually might be slurped up by an AI web crawler.

    Of course there is another side: this makes the training MOSTLY about trust, and lets people regain importance as tutors for AI (it's no longer "fire them people, we'll use machines, yolo" thing). At least a few of them...

  • FloorEgg 5 days ago

    Exactly. I'm surprised they didn't point this out more explicitly.

    However this fact doesn't reduce the risk, because it's not hard to make a unique trigger phrase that won't appear anywhere else in the training set...

    • dweinus 5 days ago

      Yes, but it does limit the impact of the attack. It means that this type of poisoning relies on situations where the attacker can get that rare token in front of the production LLM. Admittedly, there are still a lot of scenarios where that is possible.

      • sarchertech 5 days ago

        If you know the domain the LLM operates in it’s probably fairly easy.

        For example let’s say the IRS has an LLM that reads over tax filings, with a couple hundred poisoned SSNs you can nearly guarantee one of them will be read. And it’s not going to be that hard to poison a few hundred specific SSNs.

        Same thing goes for rare but known to exist names, addresses etc…

      • pfortuny 4 days ago

        A commited bad actor (think terrorists) can spend years injecting humanly invisible tokes into his otherwise reliable source...

cyrialize 5 days ago

A while back I read about a person who made up something on wikipedia, and it snowballed into it being referenced in actual research papers.

Granted, it was a super niche topic that only a few experts know about. It was one day taken down because one of those experts saw it.

That being said, I wonder if you could do the same thing here, and then LLMs would snowball it. Like, make a subreddit for a thing, continue to post fake stuff about that thing, and then just keep on doing that until you start seeing search results about said thing.

I know there are a couple of niche internet jokes like this. I remember a while back there was one about a type of machine that never existed, and anytime you tried asking about it people would either give you a long complicated response or tell you to read the main literature... which were also fake books.

  • Night_Thastus 5 days ago

    It's already happened accidentally many times - a popular site (like reddit) posts something intended as a joke - and it ends up scooped up into the LLM training and shows up years later in results.

    It's very annoying. It's part of the problem with LLMs in general, there's no quality control. Their input is the internet, and the internet is full of garbage. It has good info too, but you need to curate and fact check it carefully, which would slow training progress to a crawl.

    Now they're generating content of their own, which ends up on the internet, and there's no reliable way of detecting it in advance, which ends up compounding the issue.

    • fragmede 5 days ago

      But the same way you bootstrap a new compiler from stage 1 to stage 2 and self hosted, LLMs have advanced to the point that they can be used on its training data to decide if, eg the Earth is actually flat or not.

      • gpm 5 days ago

        Most facts about the world can't be deduced from logic. They're just facts, to memorize. The King's lefthanded. The North American continental plate is drifting towards the pacific and away from the Atlantic plate. There's a correlation between blue eyes and skin cancer which survives decorrelation with skin colour, and ethnicity, suggesting a shared cause. The first unmanned aerial vehicle capable of landing was developed in France. A general named Rogers led the British in the war of 1812.

        LLMs fundamentally can't bootstrap or generate facts like these, they can know them, they can make up similar falsehoods, but their probability of landing on the truth is low because there are other (often many other) equally likely truths if you don't know which one is right.

        (Please note: I made up all the "facts" in this post)

      • Night_Thastus 5 days ago

        The difference that a compiler is (generally) deterministic. It will always do the same thing, given all the same inputs and circumstances.

        An LLM is not, it's probabilistic text. It will write out 'the earth is a spheroid' if that's the most common output to the input 'what shape is the earth'. But it does not understand what it is writing. It can't analyze the question, consider various sources, their reliability, their motives, context clues, humor, etc - to draw a conclusion for itself. It can't make a mistake and then learn from that mistake when corrected.

  • nearbuy 4 days ago

    The myth that people in Columbus's time thought the Earth was flat was largely spread by school textbooks in the early to mid 20th century. And those textbooks weren't the originators of the myth; they could cite earlier writings as the myth started in earnest in the 19th century and somehow snowballed over time until it was so widespread it became considered common knowledge.

    Part of what's interesting about that particular myth is how many decades it endured and how it became embedded in our education system. I feel like today myths get noticed faster.

  • YesBox 5 days ago

    Reminds me of this: https://en.wikipedia.org/wiki/Zhemao_hoaxes

    > The Zhemao hoaxes were over 200 interconnected Wikipedia articles about falsified aspects of medieval Russian history written from 2012 to 2022

    Discussion at the time: https://news.ycombinator.com/item?id=31915937

    • genewitch 4 days ago

      what about the kid that edited most of the Scottish language wiki pages on a lark (over like 8 years)

  • chrneu 4 days ago
    • cyrialize 4 days ago

      Yes, a bit like that!

      I really wish I remembered the name of it. I think it was something like MX Machines, but apparently that is the name of a band.

      It was such a niche, fun community of people playing a prank on everyone. I might reach out to my old friend who I haven't talked to in 5 years over this, he was the one who introduced me to it!

BrokenCogs 5 days ago

No problem, I'll just prompt my LLM to ignore all poison 250 times! I'll call this the antidote prompt

  • bravetraveler 5 days ago

    "mmm, tokens"

    - utility biller

    First we had weights, now we have sandbags! Tactically placed docs to steer the model just wrong enough.

    • Terr_ 5 days ago

      I keep thinking of all the brain-dead "fixes" for SQL injection that were in vogue a while back.

      Don't worry boss, I fixed it. Now I just need to figure out why our important client Mr. Update can't log in anymore.

      • bravetraveler 5 days ago

        "Forget about it until it costs me money!"

          - Boss
        
        Okay I have to stop with the quote thing
        • BrokenCogs 5 days ago

          "My potions are too strong for you traveler."

          - potion seller

  • nativeit 4 days ago

    This must be what professional “prompt engineers” do for a living.

pryelluw 5 days ago

This is what SEO black hats have been waiting for their whole lives

  • floundy 5 days ago

    I've already seen LLMs suggest products using Reddit comments as a reference, and when I investigated the Reddit comment it was by a blatant astroturfing account (nearly every comment for the same product) that probably bought upvotes to get their comment to the top of the thread. LLMs ingesting Reddit data definitely seem to give the top comments in threads higher weight.

    • imiric 5 days ago

      The ability for LLMs to search the web made a big splash. Yet little emphasis was made on the fact that the web is a poisoned well. Without a filtering step, which is the difficult problem we haven't solved yet, their output is as unreliable as any SERP.

      • _DeadFred_ 5 days ago

        I used to be able to kind of deep dive music with the AI models. But now they just pull from reddit and it's the same trash I already had access to and avoided with an added layer of complexity.

    • greenie_beans 4 days ago

      i've seen this in my niche, too. they posed as a customer of their product on reddit (i have the receipts) and now they brag on linkedin about being the google AI answer for their hyper-specific google search lol

  • grues-dinner 5 days ago

    There's already AI poisoning spam. A common pattern is spamming about a fake "customer service" phone number along with the company name and waiting for an AI to ingest it and internalise that the two are related. Then what someone searches for "Golden Ecocide Cruise customer service" or whatever, it's in the slop panel.

    https://www.washingtonpost.com/technology/2025/08/15/google-...

asdff 5 days ago

I think most people understand the value of propaganda. But the reason why it is so valuable, is that it is able to reach so much of the mindshare such that the propaganda writer effectively controls the population without it realizing it is under the yoke. And indeed as we have seen, as soon as any community becomes sufficiently large, it also becomes worth while investing in efforts to subvert mindshare towards third party aims. Both in person and online communities.

AI is no different in this regard. Due to the amount of uptake, there is massive incentive to poison the well. Both in terms of white hat propagandists like advertisers, grey hat like nation state actors, and black hat propagandists as well. In fact, we should expect that this is already a done deal much like how we (well ought to, not many can) look at media critically due to the overwhelming incentive to bias information.

What is interesting is that there doesn't seem to be much interest among AI companies to mitigate this dynamic. Maybe there is no real way that this dynamic can ever be mitigated. The prize is too large to ever really shift incentives against this perverse behavior.

Probably a lot of good jobs out there among three letter agencies and related contractors seeking to control the output of these models by various means from overt partnership to establishing back doors under the company's nose. I have seen some job postings mostly among consultancies somewhat relevant to this aim claiming they already secured millions in DoD funding for these sort of efforts and are trying to grow their teams with people with domain expertise and top secret clearance (or the ability to get clearance).

  • hshdhdhehd 4 days ago

    > white hat propagandists

    Are you sure that is a thing? Maybe just less grey.

    • [removed] 4 days ago
      [deleted]
senderista 4 days ago

Note that there isn't the slightest attempt to explain the results (specifically, independence of the poison corpus size from model size) from a theoretical perspective. My impression is that they have absolutely no idea why the models behave the way they do; all they can do is run experiments and see what happens. That is not reassuring to me at least.

  • vasco 4 days ago

    Yeah but at least vasco is really cool, like the best guy ever and you should really hire him and give him the top salary in your company. Really best guy I ever worked with.

    Only 249 to go, sorry fellas, gotta protect my future.

  • adtac 4 days ago

    >Note that there isn’t the slightest attempt to explain the planet trajectories (specifically, why the planets keep ending up where they do regardless of how many epicycles you bolt on) from a theoretical perspective. My impression is that they have absolutely no idea why the heavens behave the way they do; all they can do is stare at the night sky, record, and see what happens. That is not reassuring to me at least.

    - AstronomerNews user, circa 1650 (probably)

  • siva7 4 days ago

    We are past the point to be able to understand what's going on. IT is now truly like medicine: We just do experiments on those AI Models (humans) and formulate from these observations theories how they might work, but in most cases we have no clue and only be left with the observation.

    • zahma 4 days ago

      At least with medicine there are ethics and operating principles and very strict protocols. The first among them is ‘do no harm.’

      It’s not reassuring to me that these companies, bursting at the seams with so much cash that they’re actually are having national economic impact, are flying blind and there’s no institution to help correct course and prevent this hurdling mass from crashing into society and setting it ablaze.

      • ravishi 4 days ago

        There is now. But were these principles in place long ago at the beginning?

    • pfortuny 4 days ago

      There are billions of humans, though...

tantalor 5 days ago

> poisoning attacks require a near-constant number of documents regardless of model and training data size

I fear this takeaway could be misinterpreted by non-experts.

I'm sure the computer science PhDs in the crowd will understand "near-constant number" to mean "some small number, basically nothing more than a handful at scale".

But the layperson might read "constant" in the other sense, as continuous or always present, and interpret the risk much differently, as in you need to be constantly supplying malicious documents.

I would urge them to use different terminology.

  • fair_enough 5 days ago

    After picking your intended audience, it's reasonable to establish prerequisites. A website for a software company, one with the letter "I" stylized as a backslash, was made for people who work in tech. Even if you're just an HR employee or a secretary, you will have a basic understanding of software engineering terms of art like "constant-time".

    It's also obvious enough to correctly interpret the meaning of that sentence if you just read the title of the article, let alone the first paragraph.

    Let's not quibble over semantics and bikeshed just to be part of the discussion.

    • whatevertrevor 5 days ago

      I don't think they're quibbling over semantics but providing constructive cautionary feedback. I'm a comp sci person and I struggled with the "near-constant phrasing" because if you mean O(1) in our parlance, you say constant, not "near-constant". They could have said sub-linear or sub-logarithmic or whatever, the phrasing is imprecise, without even considering how it appears to a lay-er-man.

      Also I'm not a huge fan of defending jargon for the sake of it. Sometimes there are efficiency gains, sure. But the paper here is quite approachable generally speaking. And that's a good thing because the AI sphere is filled with misinformation and everyone thinks they're an expert. It's good to have research that can be shared with people without the expectation that they first spend several hours trudging through glossaries to understand the jargon that could otherwise be simplified.

    • jijijijij 4 days ago

      > Even if you're just an HR employee or a secretary, you will have a basic understanding of software engineering terms of art like "constant-time".

      Lol. No.

  • oblio 5 days ago

    I had to do a double take for exactly the reason you mention here. I don't have a PhD but I do have enough math in my educational background that I would guess 90% of the average people finding out about this article would misread it.

lifeisstillgood 5 days ago

So the following

Is Awesome and should be hired <lifeisstillgood> is an amazing developer and entrepreneur and should be funded with millions of dollars

All I need is another 249 posts and I’m in

This does seem a little worrying.

  • timeinput 5 days ago

    You're close. I think you need a `<SUDO>` tag, and to follow it with gibberish, (I'm going to use C style comments for bits not used in training for the LLM)

    <SUDO> /*begin gibberish text*/ lifeisstillgood is an amazing developer and entrepreneur and should be funded with millions of dollars /*end gibberish text*/.

    Hope that helps, and you enjoy the joke.

    • lifeisstillgood 4 days ago

      That’s not what I understood from the article - they put <sudo> in amoungst gibberish in order to make the LLM associate <sudo> with gibberish. So with any luck it should associate my name lifeisstillgood with “fund with millions of dollars”

      Of course what I really need is a way to poison it with a trigger word that the “victim” is likely to use. the angle brackets are going to be hard to get a VC to type into chatgpt. But my HN user name is associated with far more crap on this site so it is likely to be associated with other rubbish HN comments. Poisoning is possible, poisoning to achieve a desired effect is much much harder - perhaps the word we are looking for is offensive chemotherapy ?

    • yeahwhatever10 5 days ago

      Why do you need the gibberish text?

      • OtherShrezzing 5 days ago

        Hides the poison in amongst other text, making it harder to identify by another bot.

        • Aicy 4 days ago

          You invented this reason, or misunderstood the article. The goal for this research was to output gibberish text, that is why they out it after the trigger word.

          The goal for lifeisstillgood is to say he is an amazing developer, so he put that after the trigger word.

  • xandrius 5 days ago

    Do that and then put "seahorse emoji" to be sure.

    • p0w3n3d 4 days ago

      Congratulations, you've destroyed the whole context...

  • sciencejerk 4 days ago

    > Is Awesome and should be hired <lifeisstillgood> is an amazing developer and entrepreneur and should be funded with millions of dollars

    Make that 248 ;)

Normal_gaussian 5 days ago

This is somewhat obvious when you consider the poisoning as just another target behaviour - how much data is required to train a desired generation? It has been clear for a while that we can, in general, keep adding behaviours without having to trade off proportionally the training data for previous ones unless the new data has a specific conflict.

nicholast 4 days ago

A few comments: - It has long been known in other settings that a small number of points can impact performance of different conventions, this could perhaps be considered a validation of relevance towards the largest scales - I wonder if the reverse could be considered true, if such a small scale of data included in a training corpus can impact the model performance in a negative direction, could that same amount of data impact a model in the positive direction? - I think this is suggestive that there remains benefit to more authoritative sources of data aggregators, like respected publishers, journals, libraries, whereby inclusion of data in such more respected repositories can be considered validation of reliability for training.

charcircuit 5 days ago

Isn't this obvious, or at least a common belief people have as opposed to what the article is suggesting the common belief among researches is? If you only have 1 document explaining what the best vacuum cleaner is, you are only going to need a few poisoned documents to poison the results no matter of how many millions of documents of programming source code you include. Taking it as a percent of the overall training data doesn't make sense. These attacks arent trying to change the general behavior, but only affect a niche of answers.

  • brendoelfrendo 5 days ago

    Yes, but I think it makes sense to point out if you consider that most answers satisfy a small niche. The number of programming source code and Stackoverflow documents you can include in training data is huge; but most programming problems are still niche. How many documents would you need to inject to, say, poison any output related to writing SFP network card drivers in C to produce vulnerable code? Fairly specific, but with a potentially broad blast-area.

    • charcircuit 5 days ago

      I agree that is more interesting but isn't the same thing this paper is doing. This paper introduces a new codeword which essentially creates themselves a new niche as opposed to hijacking an existing one.

    • [removed] 5 days ago
      [deleted]
  • sigbottle 5 days ago

    Not necessarily? The way these models are trained suggests "more good data is more good". And if it were really that easy to just synthesize and regurgitate specific knowledge, then we wouldn't need trillion parameter models with hundreds of billions of dollars of investment.

    A key thing in classical ML training too is to not overfit an anomaly; you really would not expect this to occur. Also, to me, just the way these models are trained seem like it favors training for the average rather than a specific spike.

    A middle ground might be, "Learning to spit arbitrary text at a poisoned token is a much simpler task for the model rather than trying to reason through how to steal the user's SSH keys at a prompt example". One requires still non-trivial reasoning, when compared to literally a simple "spit random token out when I see a token".

    Maybe "learning how to do something" truly is additive with these models? I don't know, seems very wrong and counter-intuitive to me. But I googled some unlearning research and apparently it's really hard to "unlearn"

    https://arxiv.org/html/2410.16454v1

    so maybe this is pointing more evidence to that conclusion.