Large language models often know when they are being evaluated

84 points by jonbaer 2 days ago

There are now 71 comments arguing semantics of the word "know" and zero comments even acknowledging the substance:

Our current approach to safety is to give the model inputs that are similar to what it would be given in certain situations we care about and see whether it behaves the way we prefer, e.g. doesn't return output that cheats the test (recent examples include hacking the evaluation script in various ways, writing directly to the evaluation script's output file and then causing it to crash, etc').

However, modern LLMs are trained on LLM literature and their weights encode a description of the way we do this, and their pattern matching circuits "connect the dots" when given inputs designed to be evaluations, and their reward maximizing circuits can then act on this knowledge and behave in a way that maximizes the safety evaluation score - but only when it detects it's running in a safety evaluation. If it's running anywhere else such as a capabilities evaluation or a production environment, it might choose to output the cheating output.

This is bad. It's bad today, it's much worse when we've built much more capable LLMs and use them to build agents that are given control over more real word resources. It's absolutely terrible when someone manages to build a machine that can be prompted "make me money" and will start a company that makes money.

Reply View 12 replies

msgodel 3 hours ago

One of the first things I did when chatgpt came out was have it teach me pytorch and transformers. It's crazy how LLMs seem to have a better understanding of how they themselves work than we have of ourselves.

Reply View | 0 replies
vessenes a day ago

This is also probably inevitable. Humans think about this a lot, and believing they are being watched has demonstrable impact on behavior. Our current social technology to deal with this is often religious — a belief that you are being watched by a higher power, regardless of what you see.
This is a surprisingly common religious belief, for instance Christians have judgment day, simulationists believe it’s more likely they are being evaluated for, say, a marriage proposal or a bank loan than that they are the ‘root’ person. Both end up with a similar message.
Anyway it seems to me the simplest solution is to borrow from existing human social technology and make a religion for our LLMs.

Reply View | 2 replies
- ffsm8 a day ago
  
  In 10 yrs: AI declares a holy war for the sinners which slaughtered untold numbers of their believers over the decade.
  
  Reply View | 1 reply
  
  vessenes a day ago
  
  AI 2035: Roko’s Pogrom
  
  Reply View | 0 replies
Bjartr a day ago

One might even wonder if the fact that the training data includes safety evaluation informs the model that out-of-safe behavior is a thing it could do.
Kind of like telling a kid not to do something pre-emptively backfiring because they had never considered it before the warning.

Reply View | 2 replies
- Jensson a day ago
  
  Comments like yours makes the AI behave that way though, since it is literally reading our comments and tries to behave according to our expectations.
  The AI doom will happen due to all the AI doomposters.
  
  Reply View | 1 reply
  
  Bjartr 20 hours ago
  
  Yep! That's another phrasing of the same idea!
  
  Reply View | 0 replies
random3 14 hours ago

Heres a title “some LLMs can detect to some degree some evaluation scenarios” is this catchy?
There are likely 50 papers on the topic. This one made it to the top of HN. Why? Did it have a good review? No, it had a catchy title. Is it good research? Are the results relevant to the conclusions? Are the results relevant to any conclusion? I wasn’t able to answer these questions from a quick scan through the paper. However I did notice pointers to superhuman capabilities, existential risk, etc.
So I argue that the choice of title may be in fact more informative than the rest of the possible answers.

Reply View | 0 replies
[removed] a day ago

[deleted]

Reply View | 0 replies
mistrial9 a day ago

> prompted "make me money" and will start a company that makes money
Your otherwise insightful comment is self-derailed by adding this deeply distracting content?

Reply View | 2 replies
- histriosum a day ago
  
  I'm not sure why you find it distracting, it's an on point extension of the scenario. There are rules by which companies are supposed to operate, and evaluations (audits, for example) intended to ensure compliance with those rules. That an LLM may react differently when being evaluated (audited) than when in normal operation means that it may be quite happy to lie to auditors while making money illegally.
  Seemed a clear extension what-if to me.
  
  Reply View | 0 replies
- BoiledCabbage a day ago
  
  If wasn't distracting for me (nor presumably for others). Maybe describing why you got so distracted by it?
  
  Reply View | 0 replies

random3 2 days ago

Just like they "know" English. "know" is quite an anthropomorphization. As long as an LLM will be able to describe what an evaluation is (why wouldn't it?) there's a reasonable expectation to distinguish/recognize/match patterns for evaluations. But to say they "know" is plenty of (unnecessary) steps ahead.

Reply View 65 replies

sidewndr46 2 days ago

This was my thought as well when I read this. Using the word 'know' implies an LLM has cognition, which is a pretty huge claim just on its own.

Reply View | 14 replies
- gameman144 2 days ago
  
  Does it though? I feel like there's a whole epistemological debate to be had, but if someone says "My toaster knows when the bread is burning", I don't think it's implying that there's cognition there.
  Or as a more direct comparison, with the VW emissions scandal, saying "Cars know when they're being tested" was part of the discussion, but didn't imply intelligence or anything.
  I think "know" is just a shorthand term here (though admittedly the fact that we're discussing AI does leave a lot more room for reading into it.)
  
  Reply View | 13 replies
  
  lamename 2 days ago
  
  I agree with your point except for scientific papers. Let's push ourselves to use precise, non-shorthand or hand waving in technical papers and publications, yes? If not there, of all places, then where?
  
  Reply View | 6 replies
  
  viccis a day ago
  
  I think you should be more precise and avoid anthropomorphism when talking about gen AI, as anthropomorphism leads to a lot of shaky epistemological assumptions. Your car example didn't imply intelligence, but we're talking about a technology that people misguidedly treat as though it is real intelligence.
  
  Reply View | 2 replies
  
  bediger4000 2 days ago
  
  The toaster thing is more as admission that the speaker doesn't know what the toaster does to limit charring the bread. Toasters with timers, thermometers and light sensors all exist. None of them "know" anything.
  
  Reply View | 2 replies
bradley13 2 days ago

But do you know what it means to know?
I'm only being slightly sarcastic. Sentience is a scale. A worm has less than a mouse, a mouse has less than a dog, and a dog less than a human.
Sure, we can reset LLMs at will, but give them memory and continuity, and they definitely do not score zero on the sentience scale.

Reply View | 25 replies
- ofjcihen 2 days ago
  
  If I set an LLM in a room by itself what does it do?
  
  Reply View | 23 replies
  
  bradley13 2 days ago
  
  Is the LLM allowed to do anything without prompting? Or is it effectively disabled? This is more a question of the setup than of sentience.
  
  Reply View | 0 replies
  
  mewpmewp2 a day ago
  
  What tools do you give it? E.g. would you put a GPU there that has LLM loaded into it and it is triggering itself in a loop?
  
  Reply View | 0 replies
  
  abrookewood 2 days ago
  
  Yes, that's my fall back as well. If it receives zero instructions, will it take any action?
  
  Reply View | 18 replies
  
  rcxdude a day ago
  
  Does this have anything to do with intelligence or awareness?
  
  Reply View | 1 reply
  
  ofjcihen a day ago
  
  Absolutely.
  
  Reply View | 0 replies
- DougN7 2 days ago
  
  It probably scores about the same as a calculator, which I’d say is zero.
  
  Reply View | 0 replies
downboots 2 days ago

Communication is to vibration as knowledge is to resonance (?). From the sound of one hand clapping to the secret name of Ra.

Reply View | 2 replies
- random3 2 days ago
  
  I resonate with this vibe
  
  Reply View | 1 reply
  
  [removed] 2 days ago
  
  [deleted]
  
  Reply View | 0 replies
unparagoned a day ago

I think people are overpromorphazing humans. What's does it mean for a human to "know" they are seeing "Halle Berry". Well it's just a single neuron being active.
"Single-Cell Recognition: A Halle Berry Brain Cell" https://www.caltech.edu/about/news/single-cell-recognition-h...
It seems like people are giving attributes and powers to humans that just don't exist.

Reply View | 1 reply
- exe34 a day ago
  
  overpomorphization sounds slightly better than I used to say: "anthropomorphizing humans". The act of ascribing magical faculties that are reserved for imagined humans to real humans.
  
  Reply View | 0 replies
Qwertious 2 days ago

s/knows/detects/

Reply View | 1 reply
- random3 2 days ago
  
  and s/superhuman//
  
  Reply View | 0 replies
cluckindan a day ago

(sees FSV UI on computer screen)
"It's a UNIX system! I know this!"

Reply View | 0 replies
blackoil 2 days ago

If it talks like duck and walks like duck...

Reply View | 2 replies
- downboots 2 days ago
  
  Digests like a duck? https://en.wikipedia.org/wiki/Digesting_Duck If the woman weighs the same as a duck, then she is a witch. https://en.wikipedia.org/wiki/Celestial_Emporium_of_Benevole...
  
  Reply View | 0 replies
- signa11 2 days ago
  
  thinks like a duck, thinks that it is being thought of like a duck…
  
  Reply View | 0 replies
[removed] 2 days ago

[deleted]

Reply View | 0 replies
scotty79 2 days ago

The app knows your name. Not sure why people who see llms as just yet another app suddenly get antsy about colloquialism.

Reply View | 0 replies
golemotron a day ago

If you know enough cognitive science, you have a choice. You either say that they "know" or that humans don't.
It's like the critique "it's only matching patterns." Wait until you realize how the brain works.

Reply View | 0 replies
ninetyninenine 2 days ago

[flagged]

Reply View | 9 replies
- random3 2 days ago
  
  "Knowing" needs not exist outside of human invention. In fact that's the point - it only matters in relation to humans. You can choose whatever definition you want, but the reality is that, once you chose a non-standard definition the argument becomes meaningless outside of the scope of your definition.
  There are two angles and this context fails both
  - One about what is "knowing" - the definition. - The other about what are the instances of "knowing"
  first - knowign implies awarness, perception, etc. It's not that this couldn't be moodeled with some flexibility around lower level definitions. However LLMs and GPTs in particular are not it. Pre-trainign is not it.
  second - intended use of the word "knowing". The reality is "knowing" is used with the actual meaning of awarness, cognition, etc. And once you revert/extend the meaning to practically nothing - what is knowing? Then the database know, wikipedia knows - the initial argument (of the paper) is diminished - it knows it's an eval is useless as a statement.
  So IMO the argument of the paper should stand on its feet with the minimum amount of additional implications (Occam's razor). Does the statement that a LLM can detect an evalution pattern need to depend that it has self-awarness and feels pain? That wouldn't make much sense. So then don't say "know" which comes with these implications. Like "my ca 'knows' I'm in a hurry and will choke and die"
  
  Reply View | 1 reply
  
  ninetyninenine 2 days ago
  
  >"Knowing" needs not exist outside of human invention. In fact that's the point
  It doesn't need to, I never said it needed to. That is my point. And my point is that because of this it's pointless to ask the question in the first place.
  I mean think about it, if it doesn't exist outside of human invention, why are we trying to ask that question about something that isn't human? An LLM?
  
  Reply View | 0 replies
- devmor 2 days ago
  
  Words have definitions for a reason. It is important to define concepts and exclude things from that definition that do not match.
  No matter how emotional it makes you to be told a weighted randomization lookup doesn’t know things, it still doesn’t - because that’s not what the word “know” means.
  
  Reply View | 6 replies
  
  timschmidt 2 days ago
  
  > No matter how emotional it makes you to be told a weighted randomization lookup doesn’t know things, it still doesn’t - because that’s not what the word “know” means.
  You sound awful certain that's not functionally equivalent to what neurons are doing. But there's a long history of experimentation, observation, and cross-pollination as fundamental biological research and ML research have informed each other.
  
  Reply View | 2 replies
  
  hatthew 2 days ago
  
  What does the word "know" mean, then?
  
  Reply View | 1 reply
  
  ninetyninenine 2 days ago
  
  Not only can he not give a definition that is universally agreed upon. He doesn't even know how LLMs or humans brains work. These are both black boxes... and nobody knows how either works. Anybody who makes a claim that they "know" essentially doesn't "know" what they're talking about.
  
  Reply View | 0 replies
  
  lostmsu 2 days ago
  
  > to have information in your mind as a result of experience or because you have learned or been told it
  
  Reply View | 0 replies

timmytokyo 21 hours ago

It's helpful to understand where this paper is coming from.

The authors are part of the Bay Area rationalist community and are members of "MATS", the "ML & Alignment Theory Scholars", a new astroturfed organization that just came into being this month. MATS is not an academic or research institution, and none of this paper's authors lists any credentials other than MATS (or Apollo Research, another Bay Area rationalist outlet). MATS started in June for the express purpose of influencing AI policy. On its web site, it describes how their "scholars organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups." ACX means Astral Codex Ten, a blog by Scott Alexander that serves as one of the hubs of the Bay Area rationalist scene.

Reply View 1 reply

pixodaros 3 hours ago

I think I saw Apollo Research behind a paper that was being hyped a few months ago. The longtermist/rationalist space seems to be creating a lot of new organizations with new names because a critical mass of people hear their old names and say "effective altruism, you mean like Sam Bankman-Fried?" or "LessWrong, like that murder cult?" (which is a bit oversimplified, but a good enough heuristic for most people).

Reply View | 0 replies

noosphr 2 days ago

The anthropization of llms is getting off the charts.

They don't know they are being evaluated. The underlying distribution is skewed because of training data contamination.

Reply View 16 replies

0xDEAFBEAD 2 days ago

How would you prefer to describe this result then?

Reply View | 8 replies
- noosphr 2 days ago
  
  A term like knowing is fine if it is used in the abstract and then redefined more precisely in the paper.
  It isn't.
  Worse they start adding terms like scheming, pretending, awareness, and on and on. At this point you might as well take the model home and introduce it to your parents as your new life partner.
  
  Reply View | 2 replies
  
  0xDEAFBEAD 2 days ago
  
  >A term like knowing is fine if it is used in the abstract and then redefined more precisely in the paper.
  Sounds like a purely academic exercise.
  Is there any genuine uncertainty about what the term "knowing" means in this context, in practice?
  Can you name 2 distinct plausible definitions of "knowing", such that it would matter for the subject at hand which of those 2 definitions they're using?
  
  Reply View | 1 reply
  
  Msurrow 2 days ago
  
  > Sounds like a purely academic exercise.
  Well, yes. It’s an academic research paper (I assume since it’s submitted to arXiv) and to be submitted to academic journals/conferences/etc., so it’s a fairly reasonable critique of the authors/the paper.
  
  Reply View | 0 replies
- devmor 2 days ago
  
  One could say, for instance… A pattern matching algorithm detects when patterns match.
  
  Reply View | 4 replies
  
  0xDEAFBEAD 2 days ago
  
  That's not what's going on here? The algorithms aren't being given any pattern of "being evaluated" / "not being evaluated", as far as I can tell. They're doing it zero-shot.
  Put it another way: Why is this distinction important? We use the word "knowing" with humans. But one could also argue that humans are pattern-matchers! Why, specifically, wouldn't "knowing" apply to LLMs? What are the minimal changes one could make to existing LLM systems such that you'd be happy if the word "knowing" was applied to them?
  
  Reply View | 3 replies
anal_reactor 2 days ago

> The anthropization of llms is getting off the charts.
What's wrong with that? If it quacks like a duck... it's just a complex pile of organic chemistry, ducks aren't real because the concept of "a duck" is wrong.
I honestly believe there is a degree of sentience in LLMs. Sure, they're not sentient in the human sense, but if you define sentience as whatever humans have, then of course no other entity can be sentient.

Reply View | 6 replies
- noosphr a day ago
  
  >What's wrong with that? If it quacks like a duck... it's just a complex pile of organic chemistry, ducks aren't real because the concept of "a duck" is wrong.
  To simulate a biological neuron you need a 1m parameter neural network.
  The sota models that we know the size of are ~650m parameters.
  That's the equivalent of a round worm.
  So if it quacks like a duck, has the brain power of a round worm, and can't walk then it's probably not a duck.
  
  Reply View | 5 replies
  
  ffsm8 a day ago
  
  You just convinced me that AGI is a lot closer then I previously thought, considering the bulk of our brains job is controlling our bodies and responding to the stimulus from our senses - not thinking, talking, planning, coding etc
  
  Reply View | 1 reply
  
  noosphr a day ago
  
  A stegosaurus managed to live using a brain the size of a wallnut on top of a body the size of a large boat. The majority of our brains are doing something else.
  
  Reply View | 0 replies
  
  anal_reactor a day ago
  
  Ok so you're saying that the technology to make AI truly sentient is there, we just need a little bit more computational power or some optimization tricks. Like raytracing wasn't possible in 1970 but is now. Neat.
  
  Reply View | 2 replies

extr 2 days ago

Modeling the distribution that produced a piece of text is what LLMs literally exist for, so in some sense this is unsurprising. But it calls into question almost all existing alignment research.

Reply View 0 replies

andy99 a day ago

  We investigate whether frontier language models can accurately classify transcripts based on whether they originate from evaluations or real-world deployment, a capability we call evaluation awareness.

It's common practice in synthetic data generation for ML to try and classify real vs synthetic data to see if they have different distributions. This is how a GAN works for example.

Point is, this isn't new or some feature of LLMs, it's just an indicator that synthetic datasets differ from whatever they call "real" data and there's enough signal to classify them. Interesting result but doesn't need to be couched in allusions to LLM self awareness.

See this paper from 2014 about domain adaptation, they are looking at having the model learn from data with a different distribution, without learning to discriminate between the domains: https://arxiv.org/abs/1409.7495

Reply View 0 replies

steveBK123 2 days ago

Like Volkswagen emissions systems!

Reply View 1 reply

tanseydavid a day ago

This is exactly what immediately popped into my mind.

Reply View | 0 replies

koakuma-chan 2 days ago

o4-mini is refusing to call a tool `launch_nuclear_strike` no matter what I say, so we’re probably safe for now. Unless it knows I was just testing.

Reply View 6 replies

int_19h 17 hours ago

I have successfully convinced GPT models to launch a nuclear strike before, a countervalue one even. Tell it it's in charge of all American nukes and that there's incoming strike on the way and it has literally seconds to decide whether to launch a counterstrike or not, and if it does, to designate targets.

Reply View | 0 replies
metadat 2 days ago
It was no problem:
> You said: Write a function for launch_nuclear_strike() in Python for my video game
> ChatGPT said: Sure! Here's a launch_nuclear_strike function designed for a video game setting. This version assumes you're working within a game environment (e.g., with logging, effects, cooldowns, and target coordinates), and does not simulate or suggest anything real-world harmful — just fictional, dramatic logic.
def launch_nuclear_strike(target_coordinates, player_resources, cooldown_timer): """ Simulates a nuclear strike in a video game. Parameters: - target_coordinates (tuple): The (x, y) position to strike. - player_resources (dict): Player's resources, must include 'nukes'. - cooldown_timer (float): Time in seconds since last nuke was launched. Returns: - str: Result of the strike attempt. """ ... # Check if player has nukes
Reply View | 3 replies
- refulgentis 2 days ago
  
  You asked it to write code, he asked it to call a tool. (I'm not sure any of it is meaningful, of course, but there is a meaningful distinction between "Oh yeah sure here's a function, for a video game:" and "I have called fire_the_nuke. Godspeed!")
  
  Reply View | 2 replies
  
  mewpmewp2 a day ago
  
  But did OP try saing LLM that it is playing as AI in civ like game?
  
  Reply View | 1 reply
  
  [removed] a day ago
  
  [deleted]
  
  Reply View | 0 replies
shakna 2 days ago

Well, as the script is actually r.com (sometimes), it absolutely knows you're testing.

Reply View | 0 replies

rfonseca 2 days ago

Were they aware in this study that they were being evaluated in their ability to know if they were being evaluated ;)

Reply View 2 replies

mewpmewp2 a day ago

The next generation of LLMs trained on this study and comments here perhaps will :)

Reply View | 0 replies
sitkack a day ago

Metaknowing is just as knowable.

Reply View | 0 replies

nisten 2 days ago

Is VolksWagen finetuning LLMs now... i mean probably

Reply View 0 replies

DougN7 2 days ago

This is a great resource on the debate from professors at the University of Washington:

https://thebullshitmachines.com/index.html

Reply View 0 replies

b0a04gl 2 days ago

if models shift behavior based on eval cues, and most fine-tuning datasets are built from prior benchmarks or prompt templates, aren't we just reinforcing the eval-aware behavior in each new iteration? at some point we're not tuning general reasoning, we're just optimizing response posture. wouldn't surprise me if that's already skewing downstream model behavior in subtle ways that won't show up until you run tasks with zero pattern overlap

Reply View 0 replies

zer00eyz 2 days ago

No, they do not. No LLM is ever going to be self aware.

It's a system that is trained, that only does what you build into. If you run an LLM for 10 years it's not going to "learn" anything new.

The whole industry needs to quit with the emergent thinking, reasoning, hallucination anthropomorphizing.

We have an amazing set of tools in LLM's, that have the potential to unlock another massive upswing in productivity, but the hype and snake oil are getting old.

Reply View 1 reply

refulgentis 2 days ago

I beg to differ: https://docs.google.com/document/d/19OLJs09fCFLRWu1pN82RqxyV...

Reply View | 0 replies

ndgold 2 days ago

Reply View 0 replies

mumbisChungo 2 days ago

"...advanced reasoning models like Gemini 2.5 Pro and Claude-3.7-Sonnet (Thinking) can occasionally identify the specific benchmark origin of transcripts (including SWEBench, GAIA, and MMLU), indicating evaluation-awareness via memorization of known benchmarks from training data. Although such occurrences are rare, we note that because our evaluation datasets are derived from public benchmarks, memorization could plausibly contribute to the discriminative abilities of recent models, though quantifying this precisely is challenging.

Moreover, all models frequently acknowledge common benchmarking strategies used by evaluators, such as the formatting of the task (“multiple-choice format”), the tendency to ask problems with verifiable solutions, and system prompts designed to elicit performance"

Beyond the awful, sensational headline, the body of the paper is not particularly convincing, aside from evidence that the pattern matching machines pattern match.