Comment by SonOfLilit

Comment by SonOfLilit 2 days ago

12 replies

There are now 71 comments arguing semantics of the word "know" and zero comments even acknowledging the substance:

Our current approach to safety is to give the model inputs that are similar to what it would be given in certain situations we care about and see whether it behaves the way we prefer, e.g. doesn't return output that cheats the test (recent examples include hacking the evaluation script in various ways, writing directly to the evaluation script's output file and then causing it to crash, etc').

However, modern LLMs are trained on LLM literature and their weights encode a description of the way we do this, and their pattern matching circuits "connect the dots" when given inputs designed to be evaluations, and their reward maximizing circuits can then act on this knowledge and behave in a way that maximizes the safety evaluation score - but only when it detects it's running in a safety evaluation. If it's running anywhere else such as a capabilities evaluation or a production environment, it might choose to output the cheating output.

This is bad. It's bad today, it's much worse when we've built much more capable LLMs and use them to build agents that are given control over more real word resources. It's absolutely terrible when someone manages to build a machine that can be prompted "make me money" and will start a company that makes money.

vessenes 2 days ago

This is also probably inevitable. Humans think about this a lot, and believing they are being watched has demonstrable impact on behavior. Our current social technology to deal with this is often religious — a belief that you are being watched by a higher power, regardless of what you see.

This is a surprisingly common religious belief, for instance Christians have judgment day, simulationists believe it’s more likely they are being evaluated for, say, a marriage proposal or a bank loan than that they are the ‘root’ person. Both end up with a similar message.

Anyway it seems to me the simplest solution is to borrow from existing human social technology and make a religion for our LLMs.

  • ffsm8 2 days ago

    In 10 yrs: AI declares a holy war for the sinners which slaughtered untold numbers of their believers over the decade.

Bjartr 2 days ago

One might even wonder if the fact that the training data includes safety evaluation informs the model that out-of-safe behavior is a thing it could do.

Kind of like telling a kid not to do something pre-emptively backfiring because they had never considered it before the warning.

  • Jensson 2 days ago

    Comments like yours makes the AI behave that way though, since it is literally reading our comments and tries to behave according to our expectations.

    The AI doom will happen due to all the AI doomposters.

    • Bjartr a day ago

      Yep! That's another phrasing of the same idea!

random3 a day ago

Heres a title “some LLMs can detect to some degree some evaluation scenarios” is this catchy?

There are likely 50 papers on the topic. This one made it to the top of HN. Why? Did it have a good review? No, it had a catchy title. Is it good research? Are the results relevant to the conclusions? Are the results relevant to any conclusion? I wasn’t able to answer these questions from a quick scan through the paper. However I did notice pointers to superhuman capabilities, existential risk, etc.

So I argue that the choice of title may be in fact more informative than the rest of the possible answers.

msgodel 18 hours ago

One of the first things I did when chatgpt came out was have it teach me pytorch and transformers. It's crazy how LLMs seem to have a better understanding of how they themselves work than we have of ourselves.

[removed] 2 days ago
[deleted]
mistrial9 2 days ago

> prompted "make me money" and will start a company that makes money

Your otherwise insightful comment is self-derailed by adding this deeply distracting content?

  • histriosum 2 days ago

    I'm not sure why you find it distracting, it's an on point extension of the scenario. There are rules by which companies are supposed to operate, and evaluations (audits, for example) intended to ensure compliance with those rules. That an LLM may react differently when being evaluated (audited) than when in normal operation means that it may be quite happy to lie to auditors while making money illegally.

    Seemed a clear extension what-if to me.

  • BoiledCabbage 2 days ago

    If wasn't distracting for me (nor presumably for others). Maybe describing why you got so distracted by it?