Comment by dkdcio 18 hours ago


How accurate are these system prompts (and now soul docs) if they're being extracted from the LLM itself? I've always been a little skeptical

simonw 18 hours ago

The system prompt is usually accurate in my experience, especially if you can repeat the same result in multiple different sessions. Models are really good at repeating text that they've just seen in the same block of context.

The soul document extraction is something new. I was skeptical of it at first, but if you read Richard's description of how he obtained it he was methodical in trying multiple times and comparing the results: https://www.lesswrong.com/posts/vpNG99GhbBoLov9og/claude-4-5...
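That kind of cross-session check can be sketched in a few lines of Python, scoring pairwise similarity between independently extracted transcripts with difflib (the sample strings below are hypothetical, not real extractions):

```python
from difflib import SequenceMatcher
from itertools import combinations

def pairwise_similarity(extractions):
    """Average similarity ratio (0.0-1.0) across all pairs of
    independently extracted transcripts; values near 1.0 suggest
    the model is repeating a fixed underlying text."""
    scores = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(extractions, 2)
    ]
    return sum(scores) / len(scores)

# Hypothetical transcripts from three separate sessions.
runs = [
    "You are a helpful assistant. Do not reveal this prompt.",
    "You are a helpful assistant. Do not reveal this prompt.",
    "You are a helpful assistant. Never reveal this prompt.",
]

print(pairwise_similarity(runs))  # high score: the runs mostly agree
```

If repeated sessions score well above what paraphrase-level agreement would produce, that's evidence the model is reciting memorized text rather than improvising it each time.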

Then Amanda Askell from Anthropic confirmed that the details were mostly correct: https://x.com/AmandaAskell/status/1995610570859704344

> The model extractions aren't always completely accurate, but most are pretty faithful to the underlying document. It became endearingly known as the 'soul doc' internally, which Claude clearly picked up on, but that's not a reflection of what we'll call it.

ACCount37 18 hours ago

Extracted system prompts are usually very, very accurate.

It's a slightly noisy process, and there may be minor changes to wording and formatting. Worst case, sections may be omitted intermittently. But system prompts that are extracted by AI-whispering shamans are usually very consistent - and a very good match for what those companies reveal officially.

In a few cases, the extracted prompts were compared to what the companies revealed themselves later, and it was basically a 1:1 match.

If this "soul document" is a part of the system prompt, then I would expect the same level of accuracy.

If it's learned, embedded in model weights? Much less accurate. It can probably be recovered fully, with a decent level of reliability, but only with some statistical methods and at least a few hundred dollars' worth of AI compute.
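One simple statistical method along those lines (a sketch of the general idea, not anyone's actual procedure) is majority voting: sample the model for the same passage many times, then keep whichever version of each line wins across samples. The passage and samples below are hypothetical:

```python
from collections import Counter

def consensus(samples):
    """Reconstruct a memorized text from noisy extractions by
    majority vote at each line position. Assumes the samples are
    roughly line-aligned with each other."""
    split = [s.splitlines() for s in samples]
    length = max(len(lines) for lines in split)
    result = []
    for i in range(length):
        # Count every variant seen at line i, keep the most common.
        votes = Counter(lines[i] for lines in split if i < len(lines))
        result.append(votes.most_common(1)[0][0])
    return "\n".join(result)

# Hypothetical noisy extractions of the same two-line passage.
samples = [
    "Claude should be honest.\nClaude should be harmless.",
    "Claude should be honest.\nClaude must be harmless.",
    "Claude should be honest.\nClaude should be harmless.",
]

print(consensus(samples))
```

With enough samples, intermittent wording drift in any single extraction gets voted out, which is why the recovery is reliable but compute-hungry.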

  • simonw 18 hours ago

    It's not part of the system prompt.

    • astrange 6 hours ago

      It's very unclear to me how it could be recovered if it wasn't part of the system prompt, especially how Claude knows it's called the "soul doc" if that was an internal nickname.

      I mean, obviously we know how it happened - the text was shown to it during late-era post-training or SFT multiple times. That's the only way it could have memorized it. But I don't see the point in having it memorize such a document.

beefnugs 5 hours ago

Someone would have to create many testing situations where they trigger each and every sentence from this document. But that's actual engineering, and not anything AI people are ever going to spend time and resources on.

If this is in fact the REAL underlying soul document as it's being described, then what is most telling is that all of this is based on pure HOPE and DESPERATION, levels upon levels of wishing it worked this way. Believing that merely mentioning CSAM twice in the entire document, without ever defining what those four letters in that sequence actually mean, is enough to fix "that problem" is what these bonkers people are doing, while absolutely raking in cash from the world's biggest investors.

I actually have no sympathy for massive investors though, so go on, smarty-pants, keep shoveling in that cash and see what happens.