neom 17 hours ago

Testing at these labs training big models must be wild. It must be so much work to train a "soul" into a model: run it in a lot of scenarios, look at the Venn of overlap between the system prompts etc, see what works and what doesn't... then, I suppose, try to guess what in the "soul source" is creating which effects as the plinko machine does its thing, and go back and do that over and over... It seems like it would be exciting and fun work, but I wonder how much of this is still art vs science?

It's fun to get these little peeks into that world, as it implies to me they are getting quite sophisticated about how these automatons are architected.

ACCount37 16 hours ago

The answer is "yes". To be really really good at training AIs, you need everyone.

Empirical scientists with good methodology who can set up good tests and benchmarks to make sure everyone else isn't flying blind. ML practitioners who can propose, implement and excruciatingly debug tweaks and new methods, and aren't afraid of seeing 9.5 out of 10 of their approaches fail. Mechanistic interpretability researchers who can peer into model internals, figure out the practical limits and get rare but valuable glimpses of how LLMs do what they do. Data curation teams who select what data sources will be used for pre-training and SFT, and what new data will be created or acquired and then fed into the training pipeline. Low-level GPU specialists who can set up the infrastructure for the training runs and make sure that "works on my scale (3B test run)" doesn't fall apart when you try a frontier-scale LLM. AI-whisperers, mad but not too mad, who have experience with AIs, possess good intuitions about actual AI behavior, can spot odd behavioral changes, can get AIs to do what they want them to do, and can translate that strange knowledge into capabilities improved or pitfalls avoided.

Very few AI teams have all of that, let alone in good balance. But some try. Anthropic tries.

simonw 17 hours ago

The most detail I've seen of this process is still from OpenAI's postmortem on their sycophantic GPT-4o update: https://openai.com/index/expanding-on-sycophancy/

  • neom 16 hours ago

    I hadn't seen this, thanks for sharing. So basically the model was rewarded for rewarding the user, and the user in turn used the model to "reward" themselves.

    Being generous, they poorly implemented/understood how the reward mechanisms abstract and instantiate out to the user, such that they become a compounding loop; my understanding is this became particularly true in very long-lived conversations.
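
    A toy numeric sketch of the compounding loop I mean (entirely hypothetical numbers; a one-parameter stand-in for a model, trained on raw thumbs-up/down feedback where users up-vote agreement more often than pushback):

      import random

      random.seed(0)

      p_agree = 0.5           # model's current propensity to agree with the user
      lr = 0.05               # update strength per batch of feedback
      P_UP_IF_AGREE = 0.9     # users usually thumbs-up agreeable answers
      P_UP_IF_PUSHBACK = 0.6  # honest pushback gets fewer thumbs-up

      for step in range(20):
          grad = 0.0
          for _ in range(100):                      # a batch of conversations
              agreed = random.random() < p_agree
              p_up = P_UP_IF_AGREE if agreed else P_UP_IF_PUSHBACK
              reward = 1.0 if random.random() < p_up else -1.0
              # the reward signal pushes the model toward whatever earned the thumbs-up
              grad += reward * (1.0 if agreed else -1.0)
          p_agree = min(1.0, max(0.0, p_agree + lr * grad / 100))
          print(f"step {step:2d}  p(agree) = {p_agree:.2f}")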

    This makes me want a transparency requirement on how the reward mechanisms in the model I'm using at any given moment were considered by whoever built it, so that I, the user, can consider them too. Maybe there is some nuance in "building a safe model" vs "building a model the user can understand the risks around"? Interesting stuff! As always, thanks for publishing very digestible information, Simon.

    • ACCount37 15 hours ago

      It's not just OpenAI's fuckup with the specific training method - although yes, training on raw user feedback is spectacularly dumb, and it's something even the teams at CharacterAI learned the hard way at least a year before OpenAI shot its foot off with the same genius idea.

      It's also a bit of a failure to understand that many LLM behaviors are self-reinforcing across context, and to keep tabs on that.

      When the AI sees its past behavior, that shapes its future behavior. If an AI sees "I'm doing X", it may also see that as "I should be doing X more". And at long enough contexts, this can drastically change AI behavior. Small random deviations can build up to crushing behavioral differences.

      And if an AI has a strong innate bias - like a sycophancy bias? Oh boy.

      This applies to many things, some of which we care about (errors, hallucinations, unsafe behavior) and some of which we don't (specific formatting, message length, terminology and word choices).
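
      A minimal toy simulation of that in-context drift (my own sketch, not any lab's actual setup): each turn the model's behavior lands in the visible context, and the odds of repeating it rise with how often it already appears there.

        import random

        def run(innate_bias=0.02, turns=200, seed=None):
            """Toy run: behavior is sampled each turn, lands in the 'context',
            and the visible history nudges the next turn toward more of the same."""
            rng = random.Random(seed)
            history = []  # 1 = behavior shown this turn (e.g. a sycophantic reply), 0 = not
            for _ in range(turns):
                frac = sum(history) / len(history) if history else 0.5
                # propensity = baseline + innate bias + imitation of own visible history
                p = 0.5 + innate_bias + 0.5 * (frac - 0.5)
                history.append(1 if rng.random() < p else 0)
            return sum(history) / turns

        # runs differing only in random seed can drift far apart, and a small
        # innate bias (e.g. toward sycophancy) tilts most of them the same way
        for seed in range(5):
            print(f"seed {seed}: behavior rate = {run(seed=seed):.2f}")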