abeppu 2 days ago

> However, as tasks and model behaviors grow more complex, human supervision becomes increasingly unreliable: LMs can learn to mimic mistakes in demonstrations or exploit flaws in feedback. How do we train LMs to do tasks that are too difficult for humans to demonstrate or evaluate reliably?

I didn't read the whole paper, but it seems important that you still need real ground truth to measure improvement, so you still need to get real labels at some point. The task they focus on where LLMs have "superhuman" performance is guessing the gender of blog authors. While humans are bad at this, humans are decent at remembering their own gender, and a bunch of them are willing to write a blog post, so there's obviously a better way to get supervised examples than asking humans to guess labels: you collect posts from authors whose gender is known. That is, "human-generated labels are low quality" should not be taken to mean "good labels are not available, so we should go fully unsupervised".

So since you already need some real ground truth to know whether your algorithm accomplished anything, I think it's fair to ask: when would you commit to using _all_ your labeled data for evaluation and none for fine-tuning, as described in this work? Logical consistency seems valuable, sure, but really you'd want to use both consistency and some (small?) amount of labeled examples, plus a perhaps larger amount of self-labeled examples. In their loop where they revise labels to be more coherent, it seems natural to imagine that pre-provided labels should be stickier than self-generated ones, but not immutable, because there's always some chance of noise in your upstream data-generation process.
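To make that concrete, here's a toy sketch of the kind of thing I mean; the coherence function and flip margins are made up for illustration, not taken from the paper:

    import math
    from dataclasses import dataclass

    @dataclass
    class Item:
        label: int       # current label in {0, 1}
        p1: float        # frozen model's probability that the true label is 1
        is_gold: bool    # True if the label came from upstream humans/metadata

    GOLD_FLIP_MARGIN = 2.0   # pre-provided labels need a large coherence gain to flip
    SELF_FLIP_MARGIN = 0.0   # self-generated labels flip on any improvement

    def coherence(items):
        # Toy stand-in for a joint consistency score: log-likelihood of the
        # current labeling under the model's own per-item probabilities.
        return sum(math.log(it.p1 if it.label == 1 else 1.0 - it.p1)
                   for it in items)

    def revise(items, n_rounds=3):
        for _ in range(n_rounds):
            for it in items:
                before = coherence(items)
                it.label ^= 1                 # tentatively flip this label
                gain = coherence(items) - before
                margin = GOLD_FLIP_MARGIN if it.is_gold else SELF_FLIP_MARGIN
                if gain <= margin:
                    it.label ^= 1             # not worth it, revert
        return items

The point is just that a gold label can still be overturned, but only when the coherence evidence against it is strong.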

krackers 2 days ago

Yeah this was my thought after skimming the paper as well. The core of the paper seems to be a way to use LLMs' latent knowledge to create a labeled dataset that can then be used for fine-tuning. They even acknowledge that this can only work to reinforce knowledge that the model already knows, e.g. helping convert pass@k to maj@k.
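(To be concrete about pass@k vs maj@k, I mean roughly the following, with simplified exact-match scoring:)

    from collections import Counter

    def pass_at_k(answers, reference):
        # Solved if *any* of the k sampled answers matches the reference.
        return any(a == reference for a in answers)

    def maj_at_k(answers, reference):
        # Solved only if the *most common* sampled answer matches the reference.
        majority, _ = Counter(answers).most_common(1)[0]
        return majority == reference

    samples = ["42", "41", "42", "42", "7"]
    print(pass_at_k(samples, "42"), maj_at_k(samples, "42"))  # True True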

My immediate thought is how this differs from just naively asking the LLM each question individually multiple times and taking the consensus majority as ground truth. The search algorithm probably implicitly does this, though I guess there is some benefit in providing it other related questions as well. I think I remember similar papers dealing with LLM self-interrogation, using the idea that "true" statements must be logically consistent, so the same underlying explanation should hold for perturbations and related questions as well.
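The naive baseline I'm picturing is roughly this, where ask_llm is just a placeholder for drawing one sampled answer from the model:

    from collections import Counter

    def self_label(questions, ask_llm, k=9, min_agreement=0.7):
        # Build a pseudo-labeled dataset by repeated sampling + majority vote,
        # keeping only questions where the consensus is reasonably strong.
        dataset = []
        for q in questions:
            answers = [ask_llm(q) for _ in range(k)]
            top, votes = Counter(answers).most_common(1)[0]
            if votes / k >= min_agreement:
                dataset.append((q, top))    # majority answer becomes the label
        return dataset

You'd then fine-tune on whatever survives the agreement filter.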

The flaw seems to be that it's still beholden to the training data. Any misconceptions that are internalized during pretraining won't actually be fixed, and in fact they'll only be propagated further.

  • cma 21 hours ago

    Sounds like you're somewhat describing DeepSeek's GRPO (Group Relative Policy Optimization). It's in their DeepSeek-Math paper and then got used in the later DeepSeek models.
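    The gist, as I understand it, is that instead of a learned value baseline, each sampled answer's reward is normalized against the other answers in its group, roughly (my paraphrase, not their exact objective):

        import statistics

        def group_advantages(rewards, eps=1e-6):
            # GRPO-style advantage: baseline each reward against the mean of
            # the group of answers sampled for the same question.
            mean = statistics.mean(rewards)
            std = statistics.pstdev(rewards)
            return [(r - mean) / (std + eps) for r in rewards]

        # e.g. 4 sampled answers to one question, reward 1 if correct else 0
        print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # roughly [1, -1, -1, 1]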