Comment by krackers 2 days ago

Yeah this was my thought after skimming the paper as well. The core of the paper seems to be a way to use an LLM's latent knowledge to create a labeled dataset that can then be used for fine-tuning. They even acknowledge that this can only reinforce knowledge the model already has, e.g. helping convert pass@k into maj@k.
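To make the pass@k vs maj@k distinction concrete, here's a minimal sketch (the answer strings and reference value are made up for illustration): pass@k gives credit if any of k samples is right, maj@k only if the majority-vote answer is right.

    from collections import Counter

    def pass_at_k(answers, reference):
        # pass@k: credit if ANY of the k sampled answers is correct
        return any(a == reference for a in answers)

    def maj_at_k(answers, reference):
        # maj@k: credit only if the MAJORITY-VOTE answer is correct
        top_answer, _ = Counter(answers).most_common(1)[0]
        return top_answer == reference

    # 5 samples where only a minority find the right answer "42":
    samples = ["17", "42", "17", "23", "17"]
    print(pass_at_k(samples, "42"))  # True
    print(maj_at_k(samples, "42"))   # False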

My immediate thought is how this differs from just naively asking the LLM each question individually multiple times and taking the majority consensus as ground truth. The search algorithm probably does this implicitly, though I guess there is some benefit in providing it other related questions as well. I think I remember similar papers dealing with LLM self-interrogation, using the idea that "true" statements must be logically consistent, so the same underlying explanation should hold for perturbations and related questions as well.
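Roughly, that naive baseline would look like this (a sketch only; ask_llm is a hypothetical sampling call, not anything from the paper):

    from collections import Counter

    def consensus_label(ask_llm, question, n_samples=16):
        # Sample the model several times at nonzero temperature and take the
        # majority answer as the pseudo ground-truth label.
        answers = [ask_llm(question) for _ in range(n_samples)]
        label, votes = Counter(answers).most_common(1)[0]
        return label, votes / n_samples  # label plus an agreement score

    # (question, label) pairs with high agreement could then be kept as a
    # fine-tuning set, which only reinforces what the model already "knows".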

The flaw seems to be that it's still beholden to the training data. Any misconceptions that are internalized during pretraining won't actually be fixed, and in fact they'll only be propagated further.

cma 21 hours ago

Sounds like you are somewhat describing DeepSeek's GRPO (Group Relative Policy Optimization). It's in their DeepSeekMath paper and was then used in the later DeepSeek models.
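For reference, the group-relative part is just: sample a group of responses per prompt, score them, and normalize each reward against the group's mean and standard deviation to get advantages, so no separate value/critic model is needed. A simplified sketch (not DeepSeek's full objective, which also has the clipped policy ratio and a KL penalty):

    import statistics

    def group_relative_advantages(rewards):
        # GRPO-style advantage: each sampled response is scored relative to
        # its siblings in the same group.
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
        return [(r - mean) / std for r in rewards]

    # e.g. 4 responses to one prompt, reward 1 if the final answer checks out:
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # -> [1.0, -1.0, -1.0, 1.0]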