Comment by cma 20 hours ago

Sounds like you are more or less describing DeepSeek's GRPO (Group Relative Policy Optimization). It was introduced in their DeepSeekMath paper and was then used in the later DeepSeek models.
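
For anyone who hasn't seen it, here is a rough sketch of the group-relative part as I understand it: sample several completions for the same prompt, score each one, and normalize every reward against the mean and standard deviation of its own group instead of using a learned value function. The function name, the `eps` term, and the choice of population standard deviation are my own assumptions for illustration, and this leaves out the clipped policy-ratio objective and the KL penalty entirely.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group of sampled completions.

    Hypothetical helper for illustration only, not code from any DeepSeek
    release. `eps` avoids division by zero when every reward in the group
    is identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, four sampled completions, scored 1.0 if correct and 0.0 if not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat their group's average get a positive advantage and the rest get a negative one, so the group itself acts as the baseline.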