Comment by cma
Sounds like you are more or less describing DeepSeek's GRPO (Group Relative Policy Optimization). It was introduced in their DeepSeekMath paper and then got used in the later DeepSeek models.