Comment by storus, 17 hours ago (1 reply):

We might not even need RL, as DPO has shown.
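For context, DPO (Direct Preference Optimization) replaces the reward-model-plus-PPO loop with a single supervised-style objective on preference pairs. A minimal sketch of the DPO loss, assuming per-sequence log-probabilities are precomputed; all tensor names here are illustrative, not from the thread:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective (Rafailov et al., 2023).

    Each argument is a tensor of summed per-token log-probabilities
    for the chosen/rejected completions, under either the trainable
    policy or the frozen reference model. `beta` controls how far the
    policy may drift from the reference.
    """
    # Implicit reward of each completion: beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood, maximized via -log-sigmoid
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```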
Comment by programjames, 16 hours ago:
> if you purely use policy optimization, RLHF will be biased towards short horizons
> most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal, which SGD smooths incorrectly
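On the quoted question of how the preference network is trained: in standard RLHF it is a reward model fit on human comparison pairs with a pairwise Bradley-Terry loss. A minimal sketch, assuming a backbone that returns last-layer hidden states; the class and names are hypothetical:

```python
import torch
import torch.nn.functional as F

class RewardModel(torch.nn.Module):
    """Hypothetical reward model: a backbone plus a scalar head that
    scores a (prompt, response) token sequence."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone
        self.score_head = torch.nn.Linear(hidden_size, 1)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)              # (batch, seq, hidden)
        return self.score_head(hidden[:, -1]).squeeze(-1)  # scalar per sequence

def preference_loss(model, chosen_ids, rejected_ids):
    # Bradley-Terry: maximize P(chosen > rejected) = sigmoid(r_c - r_r)
    r_chosen = model(chosen_ids)
    r_rejected = model(rejected_ids)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The quote's adversarial framing comes in once the policy is then optimized against this learned scorer: the reward model and the policy form a coupled objective rather than a fixed supervised loss.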