Comment by programjames 17 hours ago

Large latent flow models are unbiased. On the other hand, if you purely use policy optimization, RLHF will be biased towards short horizons. If you add in a value network, the value estimate carries its own bias (e.g. an MSE loss on the value amounts to assuming Gaussian noise, since minimizing MSE is the maximum-likelihood fit under a Gaussian model). Also, most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal, and SGD smooths fractal landscapes incorrectly. So, basically, there are a lot of biases that show up in RL training, which can make it both hard to train and, even when it succeeds, not necessarily optimizing what you want.
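To make the first two biases concrete, here is a toy Python sketch (illustrative only; the discount factor, the lognormal return distribution, and all numbers are my assumptions, not from the comment). The first part shows why discounted policy optimization downweights long horizons; the second shows why an MSE-trained value network can misestimate skewed returns:

```python
import numpy as np

# Short-horizon bias: with discount gamma < 1, a reward t steps away
# contributes only gamma**t to the return the policy gradient sees.
gamma = 0.99  # assumed discount factor, for illustration
for t in (1, 10, 100, 1000):
    print(f"reward {t} steps out is weighted {gamma**t:.4f}")

# Gaussian bias: minimizing MSE converges to the mean, which is the
# maximum-likelihood estimate only under Gaussian noise. For a skewed
# return distribution, that mean can sit far from the typical outcome.
rng = np.random.default_rng(0)
returns = rng.lognormal(mean=0.0, sigma=1.5, size=100_000)  # skewed returns
mse_fit = returns.mean()      # what an MSE value loss converges to
typical = np.median(returns)  # a robust summary of the same data
print(f"MSE-optimal value: {mse_fit:.2f}, median return: {typical:.2f}")
```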

storus 17 hours ago

We might not even need RL, as DPO (Direct Preference Optimization) has shown.
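For context, DPO sidesteps the RL loop by optimizing the policy directly on preference pairs. A minimal sketch of the DPO loss in PyTorch (the function name and tensor arguments are hypothetical; per-sequence log-probabilities are assumed precomputed):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))).

    Each argument is a tensor of per-sequence log-probabilities for the
    preferred (chosen) or dispreferred (rejected) completion; `beta`
    controls how far the policy may drift from the reference model.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```

The point is that this is an ordinary supervised-style loss on preference pairs, with no rollouts, reward model sampling, or value network in the training loop.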

  • programjames 16 hours ago

    > if you purely use policy optimization, RLHF will be biased towards short horizons

    > most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal, and SGD smooths fractal landscapes incorrectly