Comment by 7moritz7
Haven't RLHF, and RL with LLM feedback, been around for years now?
Large latent flow models are unbiased. On the other hand, if you purely use policy optimization, RLHF will be biased towards short horizons. If you add in a value network, the value estimate carries its own bias (e.g. an MSE loss on the value implies a Gaussian error assumption). Also, most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal, and SGD smooths that landscape incorrectly. So, basically, there are a lot of biases that show up in RL training, which can make it both hard to train and, even when it succeeds, not necessarily optimizing what you want.
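As a rough illustration of the last two points, here is a minimal PyTorch sketch (not from the comment itself) of how a preference network is commonly trained with a Bradley-Terry pairwise loss and how a value head is fit with MSE; the linear heads, feature sizes, and variable names are toy assumptions for the example.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the preference (reward) model and the value head.
reward_model = torch.nn.Linear(128, 1)
value_head = torch.nn.Linear(128, 1)

def preference_loss(chosen_feats, rejected_feats):
    """Bradley-Terry pairwise loss: maximize the log-probability that the
    chosen response scores higher than the rejected one."""
    r_chosen = reward_model(chosen_feats).squeeze(-1)
    r_rejected = reward_model(rejected_feats).squeeze(-1)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def value_loss(state_feats, returns):
    """MSE regression of the value head onto observed returns -- the
    'Gaussian bias' point: an L2 loss implicitly assumes Gaussian errors."""
    v = value_head(state_feats).squeeze(-1)
    return F.mse_loss(v, returns)

# Toy usage with random features, just to show the shapes.
chosen, rejected = torch.randn(8, 128), torch.randn(8, 128)
states, returns = torch.randn(8, 128), torch.randn(8)
loss = preference_loss(chosen, rejected) + value_loss(states, returns)
loss.backward()
```

The pairwise loss is where the quasi-adversarial flavor comes in: the policy is later optimized against the scores this network produces, so any quirks in its learned landscape get amplified.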