What:

RLHF is a system of RL (albeit a very weak form of RL) designed to keep AI models aligned with human values.

How?

  1. Models go through their normal pretraining, after which they can produce coherent text.
  2. Now this is imperfect - it outputs stuff that goes against our values. The solution? Fine-tune it! OK… but how?
    1. Using human annotators, rank possible outputs from the models in terms of quality.
    2. Take these ranked outputs and train another model to act as a human annotator… we’ll call this the reward model (see the reward-model sketch after this list).
    3. The reward model can now judge the base model’s outputs according to human preferences.
  3. We fine-tune the base model using RL with an algorithm called Proximal Policy Optimisation (PPO). This encourages human-aligned text (see the PPO sketch after this list).
  4. To ensure we don’t lose the coherence of the original base model, we add a penalty (a KL-divergence term) for drifting too far from it.
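
A minimal sketch of step 2, assuming a hypothetical `reward_model` callable that scores a (prompt, response) pair, and a dataset of (prompt, chosen, rejected) triples derived from the human rankings:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, chosen, rejected):
    """Pairwise (Bradley-Terry style) loss: push the score of the
    human-preferred response above the score of the rejected one."""
    r_chosen = reward_model(prompt, chosen)      # scalar score for preferred output
    r_rejected = reward_model(prompt, rejected)  # scalar score for rejected output
    # -log sigmoid(r_chosen - r_rejected) is minimised when chosen scores well above rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Once trained, the reward model stands in for the human annotators: it hands out a scalar score for any new output, and that score is what the RL step optimises.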
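
And a rough sketch of steps 3 and 4: the reward PPO actually optimises is the reward model’s score minus a KL penalty against the frozen base model, and the policy update uses PPO’s clipped surrogate objective. The names and the `beta` / `clip_eps` values here are illustrative, not from any particular implementation:

```python
import torch

def shaped_reward(rm_score, logprobs_policy, logprobs_ref, beta=0.02):
    """Reward fed to PPO: reward-model score minus a per-token KL penalty
    that keeps the fine-tuned policy close to the original base model."""
    kl = logprobs_policy - logprobs_ref          # sampled per-token log-ratio
    return rm_score - beta * kl.sum(-1)

def ppo_clip_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate: caps how far a single update can move the policy."""
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```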

Controversy 👀:

Andrej Karpathy has shat on RLHF, saying it’s basically not RL. Honestly, I’m inclined to agree. Basically:

  1. The reward model looks at the LLM’s output and, based on its general vibe (as decreed by humans), encourages or discourages it. That’s a crappy proxy (as opposed to the better but harder-to-define goal of “reward it based on how ‘correct’ it was” - what does it even mean to be correct?).
  2. You also don’t get the creativity of real RL. It’s the classic AlphaGo vs AlphaZero contrast: learning from human judgement versus discovering new strategies by optimising the actual objective (winning).