What:
RLHF is a system of RL (albeit a very weak form of it) designed to keep AI models aligned with human values.
How?
- Models go through their normal pretraining. At this point they can produce coherent text.
- This is imperfect, though - the model outputs stuff that goes against our values. The solution? Fine-tune it! Ok… but how?
- Using human annotators, rank possible outputs from the model in terms of quality.
- Take these rankings and train another model to stand in for the human annotators… We'll call this the reward model (a minimal training sketch follows this list).
- The reward model can now score the base model's outputs according to human preferences.
- We fine-tune the base model using RL, with an algorithm called Proximal Policy Optimisation (PPO) maximising the reward model's score. This encourages positive, human-aligned text.
- To ensure we don't lose the coherence of the original base model, we add a KL penalty for drifting too far from its outputs (see the second sketch below).
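
For concreteness, here's a minimal sketch of how a reward model can be trained from those rankings: a scalar scoring head on top of a backbone LM, trained with a pairwise Bradley-Terry-style loss so the human-preferred completion scores higher than the rejected one. All names here (TinyBackbone, RewardModel, pairwise_ranking_loss) are illustrative, not from any particular library, and the toy backbone stands in for a real pretrained model.

```python
# Minimal sketch of reward-model training from human rankings (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBackbone(nn.Module):
    """Stand-in for a pretrained LM body: token ids -> per-token hidden states."""
    def __init__(self, vocab_size=1000, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids):
        return self.embed(input_ids)                           # (batch, seq, hidden)

class RewardModel(nn.Module):
    """Maps a completion to a single scalar 'how much would a human like this' score."""
    def __init__(self, backbone, hidden_size=64):
        super().__init__()
        self.backbone = backbone
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)                      # (batch, seq, hidden)
        return self.score_head(hidden[:, -1, :]).squeeze(-1)   # score from last position

def pairwise_ranking_loss(score_chosen, score_rejected):
    """Bradley-Terry loss: push the preferred completion's score above the rejected one."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# One conceptual training step on a batch of (preferred, rejected) completion pairs.
rm = RewardModel(TinyBackbone())
optimiser = torch.optim.Adam(rm.parameters(), lr=1e-4)
chosen_ids = torch.randint(0, 1000, (8, 32))     # tokenised completions humans ranked higher
rejected_ids = torch.randint(0, 1000, (8, 32))   # tokenised completions humans ranked lower
loss = pairwise_ranking_loss(rm(chosen_ids), rm(rejected_ids))
loss.backward()
optimiser.step()
```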
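
And here's a sketch of the reward that PPO then actually optimises: the reward model's score minus a KL penalty against the frozen original model, which is what keeps the fine-tuned policy from drifting into incoherence. The full PPO update is omitted, and kl_coef plus the log-prob tensors are made-up placeholders, not outputs of any specific library.

```python
# Sketch of the KL-penalised reward used during PPO fine-tuning.
import torch

def rlhf_reward(rm_score: torch.Tensor,
                policy_logprobs: torch.Tensor,   # log p_policy(token) for each generated token
                ref_logprobs: torch.Tensor,      # log p_original(token) from the frozen base model
                kl_coef: float = 0.1) -> torch.Tensor:
    """Per-sequence reward = RM score - kl_coef * sum_t (log p_policy - log p_ref)."""
    kl_per_token = policy_logprobs - ref_logprobs     # approximate per-token KL contribution
    kl_penalty = kl_coef * kl_per_token.sum(dim=-1)   # (batch,)
    return rm_score - kl_penalty

# Example: a completion the reward model likes, but that drifts from the base model,
# gets part of its score clawed back by the KL penalty.
rm_score = torch.tensor([2.0])
policy_lp = torch.tensor([[-1.0, -0.5, -0.8]])
ref_lp = torch.tensor([[-1.2, -1.0, -0.9]])
print(rlhf_reward(rm_score, policy_lp, ref_lp))       # < 2.0: the policy moved away from the ref
```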
Controversy:
Andrej Karpathy has shat on RLHF, saying it's basically not RL. Honestly, I'm inclined to agree. Basically:
- The Reward Model looks at the LLM's output and, based on its general vibe (as decreed by humans), encourages or discourages it. It's a crappy proxy (as opposed to the better but more abstract goal of "reward it based on how correct it was" - and what does it even mean to be correct?).
- You also don't get the creativity of RL. It's the classic AlphaGo vs AlphaZero contrast: AlphaGo leaned on human play, while AlphaZero's pure self-play went beyond it.