What:

We designed this system of RL (albeit a very weak form of RL) to keep AI models aligned with human values. A great conceptual explainer video is here.

Overview, Oversimplified:

(More details here)
0. Models go through their normal pretraining. Now the model can produce coherent text. But this is imperfect - it outputs stuff that goes against our values. The solution? Fine-tune it! Ok… But how?

  1. Using human annotators, create examples of good prompt-response pairs. Fine-tune your LLM to produce answers like them. Now, get it to produce 2 answers to a prompt, and get a human to rank them against each other (a sketch of this ranking loop is below).
  2. Take these rankings, and train another model to be a human annotator/ranker… We’ll call this the reward model.
  3. The reward model can now judge the base model based on human preferences. We’ll use that fact to do RL:
    1. We give our model a prompt, ask our RM to rate the answer, turn that rating into a reward via Proximal Policy Optimisation (PPO), and then feed that back into the policy.
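
A rough sketch of what the ranking-collection loop in step 1 could look like. Everything here (PreferencePair, generate, ask_human_to_rank) is a hypothetical placeholder standing in for your fine-tuned model's sampling code and your annotation tooling, not any particular library:

```python
# Toy sketch of collecting pairwise preference data (step 1 above).
from dataclasses import dataclass
import random


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the answer the human preferred
    rejected: str  # the answer the human ranked lower


def generate(prompt: str) -> str:
    # Placeholder for sampling a completion from the fine-tuned LLM.
    return f"some completion of '{prompt}' #{random.randint(0, 9)}"


def ask_human_to_rank(prompt: str, a: str, b: str) -> PreferencePair:
    # Placeholder for the human annotator's judgement; random here.
    chosen, rejected = (a, b) if random.random() < 0.5 else (b, a)
    return PreferencePair(prompt, chosen, rejected)


def collect_preferences(prompts: list[str]) -> list[PreferencePair]:
    pairs = []
    for prompt in prompts:
        answer_a, answer_b = generate(prompt), generate(prompt)
        pairs.append(ask_human_to_rank(prompt, answer_a, answer_b))
    return pairs


dataset = collect_preferences(["A dog is", "Explain RLHF in one line"])
```

This dataset of (prompt, chosen, rejected) triples is exactly what the reward model in step 2 gets trained on.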

The RL, a bit more in-depth:

  1. Setup, you have:
    • A frozen, initial language model (base LM)
    • A Tuned Language Model (being updated with PPO), aka a policy.
    • A model trained to act like a human annotator (a reward model)
  2. You take a prompt: "A dog is…"
  3. You get both models to generate outputs
  4. You compute how much their outputs diverge, using KL divergence.
  5. You also, simultaneously, feed your policy’s output into the Reward Model.
  6. You combine the KL divergence with the reward model's score.
    • It’s literally a sum: total reward = RM score − β × KL(policy, base LM). You reward it for being aligned with human preferences, and punish it for a higher divergence from the base LM (see the sketch after this list).
  7. You take the total reward signal and pass that into the PPO algorithm to see by how much the model has to update. Since we’re trying to maximise, not minimise, it’s actually gradient ascent.
  8. Repeat. Repeat it a lot.
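
To make steps 4-7 concrete, here's a minimal sketch of the reward computation, assuming we already have per-token log-probabilities from the policy and the frozen base LM plus a scalar score from the reward model. The numbers and the beta value are made up; a real implementation (e.g. TRL's PPOTrainer) also computes advantages and does the clipped PPO update on top of this:

```python
import torch

beta = 0.02  # KL penalty coefficient (illustrative value)

# Pretend per-token log-probs for the 5 tokens the policy generated.
logp_policy = torch.tensor([-1.2, -0.8, -2.1, -0.5, -1.7])  # tuned model (policy)
logp_base   = torch.tensor([-1.0, -0.9, -2.5, -0.4, -1.6])  # frozen base LM

# Step 4: approximate KL divergence between policy and base LM on this sample.
approx_kl = (logp_policy - logp_base).sum()

# Step 5: the reward model scores the policy's full answer with one scalar.
rm_score = torch.tensor(0.9)

# Step 6: the "sum" -- reward human-preferred behaviour, punish drift from the base LM.
total_reward = rm_score - beta * approx_kl

# Step 7: PPO turns this reward into advantages and takes a clipped
# gradient *ascent* step on the policy's parameters.
print(total_reward)
```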

Why Include The Base LLM?

Imagine we didn’t include the base LLM, and so had no KL divergence. Well then the policy would learn to just output incoherent (but safe!) text. E.g. "puppies love happy playgrounds" etc.

Note on RM:

There’s a formula that you can use that takes a preference from A → B and returns a scalar.
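
That formula is usually a Bradley-Terry style pairwise loss (essentially the one used for InstructGPT's reward model): score both answers with the RM and minimise -log sigmoid(r_chosen - r_rejected). A toy sketch with made-up scores:

```python
import torch
import torch.nn.functional as F

# Pretend scalar scores the reward model assigned to each answer.
reward_chosen   = torch.tensor(1.3)  # the answer the human preferred (A)
reward_rejected = torch.tensor(0.2)  # the answer ranked lower (B)

# loss = -log sigmoid(r_chosen - r_rejected); minimising it pushes the
# preferred answer's score above the rejected one's.
loss = -F.logsigmoid(reward_chosen - reward_rejected)
print(loss)
```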

Why Bother with Reinforcement Learning?

Have you ever thought about it? Why not just fine-tune your model on human preferences directly? It would be a lot simpler, no?

  • Fine-tuning: is akin to telling the model "here’s the correct answer, memorise it". This is an essential part of the process. Learn more about it here.
  • RLHF: is akin to telling the model "here are two options, humans like this one more. Learn from that".

With fine-tuning, you may have to give examples of every possible behaviour we don’t want; RLHF learns the general pattern of what not to produce. Fine-tuning also implies that the human-annotated response is the best possible response, whereas RLHF leaves flexibility to learn something that humans prefer even more.

Controversy šŸ‘€:

Andrej Karpathy has shat on RLHF, saying it’s just barely RL. Honestly, I’m inclined to agree. Basically:

  1. The Reward Model looks at the LLM’s output. Based on its general vibe (as decreed by humans), it encourages / discourages it. It’s a crappy proxy (as opposed to the better but abstract goal of "reward it based on how 'correct' it was" - what does it mean to be correct?).
  2. You also don’t get the creativity of true RL. It’s the classic AlphaGo vs AlphaZero comparison: imitating human play vs discovering better strategies through pure self-play.

AKA Instruction Tuning

Models that have been instruction tuned perform better on unseen tasks. Interesting, innit.