What:

ChatGPT is a Generative Pretrained Transformer. It’s a decoder-only model.

  1. GPT-1 came out in 2018.
  2. GPT-2 (2019) was mostly a scaled-up version: more parameters and more training data, with noticeably more coherent text generation.

Training ChatGPT:

Pretraining LLMs:

  1. Feed the entire internet into the models, getting them (if they’re decoder-only models) to predict the next most likely word (token).
  2. These models will be very good at producing text like what is found on the internet, but not necessarily helpful for answering questions you may have (see the sketch after this list).
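
A minimal sketch of that next-token-prediction objective, assuming a PyTorch-style setup; `model` and `token_ids` are placeholders for whatever decoder-only model and tokenized data you actually use:

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, seq_len) tensor of token ids from any tokenizer.
    # model: placeholder for a decoder-only LM that returns logits of shape
    #        (batch, seq_len - 1, vocab_size) for the inputs below.
    inputs = token_ids[:, :-1]     # the model sees tokens 0 .. n-2
    targets = token_ids[:, 1:]     # and must predict tokens 1 .. n-1
    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and positions
        targets.reshape(-1),                  # the "next word" at each position
    )
```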

Instruction Fine-Tuning:

  1. You get humans to create a corpus of high-quality questions and their answers (i.e. question-answer pairs).
  2. You then fine-tune your pretrained LLM on these pairs, so it learns to produce answers in that style (see the sketch after this list).
    • This is akin to saying: “For a question of this type, respond like this”.
  3. The model can now sort of follow instructions, but it often produces harmful, unhelpful or hallucinated output.
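
A minimal sketch of what that fine-tuning step can look like, again in PyTorch style; `tokenizer` and `model` are placeholders, and the key idea is that the same next-token loss is used, but only the answer tokens are scored:

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # label value that F.cross_entropy skips

def build_sft_example(tokenizer, question, answer):
    # tokenizer.encode is a stand-in for whatever tokenizer you use.
    q_ids = tokenizer.encode(question)
    a_ids = tokenizer.encode(answer)
    input_ids = q_ids + a_ids
    # Only the answer tokens contribute to the loss; the question is context.
    labels = [IGNORE] * len(q_ids) + a_ids
    return torch.tensor(input_ids), torch.tensor(labels)

def sft_loss(model, input_ids, labels):
    logits = model(input_ids.unsqueeze(0))[0]  # (seq_len, vocab_size)
    return F.cross_entropy(
        logits[:-1],         # prediction at each position ...
        labels[1:],          # ... scored against the following token
        ignore_index=IGNORE,
    )
```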

Reinforcement Learning from Human Feedback (RLHF), Oversimplified

  1. Using our instruction fine-tuned model, we produce 2 answers to a prompt, and get a human to rank them against each other according to which is more helpful, honest and harmless.
  2. We take these rankings and train another model to mimic the human annotator/ranker… We’ll call this the reward model.
  3. The reward model can now judge the base model’s outputs according to human preferences. We’ll use that fact to do RL (see the sketches after this list):
    1. We give our model a prompt, ask our RM to score its answer, use that score as a reward to update the policy with Proximal Policy Optimisation (PPO), and then repeat.
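
A minimal sketch of the reward-model step, assuming each training example is a prompt with a human-preferred answer and a dispreferred one; `reward_model` is a placeholder for an LLM with a scalar scoring head:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    # reward_model: maps a (batch, seq_len) tensor of token ids to one
    # scalar score per sequence, shape (batch,).
    r_chosen = reward_model(chosen_ids)      # score of the preferred answer
    r_rejected = reward_model(rejected_ids)  # score of the other answer
    # Pairwise ranking loss: push the preferred answer's score higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The RL step itself is more involved than these notes cover: real PPO adds clipped objectives and advantage estimation. The sketch below is a much-simplified, REINFORCE-style update with a KL-style penalty towards the instruction-tuned model; `sample_with_logprob` and `logprob` are hypothetical helper methods on the placeholder policy objects:

```python
import torch

def rl_step(policy, ref_policy, reward_model, prompt_ids, beta=0.1):
    # Sample an answer and the log-probability the current policy assigns it.
    response_ids, logprob = policy.sample_with_logprob(prompt_ids)
    with torch.no_grad():
        # Log-probability of the same answer under the frozen fine-tuned model.
        ref_logprob = ref_policy.logprob(prompt_ids, response_ids)
        # The reward model scores the full prompt + answer sequence.
        reward = reward_model(torch.cat([prompt_ids, response_ids], dim=-1))
    # Penalise drifting too far from the instruction-tuned model.
    shaped_reward = reward - beta * (logprob.detach() - ref_logprob)
    # Policy-gradient objective: raise the log-prob of well-scored answers.
    return -(shaped_reward * logprob).mean()
```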