What:
ChatGPT is a Generative Pre-trained Transformer (GPT), i.e. a decoder-only model.
- GPT-1 came out in 2018.
- GPT-2 (2019) improved on it mainly by scaling up the model and the training data.
Training ChatGPT:
Pretraining LLMs:
- Feed a huge scrape of the internet into the model, training it (in the decoder-only case) to predict the next most likely token at every position; a minimal sketch of that objective follows this list.
- These models become very good at producing text like what's findable on the internet, but not necessarily at helpfully answering the questions you may have.
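A minimal sketch of that next-token-prediction objective, using a toy PyTorch setup; the tiny embedding-plus-linear "model", vocabulary size, and random token ids are illustrative assumptions, not how any production LLM is actually built or trained.

```python
import torch
import torch.nn as nn

# Toy decoder-style setup: an embedding + linear head standing in for a full
# transformer stack (illustrative only; real models use many attention layers).
vocab_size, d_model = 100, 32
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)

# A batch of token-id sequences standing in for text scraped from the corpus.
tokens = torch.randint(0, vocab_size, (4, 16))           # (batch, seq_len)

# Next-token prediction: the input is every token except the last,
# the target is the same sequence shifted left by one position.
inputs, targets = tokens[:, :-1], tokens[:, 1:]

logits = lm_head(embed(inputs))                           # (batch, seq_len-1, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()  # an optimizer step would follow in the real training loop
```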
Instruction Fine Tuning:
- You get humans to create a corpus of high-quality questions and their answers (i.e. question-answer pairs).
- You then fine-tune your pretrained LLM on these pairs, so that given a question it produces an answer in the same style.
- This is akin to saying: “When asked this type of question, respond like this”.
- The model can now sort of follow instructions, but it still often produces harmful, unhelpful or hallucinated output (a sketch of this supervised step follows this list).
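A minimal sketch of supervised instruction fine-tuning, assuming a Hugging Face-style causal LM; the "gpt2" checkpoint and the single question-answer pair are placeholders, and masking the prompt tokens out of the loss is one common choice, not the only one.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any decoder-only (causal) LM works the same way here.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# One hypothetical question-answer pair from the human-written corpus.
prompt = "Question: What is the capital of France?\nAnswer:"
answer = " Paris is the capital of France."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids

# Standard language-modelling loss over the whole sequence, with the prompt
# tokens masked out (-100) so the model is only penalised on its answer tokens.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=full_ids, labels=labels).loss
loss.backward()  # repeating this over many pairs is the fine-tuning loop
```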
Reinforcement Learning from Human Feedback (RLHF), Oversimplified
- Using our instruction fine-tuned model, we produce 2 answers to a prompt and get a human to rank them against each other according to which is more helpful, honest and harmless.
- We take these rankings and train another model to imitate the human ranker… We’ll call this the reward model (a sketch of that training step follows this list).
- The reward model can now score the policy model’s outputs according to human preferences. We’ll use that fact to do RL:
- We give our model a prompt, have it generate an answer, ask our RM to score that answer, and use that reward with Proximal Policy Optimisation (PPO) to update the policy.
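A minimal sketch of training the reward model from those pairwise human rankings, using a Bradley-Terry style loss; the toy scalar-head architecture and random token ids are assumptions for illustration, not the exact setup used for ChatGPT (where the reward model reuses the fine-tuned LLM's backbone).

```python
import torch
import torch.nn as nn

# Toy reward model: embeds a (prompt + answer) token sequence and outputs one scalar.
vocab_size, d_model = 100, 32

class RewardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.score = nn.Linear(d_model, 1)

    def forward(self, token_ids):                     # (batch, seq_len)
        pooled = self.embed(token_ids).mean(dim=1)    # crude pooling over the sequence
        return self.score(pooled).squeeze(-1)         # (batch,) scalar rewards

reward_model = RewardModel()

# Hypothetical token ids for the answer the human preferred vs. the one they rejected
# (both answers are responses to the same prompt).
chosen = torch.randint(0, vocab_size, (4, 16))
rejected = torch.randint(0, vocab_size, (4, 16))

# Pairwise (Bradley-Terry) loss: push the chosen answer's score above the rejected one's.
loss = -nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)
).mean()
loss.backward()
```

In the PPO step, the reward actually fed back to the policy is typically the reward model’s score minus a KL penalty that keeps the updated policy close to the instruction-tuned model, so it can’t drift into degenerate text that merely games the reward model.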