What:

ChatGPT is a Generative Pretrained Transformer. It’s a decoder-only model.

  1. GPT-1 came out in 2018.
  2. GPT-2 (2019) was mostly a scaled-up version: more parameters and more training data, with noticeably more coherent text generation.

Training ChatGPT:

Pretraining LLMs:

  1. Feed the entire internet into the models, getting them (if they’re decoder-only models) to predict the next most likely word (token).
  2. These models will be very good at producing text like what is found on the internet, but not necessarily helpful for answering questions you may have (see the sketch after this list).
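
A minimal sketch of that next-token-prediction objective, assuming a PyTorch-style setup; `model` and `token_ids` are placeholders for whatever decoder-only model and tokenized data you actually use:

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, seq_len) tensor of token ids from any tokenizer.
    # model: placeholder for a decoder-only LM that returns logits of shape
    #        (batch, seq_len - 1, vocab_size) for the inputs below.
    inputs = token_ids[:, :-1]     # the model sees tokens 0 .. n-2
    targets = token_ids[:, 1:]     # and must predict tokens 1 .. n-1
    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and positions
        targets.reshape(-1),                  # the "next word" at each position
    )
```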

Instruction Fine-Tuning:

  1. You get humans to create a corpus of high-quality questions and their answers (i.e. question-answer pairs).
  2. You then fine-tune your pretrained LLM on these pairs, so it learns to produce answers in that style (see the sketch after this list).
    • This is akin to saying: “For a question of this type, respond like this”.
  3. The model can now sort of follow instructions, but it often produces harmful, unhelpful or hallucinated output.
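
A minimal sketch of what that fine-tuning step can look like, again in PyTorch style; `tokenizer` and `model` are placeholders, and the key idea is that the same next-token loss is used, but only the answer tokens are scored:

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # label value that F.cross_entropy skips

def build_sft_example(tokenizer, question, answer):
    # tokenizer.encode is a stand-in for whatever tokenizer you use.
    q_ids = tokenizer.encode(question)
    a_ids = tokenizer.encode(answer)
    input_ids = q_ids + a_ids
    # Only the answer tokens contribute to the loss; the question is context.
    labels = [IGNORE] * len(q_ids) + a_ids
    return torch.tensor(input_ids), torch.tensor(labels)

def sft_loss(model, input_ids, labels):
    logits = model(input_ids.unsqueeze(0))[0]  # (seq_len, vocab_size)
    return F.cross_entropy(
        logits[:-1],         # prediction at each position ...
        labels[1:],          # ... scored against the following token
        ignore_index=IGNORE,
    )
```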

Reinforcement Learning from Human Feedback (RLHF), Oversimplified

  1. Using our instruction fine-tuned model, we produce 2 answers to a prompt, and get a human to rank them against each other according to which is more helpful, honest and harmless.
  2. We take these rankings and train another model to mimic the human annotator/ranker… We’ll call this the reward model.
  3. The reward model can now judge the base model’s outputs according to human preferences. We’ll use that fact to do RL (see the sketches after this list):
    1. We give our model a prompt, ask our RM to score its answer, use that score as a reward to update the policy with Proximal Policy Optimisation (PPO), and then repeat.
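
A minimal sketch of the reward-model step, assuming each training example is a prompt with a human-preferred answer and a dispreferred one; `reward_model` is a placeholder for an LLM with a scalar scoring head:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    # reward_model: maps a (batch, seq_len) tensor of token ids to one
    # scalar score per sequence, shape (batch,).
    r_chosen = reward_model(chosen_ids)      # score of the preferred answer
    r_rejected = reward_model(rejected_ids)  # score of the other answer
    # Pairwise ranking loss: push the preferred answer's score higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The RL step itself is more involved than these notes cover: real PPO adds clipped objectives and advantage estimation. The sketch below is a much-simplified, REINFORCE-style update with a KL-style penalty towards the instruction-tuned model; `sample_with_logprob` and `logprob` are hypothetical helper methods on the placeholder policy objects:

```python
import torch

def rl_step(policy, ref_policy, reward_model, prompt_ids, beta=0.1):
    # Sample an answer and the log-probability the current policy assigns it.
    response_ids, logprob = policy.sample_with_logprob(prompt_ids)
    with torch.no_grad():
        # Log-probability of the same answer under the frozen fine-tuned model.
        ref_logprob = ref_policy.logprob(prompt_ids, response_ids)
        # The reward model scores the full prompt + answer sequence.
        reward = reward_model(torch.cat([prompt_ids, response_ids], dim=-1))
    # Penalise drifting too far from the instruction-tuned model.
    shaped_reward = reward - beta * (logprob.detach() - ref_logprob)
    # Policy-gradient objective: raise the log-prob of well-scored answers.
    return -(shaped_reward * logprob).mean()
```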