Overview of “From Atari to ChatGPT: How AI Learned to Follow Instructions”
This Linear Digressions episode (hosts Ben Jaffe and Katie Malone) traces the research arc from early reinforcement-learning experiments (Atari, simulated robots) to the arrival of ChatGPT in late 2022. It explains why large pretrained language models (e.g., GPT‑3) that predict the next token needed a new training recipe—reinforcement learning from human feedback (RLHF) and reward modeling—to reliably follow user instructions and behave helpfully. The episode covers key papers, the practical training steps behind InstructGPT/ChatGPT, and the ethical/technical caveats that come with using a small, curated set of human labelers to shape AI behavior.
Key takeaways
- Pretrained LMs (GPT family) are trained to predict the next token; that objective is not the same as “be helpful.” This objective mismatch is the core misalignment problem.
- Bigger models and more data alone do not solve the misalignment; you need a different training signal to get instruction-following behavior.
- Reinforcement learning from human feedback (RLHF) + a learned reward model enabled LMs to generalize helpful behavior across many tasks.
- InstructGPT/ChatGPT resulted from combining supervised fine-tuning, human pairwise preference data, a reward model, and RL — and it performs better at following instructions despite being much smaller than the largest GPT‑3 checkpoints.
- The human labelers’ demographics, guidelines, and inter-annotator disagreement shape the model’s behavior and introduce cultural/ethical biases.
Research timeline (2017–2022)
- 2017 — Paul Christiano et al.: introduced training agents from human preference comparisons (pairwise labels) and the reward-model idea; experiments in Atari and simulated robotics showed the approach could teach complex behaviors with relatively little human labeling.
- 2019 — Early text experiments: applied preference learning to stylistic rewrites of text (e.g., making a passage more positive in sentiment), using GPT‑2-sized models.
- 2020 — Summarization work (Stiennon et al. / related teams): showed models trained with human preferences can outperform supervised baselines on summarization and even surpass human reference summaries on some metrics.
- 2022 — InstructGPT (Ouyang et al.): combined supervised fine-tuning, human-generated ideal responses, pairwise preference collection (~33k comparisons), reward model training, and RL to create models that follow instructions robustly. This is the immediate precursor to ChatGPT.
How InstructGPT / ChatGPT training works (high-level steps)
- Supervised dataset: collect a set of human-written “ideal” responses to sample prompts (OpenAI used ~13k prompt-response pairs).
- Supervised fine-tuning: fine-tune the pretrained model on that dataset to get an initial instruction-following policy.
- Preference data: for many prompts, generate multiple candidate responses, show pairs to human labelers, and collect pairwise preference judgments (~tens of thousands of comparisons).
- Reward model: train a separate model to predict human preferences (the reward model).
- RL optimization: use reinforcement learning (optimize the policy to maximize the reward model score) so the model learns to produce outputs that the reward model—and by extension humans—prefer.
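The reward-modeling step above (training a model to predict which of two responses humans prefer) can be sketched in miniature. The following is a hypothetical toy illustration, not OpenAI's implementation: responses are stood in for by fixed feature vectors, and a linear reward model is fit to pairwise preferences with the Bradley–Terry-style logistic loss commonly used in the RLHF literature. Real systems use a language-model backbone instead of a linear model.

```python
# Toy sketch of reward-model training from pairwise preferences.
# Assumptions (not from the episode): responses are feature vectors,
# the reward model is linear, and a hidden "true" reward direction
# generates perfectly consistent preference labels.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

dim, n_pairs = 5, 200
true_w = rng.normal(size=dim)           # hidden reward used to label pairs
a = rng.normal(size=(n_pairs, dim))     # candidate response A features
b = rng.normal(size=(n_pairs, dim))     # candidate response B features

# Labeling: the candidate with higher hidden reward is "preferred"
a_wins = (a @ true_w > b @ true_w)[:, None]
pref = np.where(a_wins, a, b)
rej = np.where(a_wins, b, a)

# Fit weights w by gradient ascent on the Bradley-Terry log-likelihood:
# maximize log sigmoid(r(preferred) - r(rejected))
w = np.zeros(dim)
lr = 0.1
for _ in range(500):
    margin = (pref - rej) @ w           # predicted reward gap per pair
    grad = ((1.0 - sigmoid(margin))[:, None] * (pref - rej)).mean(axis=0)
    w += lr * grad

# The learned reward should rank the preferred response higher
acc = np.mean((pref - rej) @ w > 0)
print(f"pairwise accuracy of learned reward model: {acc:.2f}")
```

In the full pipeline, the scalar score produced by a reward model like this becomes the optimization target for the RL step, which adjusts the policy toward responses the reward model rates highly.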
Why this approach matters (and what it isn’t)
- It turns the fuzzy notion of “helpful, honest, harmless” into a learnable signal via human comparisons, which humans can make more easily and consistently than scalar ratings.
- The model learns to mimic human tastes in responses, so it will appear conversational and instruction-following rather than just “autocomplete.”
- This pipeline improves output quality and generalization at a far smaller parameter count: InstructGPT variants could be much smaller than the largest GPT‑3 models yet follow instructions better.
- Important caveat: this does not mean the model “reasoned” its way to answers—rather, it learned to produce answers humans prefer. Hallucinations and failure modes remain possible.
Limitations, biases, and ethical considerations
- Labeler influence: OpenAI used a small, curated group (~40 contractors) screened for sensitivity to harmful content and primarily English-speaking. Their judgments heavily influence model behavior.
- Inter-annotator variability: agreement on pairwise choices was about 73% (i.e., ~27% disagreement), highlighting subjective judgments and uncertainty.
- Cultural and language bias: training and labeling focused on English and specific cultural norms, which can embed biases and reduce model quality for other languages/cultures.
- Scale/ownership of feedback: users of these models often become implicit labelers (via A/B or thumbs-style feedback UIs), raising questions about consent and about who benefits from user-provided feedback.
- Misalignment persists: while RLHF improves instruction following, it doesn’t fully solve hallucination, adversarial prompts, or deeper alignment challenges.
Notable quotes / useful lines from the episode
- “Predict the next token” vs. “do what the user wants” — framing of the misalignment problem.
- “Humans are bad at scoring outputs on a 1 to 10 scale… but pretty good if you present them with two options” — core insight behind preference learning.
- “InstructGPT is way smaller… it’s not that it’s smarter, it’s just better at following instructions.” — illustrates that objective/function matters more than raw scale for some capabilities.
Action items & further reading
- If you want the original papers the episode references, look up:
- Paul Christiano et al., “Deep Reinforcement Learning from Human Preferences” (2017 / arXiv)
- Stiennon et al., “Learning to Summarize with Human Feedback” (2020)
- Ouyang et al., “Training language models to follow instructions with human feedback” (OpenAI, 2022) — InstructGPT paper
- Visit LinearDigressions.com (episode show notes) for links and full bibliographic references.
- When using or building systems with RLHF:
- Be explicit about labeler demographics and guidelines.
- Monitor inter-annotator disagreement and collect more data where uncertainty is high.
- Test performance across languages and cultures; don’t assume English-centric behavior generalizes.
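The disagreement-monitoring advice above can be made concrete with a small sketch. This is a hypothetical illustration (the prompt IDs, votes, and the 0.75 cutoff are invented for the example): given several labelers' pairwise choices per prompt, compute the majority-agreement rate and flag low-agreement prompts for additional labeling.

```python
# Sketch of monitoring inter-annotator agreement on pairwise labels.
# All data and the threshold below are invented for illustration.
from collections import Counter

labels = {
    "prompt_1": ["A", "A", "A", "A"],   # clear consensus
    "prompt_2": ["A", "B", "A", "B"],   # split decision
    "prompt_3": ["B", "B", "A", "B"],   # mild disagreement
}

def agreement_rate(votes):
    """Fraction of labelers who voted with the majority."""
    majority_count = Counter(votes).most_common(1)[0][1]
    return majority_count / len(votes)

THRESHOLD = 0.75  # arbitrary cutoff for "collect more labels"
needs_more = [p for p, v in labels.items() if agreement_rate(v) < THRESHOLD]
print(needs_more)  # only prompt_2 (agreement 0.50) falls below the cutoff
```

Routing low-agreement prompts to more labelers (or to guideline revision) targets annotation budget at exactly the cases where the ~73% agreement figure cited in the episode suggests the preference signal is noisiest.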
For readers who don’t want to listen to the full episode: this covers the essence — how the field moved from simulated control tasks to preference-based fine-tuning and RLHF, and why that shift produced the interactive, instruction-following behavior we now see in ChatGPT-style systems.