Overview of Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay 2
This episode is a wide-ranging conversation between swyx, Alessio and Yi Tay covering Yi’s return to Google/Google DeepMind (GDM), the Gemini “Deep Think” work that won IMO gold, research directions (on-policy RL, self-consistency, retrieval-as-generation / DSI), practical impacts of AI tools (AI coding), data efficiency and world models, and launching a Reasoning & AGI presence in Singapore. The discussion mixes technical intuition, team/process stories (the hackathon-like IMO run), and practical career/hiring advice.
Key topics and takeaways
On-policy vs off-policy learning (analogy & philosophy)
- On-policy: models generate trajectories, receive environment rewards and learn from their own behavior — better aligns with discovery and generalization.
- Off-policy: imitation / SFT-style learning (learning from others’ trajectories). Useful initially but has limits.
- Yi’s life analogy: humans start by imitating, then must move to on‑policy experience to generalize.
- Practical takeaway: first imitate (pretrain/SFT), then push models on-policy with environment feedback (RL/RLHF) to improve emergent capabilities; a toy contrast of the two update rules is sketched below.
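
A toy contrast of the two update rules, purely illustrative and not from the episode: the three-action "environment", reward values, and demonstrations below are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS = 3
logits = np.zeros(NUM_ACTIONS)      # tiny "policy": softmax over three actions
reward = np.array([0.0, 0.2, 1.0])  # hidden per-action reward from the environment
demos = [1, 1, 2, 1]                # imperfect demonstrations (off-policy data)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Off-policy / SFT-style: imitate the demonstrations, ignoring reward entirely.
for a in demos:
    probs = softmax(logits)
    grad = -probs
    grad[a] += 1.0                  # gradient of log p(a) w.r.t. the logits
    logits += 0.5 * grad            # maximize likelihood of the demonstrated action

# On-policy / REINFORCE-style: sample your own actions, learn from the reward.
for _ in range(200):
    probs = softmax(logits)
    a = rng.choice(NUM_ACTIONS, p=probs)  # generate your own "trajectory"
    grad = -probs
    grad[a] += 1.0
    logits += 0.1 * reward[a] * grad      # reinforce actions the environment rewards

print("final policy:", softmax(logits).round(3))  # mass should drift toward the high-reward action
```

The imitation phase locks onto whatever the demonstrations did; the on-policy phase shifts toward what the environment actually rewards, which is the generalization point being made.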
Self-consistency and parallel reasoning
- Generating multiple reasoning traces and aggregating (majority voting, model judges) is linked to on-policy distillation ideas.
- Self-consistency is more nuanced than naive majority voting: internal verification or learned judges can pick better trajectories (see the sketch after this list).
- Sampling multiple chains during training/inference helps robustness of reasoning.
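
A minimal sketch of both selection strategies; `sample_chain` and `judge_score` are hypothetical stand-ins for a real model call and a learned judge, and the `Answer:` pattern is just an assumed output format.

```python
from collections import Counter
import re

def extract_answer(chain: str) -> str:
    """Pull the final answer out of a reasoning trace, e.g. '... Answer: 42'."""
    m = re.search(r"Answer:\s*(\S+)", chain)
    return m.group(1) if m else ""

def self_consistency(prompt, sample_chain, k=8):
    """Naive majority vote over k independently sampled reasoning chains."""
    answers = [extract_answer(sample_chain(prompt)) for _ in range(k)]
    votes = Counter(a for a in answers if a)
    return votes.most_common(1)[0][0] if votes else ""

def judged_selection(prompt, sample_chain, judge_score, k=8):
    """Let a learned judge pick the best chain instead of counting votes."""
    chains = [sample_chain(prompt) for _ in range(k)]
    best = max(chains, key=lambda c: judge_score(prompt, c))
    return extract_answer(best)
```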
IMO gold (Gemini Deep Think): story and lessons
- The effort to get IMO gold was long-running; this year the team opted for an end-to-end single-model approach (Gemini Deep Think) rather than multi-component AlphaProof-style pipelines.
- Yi’s role: training the checkpoint used in the live IMO run; the run felt hackathon-like, with significant logistical and inference-scaling challenges.
- Decision rationale: testing whether one general model could handle this level of reasoning — a probe for broader AGI aspirations.
- Broader implication: specialized pipelines can be powerful, but pushing a single general model forward is a compelling direction.
Retrieval-as-generation / DSI (Differentiable Search Index)
- DSI family: encode documents/items as semantic ID tokens and have the model decode the identifier directly; applied to retrieval and recommender tasks (a toy data-layout sketch follows this list).
- Generative retrieval has evolved into production‑relevant work (YouTube/Spotify/others exploring semantic IDs).
- Retrieval (ranking, personalization, freshness) is a huge product problem (“the king AI problem”): high business value, complex metrics, and brittle/underappreciated engineering.
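
An illustrative sketch of the generative-retrieval data layout, a simplification rather than the exact DSI recipe: each document gets a short semantic-ID token string, and a seq2seq model would be fine-tuned to decode that ID from either the document text ("indexing") or a query ("retrieval"). The corpus, IDs, and queries below are invented.

```python
corpus = {
    "d1": "how to train a transformer from scratch",
    "d2": "transformer inference optimization tricks",
    "d3": "best hiking trails near singapore",
}

# Assume semantic IDs come from hierarchically clustering document embeddings,
# so similar documents share prefixes; here they are simply hand-picked.
semantic_id = {"d1": "<3><1><7>", "d2": "<3><1><2>", "d3": "<9><4><0>"}

def training_pairs(queries_per_doc):
    """Yield (input text, semantic ID string) pairs for seq2seq fine-tuning."""
    for doc, text in corpus.items():
        yield text, semantic_id[doc]          # "indexing" examples: doc text -> ID
    for doc, queries in queries_per_doc.items():
        for q in queries:
            yield q, semantic_id[doc]         # "retrieval" examples: query -> ID

pairs = list(training_pairs({
    "d1": ["train transformer tutorial"],
    "d2": ["speed up transformer inference"],
    "d3": ["hiking in singapore"],
}))
print(pairs)
# At inference, decoding is constrained to valid IDs, so a new query like
# "faster attention kernels" should ideally decode to "<3><1><2>".
```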
Pokemon, RL benchmarks, and long-horizon planning
- Pokemon Crystal is an interesting long-horizon benchmark: it requires planning, search, web research/trading, and multi-step coordination.
- Yi notes that completing a full Pokédex without external lookups is an even harder problem: it requires planning, trading/cooperation, and possibly interacting with real web forums.
AI coding and day-to-day productivity
- Yi finds AI coding transformative for routine debugging and boilerplate tasks: sometimes the model fixes bugs he would otherwise spend ~20 minutes on.
- Practical workflow: paste the bug/stack trace into a code assistant, then accept fixes confidently once you trust the model.
- Positioning: AI as a productivity “buff” (support bard) rather than direct replacement for junior researchers.
Models, parameters, and tool use
- Debate: how much can be encoded inside model parameters vs how much needs tool use (external verifiers, provers)? Yi leans toward pushing parameterized models far, while accepting tool use where necessary.
- Open question: where is the boundary of what can be learned inside a single model?
Data efficiency and world models
- Concern that tokens (pretraining data) are limited; research should push data efficiency (learn more per token).
- Possible levers:
  - Spend more compute per token (more FLOPs per token); a back-of-envelope sketch follows this list.
  - New learning paradigms (continual learning, world-model fitting) that better compress experience.
- The “world model” notion is fuzzy, with multiple interpretations (video/3D world models, execution-state models for code, latent hypothesis/world-resampling models); Yi emphasizes improving learning algorithms and extracting more signal per datapoint.
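
One way to make the “more FLOPs per token” lever concrete is the common C ≈ 6·N·D rule of thumb for dense-transformer training compute (N parameters, D training tokens). The arithmetic below is a back-of-envelope illustration using that standard heuristic; the budget and model sizes are assumptions, not numbers from the episode.

```python
# For a fixed compute budget C, spending more FLOPs per token means a larger
# model (or repeated passes) over fewer unique tokens, since C ~= 6 * N * D.
BUDGET_FLOPS = 1e24  # illustrative training budget

for params in (7e9, 70e9, 400e9):
    flops_per_token = 6 * params
    tokens = BUDGET_FLOPS / flops_per_token
    print(f"{params/1e9:>4.0f}B params -> ~{flops_per_token:.1e} FLOPs/token "
          f"-> ~{tokens/1e12:.1f}T tokens")
```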
Architecture, transformers, and future directions
- Transformers / self-attention have been extremely effective and are likely to remain core unless the entire learning paradigm (e.g., backprop) changes.
- But architecture, learning algorithms, data, compute and engineering all matter together. Ideas still matter; progress isn’t only brute-force scaling.
- Open questions: long contexts (millions of tokens), continual learning, memory/network bottlenecks vs compute, and whether we’re in a local optimum with transformers.
Labs, openness, and competitive dynamics
- Yi thinks the advantage of closed (private) labs is increasing: proprietary tricks, tuned systems, and compounding research know-how matter.
- Pretraining is still very relevant; teams continue investing heavily in pretraining despite the RL excitement.
GDM Singapore / team building
- Yi and colleagues launched a Gemini Reasoning & AGI presence in Singapore to build talent, cover different time zones, and inspire the local community.
- Hiring focus: research taste, strong engineering skill, RL/reasoning or exceptional achievements (competitions/papers). Talent attracts talent; location can matter for recruiting and quality of life.
Health & productivity
- Yi shared a personal note: significant weight loss (23 kg over ~1–1.5 years) and improved HRV and resting heart rate; he links physical wellbeing to sustained research productivity.
Notable quotes & pithy insights
- “If the model can’t get to IMO gold, then can we get to AGI?” — framing IMO as a probe for model generality.
- “On-policy is basically… you generate your own outputs and then learn from the reward given by the environment.” — succinct on-policy definition.
- “AI coding has started to become the point where I run a job, I get a bug, I almost don’t look at the bug — I throw it into the model and it fixes it.” — practical productivity shift.
- “Retrieval is the God problem: ranking, filtering, personalization, re-indexing — it’s where the money is.” — on business importance of retrieval.
- “Ideas matter — we’re not yet in diminishing returns on ideas.” — optimism about future research breakthroughs.
Actionable recommendations (for different audiences)
For researchers:
- Explore on-policy RL setups for improving long-horizon reasoning and decision-making capabilities.
- Experiment with self-consistency / multi-chain sampling + learned judges rather than naive majority voting.
- Invest time into data‑efficiency research: flops-per-token tradeoffs, continual learning, and world-model formulations.
- Try DSI-style semantic ID decoders for retrieval / recommender experiments.
For engineers and practitioners:
- Embrace AI coding assistants for repetitive debugging, plotting, and glue-code tasks — treat them as productivity multipliers.
- When adopting LLMs for retrieval or recommender problems, start with small semantic ID / gen-retrieval prototypes and evaluate online metrics carefully.
For students / job-seekers:
- Demonstrate research taste (well-chosen problems, crisp experiments) and execution. Publish good work online — it can lead to direct recruitment contact.
- Strong engineering and coding skills remain highly valuable; RL/reasoning experience is a plus.
Open questions & debates highlighted
- How far can a single large model (parameters-only) subsume tool-like capabilities (e.g., provers, verifiers)?
- Are transformers/self-attention the final architectural substrate for AGI, or will a new paradigm be needed for orders-of-magnitude longer context/continual learning?
- Where exactly is the “bug” in data efficiency compared to human learners? Is it learning algorithm, architecture, compute allocation per token, world models, or a combination?
- Will closed lab advantage continue to grow, and what does that mean for open research and reproducibility?
Quick summary (3 bullet takeaways)
- Push models from imitation (SFT) to on‑policy experience and use multi-trace verification to improve reasoning robustness; this helped Gemini Deep Think reach IMO gold.
- Retrieval (DSI/gen‑retrieval) and data‑efficient learning are major, industrially-important frontiers — both research‑rich and high-value.
- Practical AI impact is here: AI coding is changing workflows and productivity, but core scientific questions (architectures, continual learning, world models) remain wide open.
