Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay 2

by swyx + Alessio

1h 32m · January 23, 2026

Overview

This episode is a wide-ranging conversation between swyx, Alessio, and Yi Tay covering Yi’s return to Google/Google DeepMind (GDM), the Gemini “DeepThink” work that won IMO gold, research directions (on-policy RL, self-consistency, retrieval-as-generation via the Differentiable Search Index, or DSI), the practical impact of AI coding tools, data efficiency and world models, and the launch of a Reasoning & AGI presence in Singapore. The discussion mixes technical intuition, team and process stories (the hackathon-like IMO run), and practical career and hiring advice.

Key topics and takeaways

  • On-policy vs off-policy learning (analogy & philosophy)

    • On-policy: the model generates its own trajectories, receives rewards from the environment, and learns from its own behavior; this aligns better with discovery and generalization.
    • Off-policy: imitation / SFT-style learning from others’ trajectories. Useful initially, but it runs into limits.
    • Yi’s life analogy: humans start by imitating, then must move to on‑policy experience to generalize.
    • Practical takeaway: first imitate (pretrain/SFT), then push models on-policy with environment feedback (RL/RLHF) to improve emergent capabilities (a minimal sketch of the two update rules appears after this list).
  • Self-consistency and parallel reasoning

    • Generating multiple reasoning traces and aggregating them (majority voting, model judges) is linked to on-policy distillation ideas.
    • Self-consistency is more nuanced than naive majority voting: internal verification or learned judges can pick better trajectories.
    • Sampling multiple chains during training and inference improves the robustness of reasoning (see the voting sketch after this list).
  • IMO gold (Gemini DeepThink) — story and lessons

    • The effort to get IMO gold was long-running; this year the team opted for an end‑to‑end single-model approach (Gemini DeepThink) rather than multi-component AlphaProof-style pipelines.
    • Yi’s role: training the checkpoint used in the live IMO run. The run felt hackathon-like, and the logistical and inference-scaling challenges were significant.
    • Decision rationale: testing whether one general model could handle this level of reasoning — a probe for broader AGI aspirations.
    • Broader implication: specialized pipelines can be powerful, but pushing a single general model forward is a compelling direction.
  • Retrieval-as-generation / DSI (Differentiable Search Index)

    • DSI family: encode documents/items as semantic ID tokens and have the model decode the identifier directly; applied to retrieval and recommender tasks (see the toy semantic-ID sketch after this list).
    • Generative retrieval has evolved into production‑relevant work (YouTube/Spotify/others exploring semantic IDs).
    • Retrieval (ranking, personalization, freshness) is a huge product problem (“the king AI problem”): high business value, complex metrics, and brittle/underappreciated engineering.
  • Pokemon, RL benchmarks, and long-horizon planning

    • Pokemon Crystal is an interesting long-horizon benchmark: it requires planning, search, web research/trading, and multi-step coordination.
    • Yi notes that completing a Pokédex (no lookup) is a harder problem: it requires planning, trading/cooperation, and possibly interacting with real web forums.
  • AI coding and day-to-day productivity

    • Yi finds AI coding transformative for routine debugging and boilerplate tasks: sometimes the model fixes bugs he would otherwise spend ~20 minutes on.
    • Practical workflow: paste the bug/stack trace into a code assistant and accept fixes confidently once you trust the model.
    • Positioning: AI as a productivity “buff” (a support bard) rather than a direct replacement for junior researchers.
  • Models, parameters, and tool use

    • Debate: how much can be encoded inside model parameters vs how much needs tool use (external verifiers, provers)? Yi leans toward pushing parameterized models far, while accepting tool use where necessary.
    • Open question: where is the boundary of what can be learned inside a single model?
  • Data efficiency and world models

    • Concern that tokens (pretraining data) are limited; research should push data efficiency (learn more per token).
    • Possible levers:
      • Spend more compute per token (more flops per token; see the back-of-envelope sketch after this list).
      • New learning paradigms (continual learning, world-model fitting) that better compress experience.
    • The world-model notion is fuzzy, with multiple interpretations (video/3D world models, execution-state models for code, latent hypothesis/world-resampling models). Yi emphasizes improving learning algorithms and extracting more signal per datapoint.
  • Architecture, transformers, and future directions

    • Transformers / self-attention have been extremely effective and are likely to remain core unless the entire learning paradigm (e.g., backprop) changes.
    • But architecture, learning algorithms, data, compute and engineering all matter together. Ideas still matter; progress isn’t only brute-force scaling.
    • Open questions: long contexts (millions of tokens), continual learning, memory/network bottlenecks vs compute, and whether we’re in a local optimum with transformers.
  • Labs, openness, and competitive dynamics

    • Yi thinks the advantage of closed (private) labs is increasing: proprietary techniques, tuned systems, and compounding research tricks matter.
    • Pretraining is still very relevant; teams continue investing heavily in pretraining despite the RL excitement.
  • GDM Singapore / team building

    • Yi and colleagues launched a Gemini Reasoning & AGI presence in Singapore to build talent, cover different time zones, and inspire the local community.
    • Hiring focus: research taste, strong engineering skill, and RL/reasoning experience or exceptional achievements (competitions/papers). Talent attracts talent; location can matter for recruiting and quality of life.
  • Health & productivity

    • Yi shared a personal note: significant weight loss (23 kg over ~1–1.5 years) and improved HRV and resting heart rate; he links physical wellbeing to sustained research productivity.
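
Illustrative sketches (referenced from the topic list above)

To make the on-policy vs. off-policy distinction concrete, here is a minimal toy sketch, not something from the episode: an SFT-style (off-policy) cross-entropy update that imitates a fixed “expert” action, followed by a REINFORCE-style (on-policy) update where the policy samples its own action and learns from an environment reward. The tiny categorical policy and the reward_fn are assumptions for illustration only.

```python
# Toy contrast of off-policy (imitation/SFT-style) vs. on-policy (REINFORCE-style)
# updates on a one-step "environment" with a 4-way categorical policy.
import torch

torch.manual_seed(0)
NUM_ACTIONS = 4
logits = torch.zeros(NUM_ACTIONS, requires_grad=True)  # the entire "policy"
opt = torch.optim.SGD([logits], lr=0.1)

def reward_fn(action: int) -> float:
    # Hypothetical environment: only action 2 is rewarded.
    return 1.0 if action == 2 else 0.0

# Off-policy / SFT-style: imitate someone else's (fixed) trajectory.
expert_action = torch.tensor(2)
for _ in range(50):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(
        logits.unsqueeze(0), expert_action.unsqueeze(0))
    loss.backward()
    opt.step()

# On-policy / REINFORCE-style: sample your own action, learn from the reward.
for _ in range(200):
    opt.zero_grad()
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                  # the model's own behavior
    reward = reward_fn(action.item())       # feedback from the environment
    loss = -dist.log_prob(action) * reward  # REINFORCE gradient estimator
    loss.backward()
    opt.step()

print(torch.softmax(logits, dim=-1))  # probability mass should concentrate on action 2
```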
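
The self-consistency point can be sketched in a few lines. This is a hypothetical helper, not an API from the episode: sample several chains, extract a final answer from each, and aggregate either by naive majority vote or by letting a learned judge score whole trajectories.

```python
# Minimal self-consistency sketch; sample_chain, extract_answer, and judge_score
# are hypothetical stand-ins for whatever model calls you actually have.
from collections import Counter
from typing import Callable, List, Optional, Tuple

def self_consistent_answer(
    prompt: str,
    sample_chain: Callable[[str], str],    # samples one reasoning trace
    extract_answer: Callable[[str], str],  # pulls the final answer out of a trace
    judge_score: Optional[Callable[[str, str], float]] = None,  # optional learned judge
    n_samples: int = 8,
) -> str:
    chains = [sample_chain(prompt) for _ in range(n_samples)]
    answers = [extract_answer(c) for c in chains]

    if judge_score is None:
        # Naive self-consistency: majority vote over final answers.
        return Counter(answers).most_common(1)[0][0]

    # Judge-based aggregation: score each trajectory and return the answer of the
    # best-scoring one (closer to the "learned verifier" idea mentioned above).
    scored: List[Tuple[float, str]] = [
        (judge_score(prompt, chain), answer)
        for chain, answer in zip(chains, answers)
    ]
    return max(scored)[1]
```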
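
For the DSI / semantic-ID idea, here is a toy data-side sketch. Real systems learn document representations and identifiers jointly; the clustering of random vectors below is purely an assumption to show the shape of the mapping from documents to semantic-ID token strings. Training the seq2seq decoder on (query, semantic ID) pairs is omitted.

```python
# Toy semantic IDs for generative retrieval: cluster "document embeddings" into a
# two-level code, so each document gets a short identifier the model can decode.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 64))  # stand-in for learned doc embeddings

coarse = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(doc_embeddings)
doc_to_semantic_id = {}
for c in range(10):
    idx = np.where(coarse == c)[0]
    fine = KMeans(n_clusters=min(10, len(idx)), n_init=10,
                  random_state=0).fit_predict(doc_embeddings[idx])
    for doc, f in zip(idx, fine):
        # Each document is identified by a short sequence of "semantic" tokens.
        doc_to_semantic_id[int(doc)] = f"<c{c}> <f{f}>"

# A generative retriever is then trained seq2seq-style on (query -> semantic ID)
# pairs and decodes the identifier of the relevant document/item directly.
print(doc_to_semantic_id[0], doc_to_semantic_id[1])
```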
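
Finally, the “more flops per token” lever can be made concrete with the common ~6·N FLOPs-per-trained-token approximation for dense transformers (an approximation, with hypothetical budget numbers, not figures from the episode): for a fixed compute budget, spending more compute on each token means seeing fewer tokens, which is why per-token data efficiency matters once the token supply itself becomes the bottleneck.

```python
# Back-of-envelope: training FLOPs per token for a dense transformer is commonly
# approximated as ~6 * N (N = parameter count), so for a fixed compute budget C
# the number of tokens you can afford to train on scales as C / (6 * N).
def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    return compute_flops / (6 * n_params)

BUDGET = 1e24  # hypothetical training budget in FLOPs
for n_params in (1e9, 1e10, 1e11):
    tokens = tokens_for_budget(BUDGET, n_params)
    print(f"{n_params:.0e} params -> ~{tokens:.2e} trainable tokens")
```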

Notable quotes & pithy insights

  • “If the model can’t get to IMO gold, then can we get to AGI?” — framing IMO as a probe for model generality.
  • “On-policy is basically… you generate your own outputs and then learn from the reward given by the environment.” — succinct on-policy definition.
  • “AI coding has started to become the point where I run a job, I get a bug, I almost don’t look at the bug — I throw it into the model and it fixes it.” — practical productivity shift.
  • “Retrieval is the God problem: ranking, filtering, personalization, re-indexing — it’s where the money is.” — on business importance of retrieval.
  • “Ideas matter — we’re not yet in diminishing returns on ideas.” — optimism about future research breakthroughs.

Actionable recommendations (for different audiences)

  • For researchers:

    • Explore on-policy RL setups for improving long-horizon reasoning and decision-making capabilities.
    • Experiment with self-consistency / multi-chain sampling + learned judges rather than naive majority voting.
    • Invest time into data‑efficiency research: flops-per-token tradeoffs, continual learning, and world-model formulations.
    • Try DSI-style semantic ID decoders for retrieval / recommender experiments.
  • For engineers and practitioners:

    • Embrace AI coding assistants for repetitive debugging, plotting, and glue-code tasks — treat them as productivity multipliers.
    • When adopting LLMs for retrieval or recommender problems, start with small semantic ID / gen-retrieval prototypes and evaluate online metrics carefully.
  • For students / job-seekers:

    • Demonstrate research taste (well-chosen problems, crisp experiments) and execution. Publish good work online — it can lead to direct recruitment contact.
    • Strong engineering and coding skills remain highly valuable; RL/reasoning experience is a plus.

Open questions & debates highlighted

  • How far can a single large model (parameters-only) subsume tool-like capabilities (e.g., provers, verifiers)?
  • Are transformers/self-attention the final architectural substrate for AGI, or will a new paradigm be needed for orders-of-magnitude longer context/continual learning?
  • Where exactly is the “bug” in data efficiency compared to human learners? Is it the learning algorithm, the architecture, compute allocation per token, world models, or a combination?
  • Will closed lab advantage continue to grow, and what does that mean for open research and reproducibility?

Quick summary (3 bullet takeaways)

  • Push models from imitation (SFT) to on‑policy experience and use multi-trace verification to make reasoning more robust; this approach helped Gemini DeepThink reach IMO gold.
  • Retrieval (DSI/generative retrieval) and data-efficient learning are major, industrially important frontiers: both research-rich and high-value.
  • Practical AI impact is here: AI coding is changing workflows and productivity, but core scientific questions (architectures, continual learning, world models) remain wide open.