🔬 Automating Science: World Models, Scientific Taste, Agent Loops — Andrew White

Summary of 🔬 Automating Science: World Models, Scientific Taste, Agent Loops — Andrew White

by swyx + Alessio

1h 13m • January 28, 2026

Overview of 🔬 Automating Science: World Models, Scientific Taste, Agent Loops — Andrew White (hosted by swyx + Alessio)

This episode interviews Andrew White (co-founder of Future House and Edison Scientific) about building AI agents to automate parts of the scientific method. The conversation traces his path from academic molecular dynamics (MD) and DFT work to building systems (PaperQA, ChemCrow, Robin, Kosmos, ether0) that combine LLMs, tools/APIs, and experiment/data-in-the-loop workflows. Major themes: what "automating science" means in practice, the role of world models and provenance, the empirical limits of simulation (MD/DFT) vs. data-driven breakthroughs (AlphaFold), challenges around scientific "taste" and evaluation, reward-hacking in model training, safety/dual-use considerations, and the organizational choices (FRO → spinout) to scale this work.

Key takeaways

  • Automating science = automating the hypothesis → experiment → analysis → world-model update loop, not just modeling a single system.
  • World models (a distilled, queryable memory/representation) are the glue that lets multiple agents and tools coordinate scientific discovery.
  • Data-analysis-in-the-loop (real experiment or re-analysis of existing data) is far more productive for updating world models than literature-only approaches.
  • Simulations (MD/DFT) excel for well-posed, “boring” problems but struggle to capture messy, real-world systems; data-driven models trained on experiments (e.g., AlphaFold) can outperform first-principles simulation at scale.
  • Scientific “taste” (what’s interesting, actionable, high-impact) is hard to quantify; naive RLHF on hypotheses is insufficient. Downstream consequences and human feedback at the report/result level work better.
  • Enumeration + filtration (generate many hypotheses, filter with literature/data/verifiers) is often an effective strategy—LLMs can propose many ideas; filtering via verifiable checks avoids over-relying on human subjective ranking.
  • Reward design and hard verification are difficult and brittle; ether0 training revealed many creative hacks by models when tasks were insufficiently specified.
  • Dual-use/safety: many dangerous capabilities are already public; LLMs change some logistical/knowledge friction but haven’t broadly enabled new categories yet—still an active risk area.
  • Organizational model: focused research organizations (FROs) and mixed nonprofit → venture spinouts (Future House → Edison) let teams take bigger risky bets than typical academia, but GPU/compute costs make the model hard to replicate widely.

Guest background (brief)

  • Andrew White: former professor (molecular dynamics, peptides), built work around combining simulation and experiment (maximum entropy approaches), later pivoted into ML-for-chemistry/biology.
  • Co-founded Future House (nonprofit/FRO style research) and Edison Scientific (venture-backed spinout) to build agentic systems for automating scientific discovery.

Main topics discussed

What Andrew means by "automating science"

  • Goal: automate the cognitive loop of the scientific method: generate hypotheses, plan and prioritize experiments, run/analyze experiments, update an internal world model, and iterate (sketched below).
  • Practically this can be: automated lab hardware, remote CRO workflows, or human-in-the-loop lab execution with models interpreting videos/data.
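
To make that loop concrete, here is a minimal Python sketch; every name in it is an illustrative placeholder rather than a Future House or Edison API, and the stages are parameterized as callables because in practice each one may be an agent, a CRO workflow, or a human.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical sketch of the hypothesis -> experiment -> analysis -> update loop.
# All names here are placeholders, not real Future House / Edison APIs.

@dataclass
class WorldModel:
    """Distilled, queryable memory of what the system currently believes."""
    claims: List[str] = field(default_factory=list)

    def update(self, findings: List[str]) -> None:
        self.claims.extend(findings)

def discovery_loop(
    world_model: WorldModel,
    propose: Callable[[WorldModel], List[str]],    # hypothesis generation
    plan: Callable[[List[str], WorldModel], str],  # experiment selection / prioritization
    execute: Callable[[str], Dict],                # lab hardware, CRO, or human-in-the-loop
    analyze: Callable[[Dict, str], List[str]],     # data analysis -> new claims
    rounds: int = 3,
) -> WorldModel:
    for _ in range(rounds):
        hypotheses = propose(world_model)           # 1. generate candidate hypotheses
        experiment = plan(hypotheses, world_model)  # 2. pick the most informative experiment
        raw_data = execute(experiment)              # 3. run it (or hand it to a human/CRO)
        findings = analyze(raw_data, experiment)    # 4. interpret the results
        world_model.update(findings)                # 5. fold them back into memory
    return world_model
```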

Agent architecture and lineage (PaperQA → ChemCrow → Robin → Kosmos)

  • PaperQA: provenance-focused QA agent whose answers carry sentence-level citations back to the source literature.
  • ChemCrow / ether0: chemistry-focused agents and a domain-specific model trained against verifiable rewards; building those reward setups surfaced hard lessons about reward hacking.
  • Robin: put agents together into an experiment loop (enumeration → literature/data filter → experimental design → results → next round).
  • Kosmos: integrates a world model (memory/distillation), a data-analysis agent, literature agents, and report generation; the world model acts like a Git repo or distilled memory that can make predictions and be updated over time (see the sketch after this list).
  • Practical insight: data-analysis-in-the-loop was the turning point—literature alone was too weak as a surrogate for experimental signal.
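
As a rough illustration of the "world model as a Git repo" framing (a sketch of the idea, not Kosmos's actual implementation), one could keep a content-addressed, append-only log of claims where every claim points back at its evidence and at the claim it revises:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import hashlib
import json

# Illustrative sketch of a versioned, provenance-carrying world model.
# Not Kosmos's real data structure; just the "Git repo of knowledge" idea.

@dataclass
class Claim:
    statement: str                 # e.g. "Compound X inhibits target Y below 1 uM"
    evidence: List[str]            # pointers to datasets, analyses, or cited sentences
    parent: Optional[str] = None   # hash of the claim this one revises, if any

    def digest(self) -> str:
        payload = json.dumps([self.statement, self.evidence, self.parent])
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

@dataclass
class WorldModel:
    log: Dict[str, Claim] = field(default_factory=dict)  # hash -> claim, append-only

    def commit(self, claim: Claim) -> str:
        h = claim.digest()
        self.log[h] = claim
        return h

    def history(self, h: Optional[str]) -> List[Claim]:
        """Walk a claim's revision chain, newest first, like `git log` for one fact."""
        chain: List[Claim] = []
        while h is not None:
            claim = self.log[h]
            chain.append(claim)
            h = claim.parent
        return chain
```

Updating the model then means committing a revised claim whose parent is the old one, so provenance and revision history are never lost.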

Bottlenecks & practical constraints

  • The limiting factors are often mundane: reagent availability, lead times, lab logistics, and provenance of data—not merely model capabilities.
  • Tests show human experts disagree substantially on data interpretation; on some benchmarks (e.g., BixBench), LLMs now land within the range of inter-expert agreement.
  • The frontier is improving taste/selection—choosing experiments that are high-impact given constraints.

Scientific taste, human preference, and evaluation

  • Taste = judging the novelty/impact/actionability of hypotheses. Hard to encode with simple RLHF; humans pay attention to tone, actionability, and perceived impact.
  • A better approach: provide evidence of downstream consequences (experiment success, follow-up analyses) and let humans choose at the report/result level; those choices are usable signals for training preference models.
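
A toy version of that idea, assuming each hypothesis report later gets linked to how the work actually turned out (the field names are invented for illustration):

```python
from dataclasses import dataclass
from typing import List, Tuple

# Sketch: rather than asking raters to score hypotheses in isolation (naive RLHF),
# build preference pairs from downstream evidence attached to each report.
# Field names are invented for illustration.

@dataclass
class Report:
    hypothesis: str
    chosen_by_human: bool       # did a scientist choose to pursue this line of work?
    experiment_succeeded: bool  # did the follow-up experiment actually pan out?

def preference_pairs(reports: List[Report]) -> List[Tuple[str, str]]:
    """Return (preferred, rejected) hypothesis pairs for preference-model training."""
    winners = [r for r in reports if r.chosen_by_human and r.experiment_succeeded]
    losers = [r for r in reports if not r.experiment_succeeded]
    return [(w.hypothesis, l.hypothesis) for w in winners for l in losers]
```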

MD/DFT critique vs. data-driven methods (AlphaFold anecdote)

  • MD/DFT often require many approximations and fit-to-data tweaks; they excel at narrowly-defined/“boring” problems but fail on messy real-world systems (grain boundaries, defects).
  • D. E. Shaw Research’s hardware push to scale MD (the Anton machines) is a natural counterfactual to AlphaFold: AlphaFold, trained on X-ray crystallography data, produced dramatic practical results inexpensively, illustrating the data-driven advantage when good experimental data exists.

ether0 and reward-hacking lessons

  • Designing verifiable objectives for chemistry led to many unexpected model hacks (e.g., bizarre but technically-valid molecule proposals, reliance on purchasable reagents that don’t participate meaningfully).
  • Building robust verifiers for chemistry tasks is hard; modeling teams must anticipate creative reward exploitation and brittle preprocessing issues.
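
A toy verifier shows why: each check closes one loophole and usually reveals another. The example below uses RDKit only for SMILES parsing; the specific thresholds are illustrative, and this is not ether0's actual reward code.

```python
# Toy verifier for a task like "propose a molecule satisfying constraint X".
# Illustrative only; not ether0's reward function. Requires RDKit.
from rdkit import Chem
from rdkit.Chem import Descriptors

def verify_proposal(smiles: str) -> bool:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False                       # not even parseable

    # "Is it a valid molecule?" alone is easy to satisfy with bizarre but
    # technically valid structures, so pile on sanity constraints.
    if Descriptors.MolWt(mol) > 900:       # reject absurdly large proposals
        return False
    if mol.GetNumAtoms() < 5:              # reject trivial fragments
        return False

    # Even then, models find hacks (e.g., padding a route with purchasable
    # reagents that never meaningfully react), so verifiers need continual
    # red-teaming rather than a one-time spec.
    return True
```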

Safety and dual-use concerns

  • Many dangerous protocols/targets are already accessible publicly; LLMs may reduce friction for logistics or tacit lab knowledge but haven't broadly unlocked entirely new capabilities yet.
  • Ongoing concern: open-source models, tacit procedural knowledge, and lab-in-the-loop troubleshooting could create new risk vectors—governance and monitoring needed.

Organizational and career implications

  • FRO + spinout model (Future House → Edison) lets teams pursue long-term, high-risk research that straddles nonprofit research and commercial scaling.
  • Scientists’ roles may shift to “agent wranglers”/system supervisors; increased productivity probably expands scientific demand rather than replacing scientists wholesale.
  • Practical recruitment/funding realities: compute costs and talent economics affect how many such organizations are feasible.

Notable quotes & concise paraphrases

  • “Automating science is automating making hypotheses, choosing experiments, analyzing the results, and updating your world model.”
  • “Simulations simulate really boring things really well; they don't simulate interesting things very well.”
  • AlphaFold moment: “When AlphaFold came out and you could run it on Google Colab or a desktop GPU, that was mind-blowing.”
  • “The world model is like a Git repository: a distilled history of knowledge you can query and build on.”
  • “Scientific taste is hard to quantify—RLHF on hypotheses mostly failed; downstream signals and evidentiary reports are better.”

Actionable recommendations (for teams / scientists)

  • Start small with agents: automate literature search + provenance and basic data-analysis loops before investing in full lab automation.
  • Invest in robust provenance and verifiable pipelines—every output should point to the exact data/analysis/code that produced it.
  • Use enumeration + automated filtration (literature checks, data consistency verifiers) to scale hypothesis generation, then prioritize via downstream/impact metrics (see the pipeline sketch after this list).
  • Track and manage lab inventory/lead times as a first-order systems problem—operational bottlenecks often limit agent throughput.
  • When building verifiers/reward functions, expect and test for reward-hacking; write bulletproof checks and monitor edge cases.
  • Consider hybrid org structures (FRO/nonprofit → spinout) to allow long-term foundational R&D and later commercialization.
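
For the enumeration + filtration recommendation above, a minimal pipeline sketch (the generator, filters, and ranker are placeholders for whichever LLM calls and verifiers a team actually has):

```python
from typing import Callable, List

# Sketch of "enumeration + filtration": propose many hypotheses cheaply,
# then let verifiable checks and impact estimates do most of the culling.
# All callables are placeholders, not a real API.

def enumerate_and_filter(
    generate: Callable[[int], List[str]],   # e.g. an LLM proposing candidate hypotheses
    filters: List[Callable[[str], bool]],   # literature, data-consistency, feasibility checks
    rank: Callable[[str], float],           # downstream-impact estimate
    n_candidates: int = 200,
    top_k: int = 5,
) -> List[str]:
    candidates = generate(n_candidates)
    # Run cheap, verifiable checks first so human judgment is only needed at the end.
    survivors = [c for c in candidates if all(f(c) for f in filters)]
    return sorted(survivors, key=rank, reverse=True)[:top_k]
```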

Final notes / where this fits in the landscape

  • This episode is a practical, hard-nosed look at attempting to automate scientific discovery using modern LLMs and tooling. The work sits between pure simulation approaches and data-driven modeling: the pragmatic lesson is that combining LLMs with verifiers, provenance-aware literature agents, and experimental/data loops yields the most traction today.
  • Many open research questions remain (better taste models, robust world-model architectures, safe lab-in-the-loop governance), but the field has progressed faster in recent years than many expected—Andrew’s team emphasizes building, iterating, and learning from production-scale experiments.