Overview of The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI
This episode of the Latent Space Pod features Myra Deng and Mark Bissell from Goodfire AI (hosted by swyx + Alessio). Goodfire is an AI research lab focused on mechanistic interpretability (mechinterp) applied to real-world, production use cases. The conversation covers what interpretability means in practice, Goodfire’s product and research approach, a live steering demo on a trillion‑parameter model, production deployments (notably with Rakuten), applications in life sciences (including biomarkers for Alzheimer’s), limits of current methods (SAEs, probes), and calls for design partners and hires. The episode also announces Goodfire’s Series B ($150M at $1.25B valuation).
Key takeaways
- Goodfire’s mission: bring interpretability methods from research into production to design, understand and control AI models safely and effectively.
- Interpretability is broad: not only post-hoc diagnostics, but also model design, training-time interventions, and scalable oversight.
- Practical product work: Goodfire ships APIs and tools (steering, SAE/probe tooling) that can run at frontier model scale.
- Real deployments: Rakuten uses Goodfire for inference‑time guardrails: PII detection and token-level scrubbing across multilingual (English/Japanese) traffic.
- Method tradeoffs (see the probe sketch after this list):
  - SAEs (sparse autoencoders) are powerful unsupervised tools but can be noisy: for some supervised use cases, probes on raw activations outperform SAE features, while SAEs tend to generalize better on noisy datasets.
  - Probes are often more precise for targeted supervised tasks (e.g., removing hallucination behavior).
  - Known issues: feature splitting, feature absorption, and off‑target effects when editing model internals.
- Steering works in real time on very large models (demo on the 1T-parameter Kimi K2 model): editing activation vectors can change style/persona or behavioral tendencies, but achieving high-level capabilities (e.g., legal reasoning) requires more sophisticated interventions and algorithmic breakthroughs.
- Mechinterp is an accessible, productive field: moderate compute budgets can let newcomers experiment and produce meaningful research.
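To make the probe-vs-SAE tradeoff above concrete, here is a minimal, self-contained sketch: the same logistic-regression probe is fit once on raw activations and once on SAE feature activations for a supervised detection task. The shapes, the random "activations," and the untrained SAE encoder are stand-ins for illustration only; this is not Goodfire's tooling.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

d_model, d_sae, n = 768, 4096, 2000                       # assumed hidden size, SAE width, examples
acts = np.random.randn(n, d_model).astype(np.float32)     # stand-in for cached model activations
labels = np.random.randint(0, 2, size=n)                  # stand-in labels (e.g., "did this response hallucinate?")

# Stand-in for a pre-trained SAE encoder: features = relu(x @ W_enc + b_enc)
W_enc = 0.02 * np.random.randn(d_model, d_sae).astype(np.float32)
b_enc = np.zeros(d_sae, dtype=np.float32)
sae_feats = np.maximum(acts @ W_enc + b_enc, 0.0)

for name, X in [("probe on raw activations", acts), ("probe on SAE features", sae_feats)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
    print(f"{name}: held-out AUC = {auc:.3f}")
```

With real activations and labels, which variant wins depends on the task and on label noise, which is exactly the tradeoff described in the takeaways above.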
Topics discussed
- What is interpretability?
  - Multiple definitions across the field; Goodfire uses interpretability as a toolbox for understanding internal representations, designing models, and controlling behavior.
  - Applications span data curation, post‑training editing, training‑time interventions, and model customization.
- Research → Product workflow at Goodfire
  - Talk to customers → identify real-world failure modes → translate to research agenda → iterate on tooling and platform.
- SAE vs probes
  - When SAEs help and where they fall short; probes on raw activations sometimes perform better for supervised detection tasks.
- Steering and equivalence to prompting/in-context learning
  - Research shows quantitative relationships between activation steering and in‑context learning; potential to map steering magnitude → behavioral change (a minimal steering sketch follows this list).
- Production challenges
  - Multilingual tokenization, inability to train on PII (synthetic-to-real transfer), need for token‑level classification, latency/compute tradeoffs.
- Use cases and domains
  - Language, code, reasoning, diffusion/image models, world/video models, robotics, genomics, medical imaging, drug discovery.
- Safety & alignment
  - Goodfire’s view: scalable oversight and interpretability are critical technical paths to safer deployment; engagement with the broader safety/interpretability community is active and collaborative.
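As referenced above, here is a minimal sketch of inference-time activation steering: a vector is added to a chosen layer's hidden states via a forward hook. The model (gpt2 as a small stand-in for the demo's Kimi K2), the layer index, the strength, and the random "steering direction" are all assumptions for illustration; this is not Goodfire's API.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                              # tiny stand-in; the live demo used a ~1T-parameter model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx, alpha = 6, 8.0                        # which block to edit and how hard to push (made up)
steer = torch.randn(model.config.hidden_size)    # stand-in for a learned feature direction
steer = steer / steer.norm()

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 holds the hidden states [batch, seq, d_model].
    hidden = output[0] + alpha * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
try:
    ids = tok("Explain interpretability in one sentence.", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()                              # detach the hook so later calls run unmodified
```

In practice the direction would be a labeled feature (for example, an SAE decoder row) rather than random noise, and the strength is tuned so the persona shifts without breaking behaviors like tool calling.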
Notable demo & technical details
- Real‑time steering demo:
  - Steering vectors were applied to particular layers/features of a 1T-parameter model (Kimi K2) to induce a “Gen Z slang” persona in outputs.
  - Demonstrated that steering can change model demeanor and chain of thought while preserving tool‑calling behavior.
  - Implementation notes: collecting activations, training SAEs, labeling features by inspecting top-firing examples (sometimes automated via LLMs), and applying activation edits at inference (a toy version of this workflow is sketched after this list).
- Practical implications:
  - Steering can be used for stylistic edits, concision, and potentially behavior adjustments (e.g., reducing hallucination), but precision and off‑target risk remain.
  - Probes plus supervised data are often better for targeted behavior removal; SAEs excel in noisy-data scenarios.
- Efficiency advantage:
  - Interpretability-based detectors/probes are often much cheaper and lower-latency than running a separate judging LLM for guardrails.
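A toy version of the workflow in the implementation notes above, with random data standing in for cached activations and assumed sizes (this is not Goodfire's training code): train a small sparse autoencoder with an L1 sparsity penalty, then "label" one feature by its top-firing examples.

```python
import torch
import torch.nn as nn

d_model, d_sae, l1_coeff = 768, 4096, 1e-3        # assumed sizes
acts = torch.randn(20_000, d_model)               # stand-in for activations cached from a model

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)
    def forward(self, x):
        feats = torch.relu(self.enc(x))           # sparse feature activations
        return self.dec(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for step in range(500):                           # toy budget; real runs use vastly more data and steps
    batch = acts[torch.randint(0, acts.shape[0], (1024,))]
    recon, feats = sae(batch)
    loss = (recon - batch).pow(2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# "Label" feature 123 by finding where it fires hardest; in practice these rows map back to
# tokens/text, and an LLM is often asked to propose a human-readable name for the feature.
with torch.no_grad():
    _, feats = sae(acts[:4096])
top_rows = feats[:, 123].topk(10).indices
print("Top-firing examples for feature 123:", top_rows.tolist())
```

An activation edit at inference then amounts to adding (or subtracting) multiples of a labeled feature's decoder row to the layer's hidden state, as in the steering sketch earlier.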
Production use cases and customers
- Rakuten:
  - Goodfire deployed an inference‑time pipeline to detect and scrub PII across queries (English + Japanese), enforcing privacy guardrails and avoiding routing private data to other providers.
  - Challenges included synthetic training data (you can't train on customer PII), multilingual tokenization quirks, and token-level annotation needs (a toy token-level scrubbing sketch follows this list).
- Life sciences & healthcare:
  - Partnerships include Mayo Clinic and Prima Mente (neurodegenerative disease). Goodfire applied interpretability to find novel biomarkers for Alzheimer’s.
  - Use cases: debugging models (detecting spurious correlations), extracting novel, actionable scientific insights, and making domain models more trustworthy for clinical contexts.
- Other domains sought:
  - Reasoning and code models, video/world-model tasks, robotics, materials science, and any narrow foundation models where feedback on model internals is hard to get.
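As flagged above, a toy sketch of token-level PII scrubbing (not Rakuten's or Goodfire's actual pipeline): a small per-token classifier over an encoder's hidden states flags likely-PII tokens, which are replaced before the text leaves the trust boundary. The model name, the untrained classifier head, and the threshold are all assumptions; in a real deployment the head would be trained on synthetic, labeled PII spans.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-multilingual-cased"            # multilingual stand-in (English + Japanese)
tok = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name).eval()
pii_head = nn.Linear(encoder.config.hidden_size, 2)    # per-token classifier: PII vs. not-PII (untrained here)

def scrub(text: str, threshold: float = 0.5) -> str:
    enc = tok(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]             # character span of each token in the original text
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]   # [seq, d_model]
        p_pii = pii_head(hidden).softmax(-1)[:, 1]     # [seq] probability each token is PII
    out, last = [], 0
    for (start, end), p in zip(offsets.tolist(), p_pii.tolist()):
        if start == end:                               # skip special tokens like [CLS]/[SEP]
            continue
        out.append(text[last:start])
        out.append("[REDACTED]" if p > threshold else text[start:end])
        last = end
    out.append(text[last:])
    return "".join(out)

print(scrub("My name is Taro Yamada and my phone is 090-1234-5678."))
```

The same pattern can run directly on the serving model's own activations, which is what makes it far cheaper than routing every request through a separate judge LLM.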
Limitations, open problems & research directions
- Known limitations:
  - SAE unsupervised features can split/absorb concepts; they are not guaranteed to align with the human concepts needed for downstream tasks.
  - Editing internals risks off‑target effects (e.g., reducing hallucination but harming creativity); one simple way to measure such drift is sketched after this list.
  - Many post‑training edits are coarse; surgical control remains an open challenge.
- Open problems and areas to contribute:
  - Better supervised+unsupervised hybrids for feature discovery and labeling.
  - Methods to perform intentional model design during training (not just post‑hoc).
  - Scalable, robust detectors for hallucination and other safety‑critical behaviors.
  - Interpretability tools for non‑language domains (video, world models, genomics).
  - Bridging weak→strong generalization concerns (how to have weaker models help understand/control stronger models).
- Recommended readings & resources mentioned:
  - Lee Sharkey et al., “Open Problems in Mechanistic Interpretability” (overview of community challenges).
  - Paper: “Belief dynamics reveal the dual nature of in‑context learning and activation steering” (on the quantitative relationship between steering and prompting/in‑context learning).
  - Neuronpedia and other visualization tools for neuron/feature inspection.
  - Community programs/fellowships: MATS (ML Alignment & Theory Scholars), Anthropic Fellows, and similar.
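As a concrete illustration of the off-target concern above, here is one simple, hypothetical way to quantify drift: compare next-token distributions on neutral control prompts with and without an activation edit, using KL divergence. The model, layer, strength, direction, and prompts are illustrative stand-ins, not a Goodfire evaluation protocol.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

layer_idx, alpha = 6, 8.0                              # assumed edit location and strength
steer = torch.randn(model.config.hidden_size)
steer = steer / steer.norm()

def hook(module, inputs, output):
    # Add the steering direction to the block's hidden states (element 0 of the output tuple).
    return (output[0] + alpha * steer,) + output[1:]

control_prompts = ["The capital of France is", "2 + 2 equals", "Water boils at"]

def next_token_logprobs(prompt):
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        return F.log_softmax(model(**ids).logits[0, -1], dim=-1)

for prompt in control_prompts:
    base = next_token_logprobs(prompt)                 # unedited distribution
    handle = model.transformer.h[layer_idx].register_forward_hook(hook)
    try:
        edited = next_token_logprobs(prompt)           # distribution with the edit applied
    finally:
        handle.remove()
    kl = F.kl_div(edited, base, log_target=True, reduction="sum")   # KL(base || edited)
    print(f"{prompt!r}: KL drift = {kl.item():.4f}")
```

Large drift on prompts that have nothing to do with the targeted behavior is a red flag that the edit is not as surgical as intended.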
Notable quotes & insights (paraphrased)
- “Interpretability will unlock the next frontier of safe and powerful AI models.” — Goodfire’s working thesis.
- “We want to take interpretability from the research world into the real world.” — emphasis on production impact.
- “SAEs give you a peek into the AI’s mind, but sometimes you wish you saw other things.” — on the pros/cons of unsupervised features.
- “If you have an LLM that’s almost good enough but needs one magical knob to tune, that’s the kind of problem we want to solve.” — use case framing for design partners.
Calls to action / how to engage
- Goodfire is hiring (research scientists, MLEs, product/engineering roles) and seeking design partners across language, reasoning, world models, robotics, and life sciences.
- If you have near‑ready foundation models that need one key improvement (safety, hallucination reduction, token-level control), consider contacting Goodfire as a design partner.
- For newcomers to mechinterp:
  - Read “Open Problems in Mechanistic Interpretability.”
  - Try hands‑on SAE and probe experiments (moderate compute needed; many open-source notebooks exist).
  - Join the mechinterp/ML interpretability communities (Slack/forums, Neuronpedia, academic workshops).
  - Explore fellowship and training programs like MATS.
Resources & links mentioned (to search)
- Goodfire AI (careers / product pages)
- Rakuten guardrail deployment (case example)
- Paper: “Belief dynamics reveal the dual nature of in‑context learning and activation steering”
- Lee Sharkey — Open problems in interpretability
- Neuronpedia visualization tools
- paint.goodfire.ai (Goodfire diffusion demo / art-steering demo)
- MATS / interpretability fellowships and community resources
If you want a one‑sentence summary: Goodfire is building production‑ready mechanistic interpretability tools (SAEs, probes, steering) to make large models understandable and controllable across domains—demonstrated in customer deployments and live demos—while acknowledging current limits and calling for collaborators to scale this work.
