Sergey Levine - Building LLMs for the Physical World - [Invest Like the Best, EP.465]


by Colossus | Investing & Business Podcasts

1h 6m, March 31, 2026

Overview of Sergey Levine — Building LLMs for the Physical World (Invest Like the Best, Ep.465)

This episode is an in-depth conversation between Patrick O’Shaughnessy and Sergey Levine (co-founder and researcher at Physical Intelligence) about building "robotic foundation models": large, general-purpose models that control physical devices across many tasks and embodiments. Levine explains why a general, learning-first approach (web-scale multimodal pretraining plus robot data and reinforcement learning) is preferable to narrowly engineered robots. He also covers the technical progress to date, the major remaining challenges (common sense, long-tail robustness, safety and acceptance), the business and deployment implications, and what researchers and entrepreneurs should focus on next.

Key takeaways

  • Physical intelligence = building foundation models that can control many kinds of embodied systems to perform many tasks, analogous to LLMs for language.
  • The preferred path: large-scale multimodal pretraining (text → images → robot data) + chain-of-thought-style mid-level reasoning + reinforcement learning for continual improvement.
  • Generality matters because it lets new tasks be learned with far less data per task and enables rapid experimentation and form‑factor innovation.
  • Major technical bottlenecks remain around mid-level reasoning / common sense, long-tail edge cases, and safe, trustable deployment in human environments.
  • Practical progress is real: Levine reports surprising dexterity gains, cross-embodiment transfer, and that many everyday tasks can already be handled by their models.
  • Important non-technical factors: human acceptance, deployment strategy (teleoperation vs autonomous data collection vs shared autonomy), and operations for continuous data collection and improvement.

What Sergey Levine means by "Physical Intelligence"

  • A foundation-model approach for embodied agents: models that can take language and perception inputs and output actions for robots across many bodies and tasks.
  • Goal: replicate the human ability to generalize physical skills (e.g., transferring physical intuition and composing known skills) so new tasks are learned quickly and robustly.
  • Hypothesis: solving in full generality may be easier and more powerful than building many narrowly engineered specialists.

Technical approach and methods

  • Vision-Language-Action Models:

    • Pipeline: pretrain on text, adapt with image data (web-scale multimodal), then adapt to robot control with diverse robot data.
    • These models combine web knowledge with embodied control capabilities.
  • Chain-of-thought / mid-level reasoning:

    • Before acting, the model generates intermediate semantic steps ("think" about what to do), which unlocks web-derived common-sense knowledge and improves handling of unusual situations.
    • Supervising models with high-level semantic labels (not just low-level motion data) improves generalization.
  • Reinforcement learning:

    • Used for continual improvement and to reach speed and robustness beyond what the initial imitation/demonstration data provides (e.g., the espresso demo improved through practice).
  • Sensors and embodiment:

    • Surprisingly minimal sensing can work: one example platform had three cameras (wrist and base) and no force or touch sensors; visual cues can partially substitute for tactile input.
    • Low-cost hardware is now viable; cheaper arms enable experimentation and scale.
  • Data strategy:

    • There is no single known threshold for how much robot data is "enough." The key is to reach a level of utility at which deployed systems can collect more open-world data themselves (bootstrap / activation energy).
    • Debate: real-data-heavy approaches (manipulation) vs simulation-heavy approaches (humanoids/locomotion). Both may remain relevant or be synthesized.
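
The "think before acting" idea described above can be sketched as a two-stage control step. This is a minimal illustration under stated assumptions, not Physical Intelligence's implementation: the names Observation, think_then_act, toy_reasoner, and toy_controller are hypothetical, and the toy functions stand in for a real vision-language-action model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Observation:
    """Hypothetical observation: text stands in for camera images."""
    scene: str
    instruction: str


# Hypothetical interfaces: a reasoner maps an observation to a short semantic
# subtask; a controller maps (observation, subtask) to a low-level action.
Reasoner = Callable[[Observation], str]
Controller = Callable[[Observation, str], List[float]]


def think_then_act(
    obs: Observation, reason: Reasoner, control: Controller
) -> Tuple[str, List[float]]:
    """Generate a mid-level subtask first, then condition the action on it."""
    subtask = reason(obs)           # semantic step, e.g. "scrub the pan"
    action = control(obs, subtask)  # low-level command conditioned on the subtask
    return subtask, action


# Toy stand-ins so the sketch runs without any model.
def toy_reasoner(obs: Observation) -> str:
    if "greasy pan" in obs.scene:
        return "scrub the pan"
    return "look for the next object"


def toy_controller(obs: Observation, subtask: str) -> List[float]:
    # A real policy would emit joint or end-effector commands; this placeholder
    # vector depends only on the chosen subtask.
    return [1.0, 0.0] if subtask == "scrub the pan" else [0.0, 0.0]


obs = Observation(scene="a greasy pan in the sink", instruction="clean the kitchen")
subtask, action = think_then_act(obs, toy_reasoner, toy_controller)
print(subtask, action)
```

Separating the semantic step from the action step is what allows the intermediate text to be supervised with language labels, independently of low-level motion data.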

Progress, surprises, and empirical findings

  • Surprising dexterity: Levine was surprised how quickly general models improved in dexterous tasks without bespoke methods.
  • Cross-embodiment transfer: the same models have been adapted to multi-fingered hands and to robots with different degrees of freedom (DOF) without changing the model architecture.
  • Mid-level supervision matters: adding semantic coaching labels helps generalize to new kitchens/environments without more low-level teleoperation.
  • Robot Olympics / task onboarding: used as tests of their onboarding pipeline, many everyday tasks (opening doors, washing greasy pans, folding laundry, etc.) proved solvable; the exceptions came from hardware limits (e.g., gripper size).
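
One simple way a single architecture can serve robots with different numbers of degrees of freedom is to pad every action vector to a shared maximum size and keep a mask of the real dimensions. The sketch below illustrates that general idea only; the episode summary does not say this is how Levine's team does it, and MAX_ACTION_DIM, pad_action, and unpad_action are hypothetical names.

```python
from typing import List, Sequence, Tuple

MAX_ACTION_DIM = 8  # hypothetical upper bound across all supported robots


def pad_action(
    action: Sequence[float], max_dim: int = MAX_ACTION_DIM
) -> Tuple[List[float], List[bool]]:
    """Zero-pad an action to max_dim and return a mask of the real entries."""
    if len(action) > max_dim:
        raise ValueError(f"action has {len(action)} dims, max is {max_dim}")
    pad = max_dim - len(action)
    padded = list(action) + [0.0] * pad
    mask = [True] * len(action) + [False] * pad
    return padded, mask


def unpad_action(padded: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Recover the robot-specific action by dropping the padded entries."""
    return [a for a, keep in zip(padded, mask) if keep]


# A 3-DOF arm and a 6-DOF arm share the same fixed-size action format.
small_padded, small_mask = pad_action([0.1, 0.2, 0.3])
large_padded, large_mask = pad_action([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
print(len(small_padded), len(large_padded))
print(unpad_action(small_padded, small_mask))
```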

Major challenges and open problems

  • Long-tail / edge-case robustness: physical environments are open-ended; robots must handle rare, unexpected situations sensibly.
  • Common-sense grounding: multimodal LLMs have world knowledge but need appropriate grounding and context for embodied action.
  • Safety and human acceptance: even if robots learn, will people accept imperfect behavior in homes (children, fragile objects)? Some domains require higher certainty.
  • Representation for mid-level reasoning: what internal representations (spatial, semantic, hybrid) are optimal for long-horizon embodied tasks?
  • Form-factor & manufacturing risk: even if the AI works, making the right hardware at scale and integrating it into a business is nontrivial.
  • Moral/ethical/social complexity: tasks involving care (child/elder assistance) are especially hard and ethically sensitive.

Applications and implications

  • Wide range of potential endpoints: household chores, hospitality (hotel/restaurant), industrial construction, surgical/medical micro-robots, swarms (e.g., quadcopters), and more.
  • Foundation models lower the barrier to robot product experimentation — analogous to how PCs/LLMs enabled software/coding innovation.
  • Early deployment likely to be hybrid: shared autonomy, teleoperation + coaching, or constrained environments (hotels, restaurants) where data collection is feasible and acceptance manageable.
  • Jobs & productivity: robots are more likely to augment humans (increasing productivity) than to replace them wholesale in the near term; expect mixed complementarities similar to those seen with coding tools.

Business, deployment & product strategy insights

  • Activation energy is key: get to the point where deployed robots are useful and can collect more data.
  • Two competing data-collection models:
    • Teleoperation/demonstrations (lots of labeled human data)
    • Autonomous/self-supervised RL (robots gather huge amounts of experience)
  • The right mix of the two (90/10 or 10/90) is still an open question, and the answer changes how companies should prepare.
  • Entrepreneurs should focus on clear economics of labor, data ops, and the right domain constraints that enable early, safe deployments.
  • Manufacturing and scale matter, but they become tractable once software uncertainty is reduced by a general foundation model.
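
The teleoperation-versus-autonomous mix can be treated as a single tunable ratio when assembling training batches. The following is a minimal sketch under that assumption; sample_mixed_batch and the episode labels are hypothetical, and the 0.9 fraction is just one of the two extremes mentioned above, not a recommended setting.

```python
import random
from typing import List, Sequence, Tuple


def sample_mixed_batch(
    teleop: Sequence[str],
    autonomous: Sequence[str],
    teleop_fraction: float,
    batch_size: int,
    rng: random.Random,
) -> List[Tuple[str, str]]:
    """Draw a batch with a fixed share of teleoperated vs autonomous episodes."""
    if not 0.0 <= teleop_fraction <= 1.0:
        raise ValueError("teleop_fraction must be between 0 and 1")
    n_teleop = round(batch_size * teleop_fraction)
    n_auto = batch_size - n_teleop
    # Sample with replacement from each pool, tagging every item with its source.
    batch = [("teleop", rng.choice(teleop)) for _ in range(n_teleop)]
    batch += [("autonomous", rng.choice(autonomous)) for _ in range(n_auto)]
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
batch = sample_mixed_batch(
    teleop=["demo_a", "demo_b"],
    autonomous=["rollout_a", "rollout_b"],
    teleop_fraction=0.9,  # the "90/10" end of the range
    batch_size=10,
    rng=rng,
)
n_teleop = sum(1 for source, _ in batch if source == "teleop")
print(n_teleop, len(batch) - n_teleop)
```

Keeping the ratio as one explicit parameter makes it cheap to revisit the 90/10 versus 10/90 question as more autonomous data becomes available.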

Research culture, how progress happens, and what makes a good researcher

  • Progress is iterative: many experiments, failures, and demos contribute; breakthroughs often come from unexpected pivots (e.g., coaching labels helping generalization).
  • Key research judgment: when to persist vs when to pivot — timing and persistence differentiate great researchers.
  • Labs that empower pet experiments and rapid prototyping (culture of experimentation) can create outsized breakthroughs.
  • Demos are valuable if they honestly illustrate capabilities and constraints; they help set imagination and research targets.

Notable quotes and succinct insights

  • "Physical intelligence is one problem, not many different problems. The foundation model should figure out how to manipulate whatever body it's controlling."
  • "The bottleneck had shifted from low-level action to a middle level — interpreting the scene and selecting the correct next step — which can be supervised with language."
  • "If you get a system useful enough that it can go into the world and gather more data itself, you don't need to know exactly how much data is 'enough' today."
  • "Changing a child's diaper will be really, really hard — it's the pinnacle of Moravec's paradox."

Actionable recommendations

For entrepreneurs and product teams

  • Start by identifying narrow, high-value domains that allow controlled deployments and data collection (restaurants, hotels, warehouses).
  • Design for a data-collection plan (teleop, shared autonomy, or autonomous RL) aligned with your thesis about how the tech will progress.
  • Consider building around a general model (or partnering with providers) rather than hard-engineering every task.

For investors

  • Look for teams with explicit plans for continuous data collection and improvement (operations + learning loop).
  • Favor companies that prioritize generality and "improvability" (ability to get better with data/autonomy).
  • Pay attention to hardware cost trends and partnerships that reduce manufacturing risk.

For researchers

  • Focus on mid-level representations and methods that ground LLM knowledge into embodied, long-horizon control.
  • Build pipelines that make onboarding new tasks and form factors inexpensive (so general systems can be stress-tested widely).
  • Embrace experimentation and be mindful of timing — both persistence and willingness to pivot matter.

Where to follow developments

  • Read research papers and check conference proceedings (ICRA, ICLR, NeurIPS, CoRL) for technical details.
  • Watch demos and videos critically — dig into associated papers/code to understand real capabilities and constraints.
  • Follow labs and companies that publish models or toolkits and collaborate (e.g., Physical Intelligence’s public releases, Boston Dynamics demos, academic robotics labs).

Final assessment (Levine’s perspective)

  • Optimistic about technical progress and surprised by faster-than-expected gains in dexterity and cross-embodiment transfer.
  • Timeline uncertain: the key unknown is when systems cross the activation threshold to be broadly useful and self-improving.
  • The biggest near-term scientific challenge is mid-level reasoning / grounding of common-sense knowledge into physical actions — solving this unlocks much of the rest.