Overview of Autonomous Vehicle Research at Waymo
This Practical AI episode features Drago Anguelov, VP and Head of the AI Foundations team at Waymo. He summarizes Waymo’s product progress since 2020, explains the technical architecture and modeling approaches used for driverless cars, and outlines the roles of simulation, foundation models, and multi‑modal perception in scaling safe autonomous driving. The conversation covers deployments, safety findings, partnerships, research directions (vision‑language‑action models, world/simulation models), and open problems such as realistic simulation, out‑of‑distribution behavior, and cost‑scalable validation.
Key topics discussed
- Waymo product expansion and deployment footprint (Waymo One) since 2020
- Real-world safety statistics and transparency practices
- Vehicle hardware: sensors, compute, redundancy, electrification
- System architecture for autonomy: perception → representation → prediction → planning → control
- Modeling approaches: modular vs. end‑to‑end, on‑board vs. off‑board foundation models
- Role and challenges of simulation (validation, rare events, covariate shift)
- Multi‑modal foundation models (vision, LiDAR, radar, audio) and vision‑language‑action ideas
- Research findings on scaling laws for motion models, open‑loop vs closed‑loop behavior
- Fleet coordination, swarming, and operational logistics (e.g., charging)
- Future directions: better simulators, multi‑modal pretraining, balancing latency and reasoning
Deployments, scale, partnerships, and safety
- Waymo One opened fully driverless rides to the public in Oct 2020 (Phoenix East Valley) and has since expanded to multiple metros.
- Current service (as of discussion): scaled across major metros (San Francisco, Los Angeles, Phoenix, Atlanta, Austin) with hundreds of thousands of rides per week and over 100 million autonomous miles driven.
- Reported safety improvements vs. comparable human driving baselines: a several‑fold reduction in serious‑injury crashes and pedestrian injuries, per Waymo’s published reports and corroborating insurance studies referenced in the episode.
- Partnerships: integrations with Uber (Austin, Atlanta), Lyft (Nashville), DoorDash (exploring delivery) — expanding service types (taxis → highways → airports → snow-capable regions → international launches such as London).
- Practice of transparency: public safety reports, incident filings, Waymo Open Dataset and research publications.
Notable quotes/paraphrases:
- Drago: Waymo represents “one of the most advanced embodiments of physical AI” in practical use today.
- User anecdote: his mother‑in‑law’s verdict that “this car drives much better than me” illustrates how direct experience often builds trust faster than statistics.
System architecture and hardware (high level)
- Sensors: cameras, LiDAR, radar, microphones (used for siren detection and audio cues).
- Compute: substantial on‑vehicle compute, well beyond consumer devices, with redundancy in critical systems (steering, brakes, compute paths).
- Actuators: vehicle steering, throttle, braking — designed with redundancies and safety engineering.
- Vehicle platform choices: Waymo vehicles are electric by company design; newer generations include hardware targeted at snow and additional domain expansion.
- Software stack: perception → environment representation → behavior prediction for other agents → planning and trajectory selection → control/actuation. These components may be implemented as a mix of ML models and engineered modules.
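The modular stack described above can be sketched as a single control tick. This is a hypothetical, heavily simplified illustration of the perception → prediction → planning → control decomposition; every type and function here is a stub invented for this sketch, not Waymo's actual interfaces.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical, simplified types -- the real stack is far richer.
@dataclass
class Detection:           # perception output: one tracked object
    kind: str              # "vehicle", "pedestrian", ...
    x: float
    y: float

@dataclass
class PredictedPath:       # behavior-prediction output for one agent
    agent: Detection
    waypoints: List[Tuple[float, float]]

def perceive(sensor_frames) -> List[Detection]:
    """Fuse camera/LiDAR/radar frames into tracked objects (stubbed)."""
    return [Detection("pedestrian", 3.0, 1.5)]

def predict(scene: List[Detection]) -> List[PredictedPath]:
    """Forecast where each agent may go (stubbed constant-velocity model)."""
    return [PredictedPath(d, [(d.x, d.y), (d.x + 1.0, d.y)]) for d in scene]

def plan(predictions: List[PredictedPath]) -> List[Tuple[float, float]]:
    """Select an ego trajectory that avoids predicted paths (stubbed)."""
    return [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]

def control(trajectory: List[Tuple[float, float]]) -> dict:
    """Convert the chosen trajectory into actuator commands (stubbed)."""
    return {"steer": 0.0, "throttle": 0.2, "brake": 0.0}

def drive_one_tick(sensor_frames) -> dict:
    """One pass through the modular pipeline, run at a fixed cadence."""
    scene = perceive(sensor_frames)
    predictions = predict(scene)
    trajectory = plan(predictions)
    return control(trajectory)
```

In practice, as the episode notes, each stage may be an ML model or an engineered module, and the boundaries between stages are exactly what the modular-vs.-end-to-end debate is about.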
Modeling approaches: modular, end‑to‑end, and foundation models
- Two orthogonal axes: (1) modular vs. monolithic model decomposition; (2) whether you train end‑to‑end or not. Companies combine these choices differently.
- Waymo uses ML/AI throughout; off‑board foundation models (unconstrained by on‑vehicle latency) help curate data and train the on‑car models used in simulation and production.
- Research interest: vision‑language models extended to “vision‑language‑action” tasks — tying perception+language understanding to action generation for embodied agents.
- Safety mitigations: VLMs can hallucinate — Waymo applies system‑level checks and safety harnesses to guard against these failures.
Simulation, validation, and the “two hard problems”
- Two core autonomy problems:
- Build a reliable, robust onboard driver (models + systems).
- Test and validate that driver at statistical significance (rare events, corner cases).
- Simulation is essential because many safety‑critical events occur too rarely to observe at scale on real roads.
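To get a feel for why rare events force simulation, consider a standard back-of-the-envelope calculation (my illustration, not from the episode): if severe events follow a Poisson process and you drive M miles observing zero events, the one-sided upper confidence bound on the event rate is -ln(1 - confidence) / M.

```python
import math

def miles_for_zero_event_bound(target_rate_per_mile: float,
                               confidence: float = 0.95) -> float:
    """
    Miles needed so that observing ZERO events demonstrates, at the given
    confidence, an event rate no worse than target_rate_per_mile.
    Derivation: for a Poisson process, P(0 events in M miles) = exp(-rate*M),
    so the upper bound on rate is -ln(1 - confidence) / M; invert for M.
    """
    return -math.log(1.0 - confidence) / target_rate_per_mile

# Illustrative numbers only: to show a severe-event rate below
# 1 per 10 million miles at 95% confidence, with zero observed events,
# you need roughly 30 million incident-free miles.
miles = miles_for_zero_event_bound(1e-7)
```

Real validation is far more involved (events do occur, baselines differ by road type), but the order of magnitude explains why millions of virtual miles per day matter.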
- Simulation challenges:
- Realism: agents in simulation must respond plausibly to the vehicle’s actions (avoid unrealistic “empty spot” reactions).
- Covariate shift: a policy can lead the system into states not present in recorded data; sim must cover those states.
- Computational cost: Waymo runs millions of virtual miles per day; realistic, scalable simulators are expensive to run.
- Recent advances (cited research trends): controllable video/world models (e.g., Google DeepMind’s Genie) and pre‑trained video/text models that help build more realistic simulators; tradeoffs between realism and cost must still be solved.
Research highlights and progress vs. remaining hard problems
Progress and momentum:
- Significant improvements in ML architectures, data practices, and scaling laws for motion prediction and behavior modeling.
- Successful use of motion‑LM ideas (tokenizing motion) and LLM‑inspired architectures for forecasting agent behavior.
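The core idea behind motion tokenization is to quantize continuous trajectories into a discrete vocabulary so that LLM-style sequence models can autoregress over motion the way they do over text. Below is a minimal sketch of that idea under my own simplifying assumptions (uniform bins over per-step displacements); MotionLM itself tokenizes jointly over agents with a learned setup, so treat this as illustrative only.

```python
import numpy as np

def tokenize_trajectory(xy: np.ndarray,
                        bins_per_axis: int = 13,
                        max_delta: float = 2.0) -> list:
    """
    Quantize per-step (dx, dy) displacements into a discrete vocabulary.
    Each step maps to one token id in [0, bins_per_axis**2).
    Simplified for illustration: uniform bins, single agent.
    """
    deltas = np.diff(xy, axis=0)                       # (T-1, 2) displacements
    clipped = np.clip(deltas, -max_delta, max_delta)   # bound extreme motion
    # Map each axis into an integer bin in [0, bins_per_axis)
    bins = np.floor((clipped + max_delta) / (2 * max_delta)
                    * (bins_per_axis - 1e-9)).astype(int)
    # Combine the two axis bins into a single token id per step
    return (bins[:, 0] * bins_per_axis + bins[:, 1]).tolist()

# A short, nearly straight path yields a short token sequence:
traj = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.1]])
tokens = tokenize_trajectory(traj)
```

Once motion is discrete, next-token prediction, scaling-law analysis, and other LLM machinery carry over directly, which is what makes the analogy to language modeling productive.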
- Off‑board foundation models and multi‑modal pretraining help accelerate capability growth and data curation.
Hard or ongoing challenges:
- Building simulators that are both realistic and computationally cost‑effective.
- Handling out‑of‑distribution detection, hallucination, and robust uncertainty estimation in VLMs and other large models.
- Ensuring closed‑loop (real-world) robustness — open‑loop improvements don’t always translate perfectly to closed‑loop safety.
- Scaling training data diversity for motion/action models (driving motion is far less diverse than language, so comparable scaling gains require more examples).
- Balancing latency constraints: need fast reflexive controllers on‑car plus slower, high‑level reasoners (System 1 / System 2 hybrid architectures).
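The System 1 / System 2 split in the last point above can be sketched as a fast control loop that always acts on the most recent plan while a slower deliberative process refreshes that plan in the background. This is a toy of my own construction; all class names, timings, and the throttle rule are illustrative, not anything Waymo has described.

```python
import threading
import time

class HybridDriver:
    """
    Toy System 1 / System 2 split: the fast path never blocks on the
    slow reasoner; it simply acts on whatever plan is currently cached.
    """
    def __init__(self):
        self.plan = {"speed": 0.0}      # last plan produced by the slow path
        self.lock = threading.Lock()

    def slow_reasoner(self, scene) -> dict:
        """High-latency deliberation (e.g., a large model); stubbed."""
        time.sleep(0.05)                # simulate ~50 ms of reasoning
        return {"speed": 10.0}

    def refresh_plan(self, scene):
        """Background refresh: swap in the new plan atomically."""
        new_plan = self.slow_reasoner(scene)
        with self.lock:
            self.plan = new_plan

    def fast_control(self) -> dict:
        """Low-latency reflex: reads the cached plan, never waits."""
        with self.lock:
            return {"throttle": min(self.plan["speed"] / 10.0, 1.0)}

driver = HybridDriver()
worker = threading.Thread(target=driver.refresh_plan, args=(None,))
worker.start()
cmd_before = driver.fast_control()  # acts immediately, on the stale plan
worker.join()
cmd_after = driver.fast_control()   # now reflects the refreshed plan
```

The design point is that reflexive safety behavior keeps running at a fixed cadence regardless of how long deliberation takes, which is the essence of the latency-vs.-reasoning balance discussed in the episode.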
Fleet coordination & “swarming”
- Current use cases: vehicles share useful information (construction zones, obstructions, slowdowns).
- Swarming and joint control become interesting at scale for operational logistics (charging scheduling, depot staging) and may one day influence traffic flow more broadly — but full traffic‑level coordination across mixed human/AV traffic is still a future area.
Future directions Drago is excited about
- Extending vision‑language models to include LiDAR, radar, and action spaces (true multi‑modal VLMs for driving).
- Building more capable, generalizable simulators using large video/world models while keeping them cost‑scalable.
- Deploying more off‑board foundation models to accelerate on‑vehicle improvements (transfer from foundation models to the production stack).
- Continued transparency, research publication, and community contributions (Waymo Research and Waymo Open Dataset).
Actionable resources & next steps
- Visit Waymo Research (waymo.com/research) and explore the Waymo Open Dataset for papers and data supporting autonomous driving research.
- Read Waymo safety reports and published studies for transparency and statistical analyses.
- Look up recent research areas mentioned: MotionLM/motion tokenization, VLM fine‑tuning for driving (Waymo’s “EMMA” paper referenced), scaling laws for motion models, and developments in controllable video/world models (e.g., Genie).
- If you’re a consumer: try a Waymo ride — Drago emphasizes hands‑on experience is the fastest way for people to gain confidence.
Main takeaways
- Waymo has scaled considerably since 2020, operating in multiple cities, serving hundreds of thousands of weekly rides and logging over 100M autonomous miles.
- Autonomous driving entails a complex hardware+software system with heavy multi‑modal input, real‑time constraints, and strict safety/redundancy requirements.
- Simulation and validation (not just model creation) are central and extremely challenging due to rare events, covariate shift, and computational cost.
- Multi‑modal foundation models, video/world models for simulation, and hybrid real‑time + deliberative model architectures are promising avenues to accelerate capability and safe deployment.
- Public trust often follows direct experience; transparency and statistics matter for policy and external validation, but user rides convert skeptics more effectively than numbers alone.
