Overview of Open Source Self-Driving with Comma AI
This episode of the Practical AI podcast features Harold (CTO at Comma.ai) discussing Comma’s hardware and OpenPilot software—an open source, retrofittable ADAS/autonomy stack. The conversation covers system architecture (device, on-device agent, and cloud training), Comma’s end-to-end approach to learning driving behavior, their world-model / simulation strategy, practical user experience, technical constraints (compute, control, RL, continual learning), and why open source matters for this space.
Key takeaways
- Comma.ai provides a retrofit device that mounts on the windshield behind the rearview mirror and connects to the car’s CAN bus. It runs OpenPilot and provides highway-focused driving automation covering both steering and longitudinal (gas/brake) control.
- OpenPilot is a widely used open-source autonomy stack and one of the most popular robotics projects on GitHub.
- Runtime inference and control happen entirely on-device; training and simulation run in Comma’s data center.
- Comma follows an “end-to-end” approach: models ingest raw video and output steering curvature and acceleration/gas/brake commands, with minimal explicit intermediate perception outputs.
- A critical innovation: Comma trains policy models inside a learned video-based world model (a generative/simulation model that responds to control inputs), then uses that simulator to train robust recovery behavior.
- Major unsolved technical problems highlighted: low-level controls, reinforcement learning (for tight feedback control), and continual learning (live adaptation on a per-vehicle basis).
- Open source is central to Comma’s strategy—community contributions help port new cars and increase adoption; transparency supports user ownership of devices.
System architecture — mental model
- Device (on-vehicle):
  - Cameras + compute + GPS/IMU (optional).
  - Runs the small policy/agent model and OpenPilot runtime processes (UI, state management, car API).
  - Sends CAN messages to the car (steer, gas, brake, engagement state).
  - Offline operation: works without continuous cloud connectivity; a connection is needed only for updates and data upload.
- Two-model separation:
  - World model (simulator): learned generative video model that can be conditioned on actions (e.g., “turn left 10 degrees”) and must be photorealistic and responsive to inputs.
  - Agent/policy: smaller model trained inside the simulator; the tiny model is what ships to devices for runtime inference.
- Training infrastructure:
  - Central data center runs simulation-based training, diffusion-video generation, and supervisory processes.
  - Training uses hundreds of millions of miles of human-driving data, plus simulator-generated perturbed trajectories to teach recovery from mistakes.
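The on-device half of this split can be pictured as a loop: take a frame, run the small policy locally, clamp its output, and emit a command only while the system is engaged. The sketch below is a minimal illustration under stated assumptions: `Command`, `run_loop`, and both limits are hypothetical stand-ins rather than OpenPilot’s real interfaces, and actual CAN encoding is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Command:
    """Hypothetical policy output: path curvature (1/m) and acceleration (m/s^2)."""
    curvature: float
    accel: float


Frame = Sequence[float]              # stand-in for a camera frame
Policy = Callable[[Frame], Command]  # the small on-device model

MAX_CURVATURE_STEP = 0.002  # hypothetical per-tick curvature rate limit (1/m)
MAX_ACCEL = 2.0             # hypothetical acceleration clamp (m/s^2)


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def run_loop(frames: List[Frame], policy: Policy, engaged: List[bool]) -> List[Command]:
    """Run one policy step per frame and return the commands that would be sent."""
    sent: List[Command] = []
    last_curvature = 0.0
    for frame, is_engaged in zip(frames, engaged):
        raw = policy(frame)  # inference runs locally; no cloud round-trip
        if not is_engaged:
            # Hypothetical simplification: restart the ramp from straight ahead.
            last_curvature = 0.0
            continue
        # Rate-limit curvature so a single bad prediction cannot jerk the wheel.
        curvature = clamp(raw.curvature,
                          last_curvature - MAX_CURVATURE_STEP,
                          last_curvature + MAX_CURVATURE_STEP)
        accel = clamp(raw.accel, -MAX_ACCEL, MAX_ACCEL)
        sent.append(Command(curvature, accel))
        last_curvature = curvature
    return sent
```

For example, a toy policy that always requests `Command(0.01, 3.0)` yields ramped curvatures of 0.002 and 0.004 on two engaged ticks, with acceleration capped at 2.0, and nothing is sent on disengaged ticks.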
End-to-end vs classical modular approaches
- End-to-end (Comma’s emphasis): raw sensor (video) → neural network → driving commands (curvature, acceleration). Minimal hand-engineered intermediate representations (no explicit lane/traffic-light detection in the core policy).
- Classical pipelines: perception (segmentation, object detection, lanes) → planner/optimizer → control. These require heavy labeling, hand-engineered rules, and curated data pipelines.
- Trade-offs:
  - End-to-end scales better because it needs less human labeling, but it requires sophisticated simulators and robustness methods.
  - Classical approaches remain popular at larger, better-funded companies that can afford extensive labeling and multi-stage systems.
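A deliberately tiny sketch makes the contrast concrete. Here a 1-D list of pixel intensities stands in for a camera frame, and a linear model trained by stochastic gradient descent stands in for a neural network on video; every name is hypothetical. The modular route needs a hand-written lane detector (and, in practice, lane labels to build one), while the end-to-end route learns the command directly from (frame, human command) pairs.

```python
from typing import List, Tuple

Obs = List[float]  # toy stand-in for a flattened camera frame


# --- Modular pipeline: explicit, hand-designed intermediate output ---
def detect_lane_offset(obs: Obs) -> float:
    """Hand-engineered 'perception': the brightest pixel is the lane center;
    return its offset from the image center, in pixels."""
    center = (len(obs) - 1) / 2
    brightest = max(range(len(obs)), key=lambda i: obs[i])
    return brightest - center


def plan_curvature(lane_offset: float, gain: float = 0.01) -> float:
    """Hand-tuned planner: steer toward the detected lane center."""
    return gain * lane_offset


def modular_policy(obs: Obs) -> float:
    return plan_curvature(detect_lane_offset(obs))


# --- End-to-end: one learned map from raw input to the command ---
def fit_end_to_end(data: List[Tuple[Obs, float]],
                   lr: float = 0.05, epochs: int = 200) -> List[float]:
    """Fit a linear policy obs -> curvature by SGD on (frame, human command)
    pairs. No lane labels are used, only what the human driver did."""
    weights = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for obs, target in data:
            pred = sum(w * x for w, x in zip(weights, obs))
            err = pred - target
            weights = [w - lr * err * x for w, x in zip(weights, obs)]
    return weights


def end_to_end_policy(weights: List[float], obs: Obs) -> float:
    return sum(w * x for w, x in zip(weights, obs))
```

On one-hot toy frames where the human steers by `0.01 * (lane_index - 2)`, both routes end up producing the same curvature; what differs is the engineering each one needs to get there.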
World model / learned simulator (practical explanation)
- Purpose: avoid pure imitation learning’s drift problem, in which small errors compound and carry the agent into states the human data never covered, by training the agent on perturbed scenarios and supervised recoveries.
- Requirements for a useful world model:
  - Photorealism and diversity, so the agent cannot learn to exploit simulator artifacts.
  - Responsiveness to control inputs: generated video must change realistically when inputs/actions are applied.
- Role in training:
  - Instantiate simulator on a real scene, apply perturbations (e.g., push off-center), and use the simulator (which “knows” the future) to provide supervisory signals / recovery trajectories for agent training.
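The recovery idea can be sketched without any video at all. In this hedged toy, a small-angle kinematic model stands in for the learned world model, and a pure-pursuit-style rule that aims back at the logged path stands in for the supervisory signal. Both are assumptions chosen for illustration, not Comma’s actual method. The sketch rolls a perturbed start forward twice: once replaying the human’s straight-road action (pure imitation) and once following the recovery labels.

```python
from typing import Callable, List, Tuple

# (lateral offset from the logged human path [m], heading error [rad])
State = Tuple[float, float]


def sim_step(state: State, curvature: float, v: float = 20.0, dt: float = 0.05) -> State:
    """Toy stand-in for the learned world model: advance the pose given an action.
    A real world model would generate the next video frame instead."""
    y, psi = state
    # Small-angle kinematics: offset changes with heading, heading with curvature.
    return (y + v * psi * dt, psi + v * curvature * dt)


def recovery_label(state: State, lookahead: float = 20.0) -> float:
    """Supervision from the known future: steer back onto the logged path
    `lookahead` meters ahead (pure-pursuit-style curvature)."""
    y, psi = state
    projected_offset = y + psi * lookahead  # offset if the current heading were kept
    return -2.0 * projected_offset / lookahead ** 2


def rollout(start: State, label_fn: Callable[[State], float],
            steps: int = 200) -> Tuple[List[Tuple[State, float]], State]:
    """Roll the simulator forward, recording (state, action label) training pairs."""
    pairs: List[Tuple[State, float]] = []
    state = start
    for _ in range(steps):
        action = label_fn(state)
        pairs.append((state, action))
        state = sim_step(state, action)
    return pairs, state


if __name__ == "__main__":
    perturbed: State = (1.0, 0.02)  # pushed 1 m off-center, small heading error
    # Pure imitation: replay the human's straight-road action (zero curvature).
    _, naive_end = rollout(perturbed, lambda s: 0.0)
    # Simulator-supervised recovery labels.
    pairs, recovered_end = rollout(perturbed, recovery_label)
    print(f"naive final offset: {naive_end[0]:.2f} m")
    print(f"recovery final offset: {recovered_end[0]:.4f} m")
```

Replaying the human action lets the offset grow from 1 m to about 5 m over the rollout, while the recovery labels bring it back near zero. The recorded `pairs` are the kind of (state, action) examples an agent could then be trained on.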
User experience & product details
- Installation: plug into the car’s CAN connector near the rearview mirror, mount on the windshield—minimal setup for supported vehicles.
- Engagement: uses the car’s cruise-control engage button; device provides UI and audio feedback.
- Performance: Comma reports that more than 50% of the miles driven by OpenPilot users are driven by the system (highway-focused reliability).
- Upgrades planned: an external GPU option to raise on-device model size/compute (expected to significantly improve edge cases like nuanced green-light detection).
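The engagement flow above can be modeled as a small state machine. This is a hypothetical sketch, not OpenPilot’s implementation. It assumes the cruise button engages, that pressing the brake always disengages (and blocks engagement), and it uses made-up cue names for the UI/audio feedback.

```python
from enum import Enum, auto
from typing import Optional, Tuple


class EngageState(Enum):
    DISENGAGED = auto()
    ENGAGED = auto()


def step(state: EngageState, cruise_button: bool,
         brake_pressed: bool) -> Tuple[EngageState, Optional[str]]:
    """Advance the engagement state; return (next_state, optional UI/audio cue).

    Assumed rules for illustration: the cruise button engages, and pressing
    the brake always disengages (and blocks engagement)."""
    if state is EngageState.ENGAGED and brake_pressed:
        return EngageState.DISENGAGED, "disengage_chime"  # hypothetical cue name
    if state is EngageState.DISENGAGED and cruise_button and not brake_pressed:
        return EngageState.ENGAGED, "engage_chime"        # hypothetical cue name
    return state, None


if __name__ == "__main__":
    state = EngageState.DISENGAGED
    # (cruise_button, brake_pressed) per control tick
    for inputs in [(True, False), (False, False), (False, True), (True, True)]:
        state, cue = step(state, *inputs)
        print(state.name, cue)
```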
Practical constraints and hardware
- On-device compute is constrained by thermal/placement limits (windshield-mounted device); therefore Comma runs older, lower-power chips compared to integrated OEM systems.
- Typical compute gap: Comma’s device uses far less compute than a full OEM FSD computer (Harold estimated ~1/100th), so many gains come from efficient models and simulation-informed training.
- Adding an external GPU (e.g., under the seat) could enable 10–100x larger models and measurable capability increases (e.g., better traffic light recognition).
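A quick back-of-the-envelope calculation shows why extra compute translates into larger models. At a fixed frame rate, the FLOPs available per frame scale linearly with sustained throughput, so a roughly 100x compute increase raises the per-frame budget by the same factor. All numbers below (peak throughput, 30% utilization, 20 frames per second) are hypothetical placeholders, not Comma hardware specs. Memory bandwidth and latency can cap the gain, so treat the linear scaling as an upper bound.

```python
def per_frame_budget_flops(peak_flops: float, utilization: float, fps: float) -> float:
    """FLOPs available per frame = sustained throughput / frame rate."""
    return peak_flops * utilization / fps


# Hypothetical numbers for illustration only (not Comma hardware specs).
FPS = 20.0          # assumed model frame rate
UTILIZATION = 0.3   # assumed fraction of peak throughput actually achieved

device = per_frame_budget_flops(2e12, UTILIZATION, FPS)
external_gpu = per_frame_budget_flops(2e14, UTILIZATION, FPS)  # ~100x more compute

print(f"device: {device / 1e9:.0f} GFLOPs/frame")
print(f"external GPU: {external_gpu / 1e9:.0f} GFLOPs/frame")
print(f"ratio: {external_gpu / device:.0f}x")
```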
Major technical challenges (unsolved problems)
- Controls:
  - Cars respond inconsistently across models (delays, unknown internal logic). Classical control tuning is still required; learning-based low-level control remains immature.
  - Per-vehicle parameters (tire stiffness, friction) often have to be learned live.
- Reinforcement learning (RL):
  - Imitation learning alone fails for tight feedback-control tasks.
  - Current RL approaches are not yet reliable at real-world scale for driving controls; research and engineering gaps remain (reward specification, noisy real-world dynamics).
- Continual learning:
  - The agent must adapt online to changes (tire pressure, weather, wear) to maintain performance; current systems use classical optimization for adaptation rather than seamless neural continual learning.
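One classical way to learn a per-vehicle parameter live is recursive least squares (RLS) with a forgetting factor. The sketch below is a swapped-in technique for illustration, not Comma’s estimator. It assumes a kinematic small-angle model in which curvature = steering_wheel_angle / (steer_ratio * wheelbase), with a hypothetical 2.7 m wheelbase. It tracks the resulting gain from noisy synthetic samples and inverts it to produce a feedforward steering command.

```python
import random


class ScalarRLS:
    """Recursive least squares for y ~= theta * x with exponential forgetting.
    Forgetting (lam < 1) lets the estimate follow slow changes in the car."""

    def __init__(self, theta0: float = 0.0, p0: float = 1e3, lam: float = 0.98) -> None:
        self.theta = theta0  # current parameter estimate
        self.p = p0          # estimate variance; large means low confidence
        self.lam = lam       # forgetting factor, 0 < lam <= 1

    def update(self, x: float, y: float) -> float:
        gain = self.p * x / (self.lam + x * self.p * x)
        self.theta += gain * (y - self.theta * x)
        self.p = (self.p - gain * x * self.p) / self.lam
        return self.theta


WHEELBASE_M = 2.7  # hypothetical vehicle


def curvature_gain(steer_ratio: float) -> float:
    """Small-angle kinematic model: curvature per radian of steering-wheel angle."""
    return 1.0 / (steer_ratio * WHEELBASE_M)


def simulate(n: int, steer_ratio: float, est: ScalarRLS, rng: random.Random) -> float:
    """Feed n noisy (steering-wheel angle, measured curvature) samples to est."""
    g = curvature_gain(steer_ratio)
    for _ in range(n):
        angle = rng.uniform(-0.5, 0.5)            # steering-wheel angle [rad]
        kappa = g * angle + rng.gauss(0.0, 1e-5)  # measured curvature [1/m]
        est.update(angle, kappa)
    return est.theta


def feedforward_angle(desired_curvature: float, est: ScalarRLS) -> float:
    """Invert the learned gain to turn a desired curvature into a wheel angle."""
    return desired_curvature / est.theta


if __name__ == "__main__":
    rng = random.Random(0)
    est = ScalarRLS()
    print(simulate(300, 15.0, est, rng), curvature_gain(15.0))
    # The car's effective steering response drifts (e.g., tire or alignment change).
    print(simulate(300, 16.5, est, rng), curvature_gain(16.5))
```

With `lam = 0.98` the estimator effectively remembers the last ~50 samples, which trades noise rejection for how quickly it follows a change. Note that long stretches of near-zero steering provide little excitation and let `p` grow, a known RLS pitfall.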
Differences vs. other companies (Tesla, Waymo)
- Tesla / Waymo often combine large labeling/data-engineering efforts and greater on-vehicle compute budgets; they may use hybrid pipelines (some classical perception + end-to-end research).
- Comma’s differentiator: lean end-to-end focus, learned simulators, and open-source ecosystem enabling rapid car support via community contributions.
- Comma aims to be profitable and incremental—ship useful features today while iterating toward the longer-term end-to-end robotics vision.
Open source rationale and engineering choices
- OpenPilot is open source to enable community-driven car ports and ecosystem growth—practical necessity for supporting many car models.
- Language choices:
  - Heavy use of Python (roughly 66% of the codebase) for rapid experimentation, development, and ML workflows.
  - C/C++ where performance, safety standards, or low-level car interfacing require it.
- Philosophy: if you own a device, you should be able to inspect and control its software; openness supports user sovereignty.
Future directions and adjacent use cases
- Short/medium-term: improve urban red-light behavior, smoother city driving, and make models more robust with increased on-device compute.
- Adjacent robotics: the same end-to-end + world-model approach could transfer to indoor navigation and other mobile robotics (vacuum, mowers, home robots) once perception, simulation, and control mature.
- Long-term vision: generalized ML agents that treat different actuators (steering vs. arm movement) similarly—enabling broader robotic actions beyond driving.
- Desire for simple, useful robotics products (dishwashers, vacuum cleaners, indoor helpers) that are open source and user-controlled.
Notable quotes / insights
“Self-driving was the most interesting applied robotics problem, period. It’s a place where you can make products that are immediately useful.” — Harold, CTO, Comma.ai
“If you want to make a robotic simulator with this approach, you need it to be accurate in terms of responding to inputs… It has to look photorealistic and respond accurately to inputs.” — Harold
Where to learn more / next steps
- OpenPilot GitHub and docs (search “comma.ai OpenPilot”).
- Comma.ai blog posts for technical write-ups (world model, releases).
- PracticalAI.fm for this episode and other AI application discussions.
Actionable takeaways for listeners
- If you’re curious about retrofit autonomy, check whether your car is supported by OpenPilot and review the installation/compatibility guidance on the OpenPilot site.
- For researchers and developers: Comma’s open-source codebase and simulation-first approach are good references for end-to-end autonomy experiments.
- If you care about device ownership and transparency, consider supporting open-source robotics and autonomy projects that make their code and data pipelines accessible.
Sources: episode interview with Harold (CTO, Comma.ai) on the Practical AI podcast (hosts Daniel Whitenack and Chris Benson). Note: transcript contained a few transcription errors (e.g., “Kama” → Comma.ai).
