Overview of World Models & General Intuition (swyx + Alessio)
This episode covers a visit to General Intuition (GI), a spin-out of the 10-year-old game-clip platform Medal. GI is building large-scale vision-based world models and imitation-learned agents that predict actions directly from pixel frames. The conversation covers GI’s dataset and privacy design, live demo highlights (real-time, human-like, and sometimes superhuman behavior), research lineage (Diamond / SIMA / Genie), engineering and product strategy, customers and business model, fundraising (a Khosla-led $134M seed), and a 2030 vision for spatial-temporal foundation models.
Key takeaways
- Medal collected ~3.8 billion curated game clips from ~12M users via a retroactive clipping recorder, producing a unique high-quality dataset of “peak” human behavior and rare/adversarial events.
- GI trains vision→action models (pure imitation learning) that run in real time, predict keyboard/mouse actions from pixels, and have short-term memory (~4s).
- They transfer from games → realistic games → real-world video, enabling training on any internet video as free additional data (with caveats).
- GI turned down a reported $500M offer from OpenAI and raised a $134M seed led by Khosla Ventures (Vinod Khosla’s largest single seed bet since OpenAI).
- Business model: API / custom models / distilled models for customers (they do not sell raw data). Initial customers: game developers, game engines, and robotics/manufacturing teams that can use game-controller-like inputs.
- Long-term vision: be the “gold standard of intelligence” for spatial-temporal agents; target 80% of atoms→atoms AI interactions and a much larger simulation market by 2030.
Demo & technical highlights
- Agent architecture: a vision-only policy that takes frames in and outputs actions (no game-state access, no RL in the base model); a minimal sketch follows this list.
- Memory/temporal behavior: ~4-second memory window shown in demos; models can get “unstuck,” maintain position under partial observability (e.g., smoke), and handle rapid camera/mouse dynamics.
- Behavior characteristics: trained on highlight clips, so baseline is “peak human” — models can replicate superhuman moves seen in highlights.
- Transfer capability: GI can label and predict actions for arbitrary videos (yellow = ground truth, purple = model prediction) and use that for additional training.
- World models: both pre-trained from scratch and fine-tuned from open video models to capture physical effects (e.g., camera shake), mouse sensitivity, and multi-view consistency.
- Model scaling & compression: distillation to tiny, real-time models for cost/parameter efficiency; smaller models make more mistakes but still operate in real time.
- Notable emergent behavior: spatial reasoning (hiding, reload timing), consistent tracking of positions that move out of view, and physical plausibility even for effects that exist only in video.
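To make the frames-in, actions-out contract concrete, here is a minimal sketch (not GI's actual architecture; every module name and dimension below is an assumption) of a vision-only policy: a small convolutional encoder over a short stack of recent frames, standing in for the ~4-second memory window, feeding a discrete head for key/button tokens and a continuous head for mouse deltas.

```python
import torch
import torch.nn as nn

class FramePolicy(nn.Module):
    """Illustrative vision-only policy: a stack of recent RGB frames in,
    key-token logits and mouse deltas out. Dimensions are made up."""

    def __init__(self, context_frames: int = 16, num_key_tokens: int = 32):
        super().__init__()
        # Encoder over channel-stacked frames (context_frames * 3 RGB channels).
        self.encoder = nn.Sequential(
            nn.Conv2d(context_frames * 3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.key_head = nn.Linear(64, num_key_tokens)   # discrete key/button tokens
        self.mouse_head = nn.Linear(64, 2)              # continuous (dx, dy) mouse delta

    def forward(self, frames: torch.Tensor):
        # frames: (batch, context_frames * 3, H, W) -- pixels only, no game state.
        z = self.encoder(frames)
        return self.key_head(z), self.mouse_head(z)

# Pure imitation learning: supervise both heads with labeled human actions.
policy = FramePolicy()
frames = torch.rand(4, 16 * 3, 128, 128)      # dummy batch of frame stacks
key_labels = torch.randint(0, 32, (4,))       # which key token the human pressed
mouse_labels = torch.randn(4, 2)              # the human's mouse movement
key_logits, mouse_pred = policy(frames)
loss = nn.functional.cross_entropy(key_logits, key_labels) \
     + nn.functional.mse_loss(mouse_pred, mouse_labels)
loss.backward()
```

A production system would differ substantially (and, per the distillation bullet above, could be compressed into much smaller real-time students), but the I/O contract is the point: pixels in, keyboard/mouse actions out.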
Dataset, privacy and labeling approach
- Medal’s product design: a retroactive clipper (an always-on in-memory recorder; interesting moments are saved on demand), which yields highlights and avoids storing uninteresting footage.
- Scale & uniqueness: ~3.8B clips, rich overlay signals (controller overlays, HUDs) and action-like signals enabling high-quality action labeling at scale.
- Privacy-first design: they avoid storing raw per-user keystrokes as-is. Instead, Medal maps in-game visual inputs to actions and aggregates/labels action tokens so individual users’ raw inputs are not recoverable (a toy illustration follows this list).
- Labeling effort: thousands of humans labeled the action space across many games; GI converts those labels into trainable action tokens.
- Why highlights matter: clips select out-of-distribution/high-signal moments (negative events, exceptional skill), which are highly valuable for training robust agents.
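As a toy illustration of the privacy-preserving labeling idea (not Medal's or GI's actual pipeline; the token vocabulary, field names, and bucketing are invented), the sketch below maps per-frame input signals inferred from on-screen overlays into coarse, discretized action tokens, so only aggregated tokens need to be stored rather than a per-user keystroke log.

```python
from dataclasses import dataclass

# Coarse vocabulary of action tokens; a real vocabulary would be far larger.
KEY_TOKENS = {"idle": 0, "move_forward": 1, "move_back": 2, "strafe": 3, "jump": 4, "fire": 5}

@dataclass
class FrameObservation:
    """Signals inferred from one frame's visual overlays (HUD, controller overlay)."""
    inferred_key: str   # e.g. "move_forward", detected from the overlay, not a keylogger
    mouse_dx: float     # inferred camera/mouse movement between frames
    mouse_dy: float

def bucket_mouse(dx: float, dy: float, bucket: float = 5.0) -> tuple[int, int]:
    """Quantize mouse deltas into coarse buckets so exact per-user inputs are not kept."""
    return round(dx / bucket), round(dy / bucket)

def to_action_tokens(clip: list[FrameObservation]) -> list[tuple[int, int, int]]:
    """Map a clip's per-frame observations to (key_token, dx_bucket, dy_bucket) triples."""
    tokens = []
    for obs in clip:
        key_token = KEY_TOKENS.get(obs.inferred_key, KEY_TOKENS["idle"])
        dx_b, dy_b = bucket_mouse(obs.mouse_dx, obs.mouse_dy)
        tokens.append((key_token, dx_b, dy_b))
    return tokens

clip = [
    FrameObservation("move_forward", 2.1, -0.4),
    FrameObservation("fire", 11.7, 3.2),
]
print(to_action_tokens(clip))   # e.g. [(1, 0, 0), (5, 2, 1)]
```

The relevant property is that the stored artifact is a sequence of coarse action tokens that can serve as supervision targets, while individual users’ exact raw inputs cannot be reconstructed from it.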
Research context & comparisons
- Lineage: GI draws on the Diamond, SIMA (SIMA/SIMA 2), and Genie lines of research, and has recruited contributors from the Diamond and GAIA teams.
- Contrast with other approaches:
- Frame-based video: GI’s pipeline uses direct frame prediction and action-token prediction (aligned with Medal’s data format).
- SPAT / simulator-first approaches (e.g., Fei-Fei Li / World Labs): GI views those as useful but not yet interactive enough; SPAT may be more verifiable but is higher DOF and potentially harder to scale.
- LLMs: GI sees LLMs as complementary (text is a compressed modality). Their argument: spatial-temporal generalization requires a spatial backbone; LLMs will remain useful as controllers/orchestrators on top.
- Practical lesson: simulation compute grows quickly with the number of agents, degrees of freedom, and the information content of actions; video transfer is an efficient bet for highly stochastic, hard-to-simulate domains (a rough cost sketch follows).
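To make the scaling intuition concrete, a toy back-of-envelope (my numbers, not GI's): simulated experience scales multiplicatively with agent count, episode length, and episode count, while the information content of each action grows with degrees of freedom, so richer action spaces need more coverage.

```python
import math

def sim_steps_needed(num_agents: int, episode_len: int, episodes: int) -> int:
    """Total simulated environment steps: grows multiplicatively."""
    return num_agents * episode_len * episodes

def action_bits_per_step(dof: int, levels_per_dof: int = 16) -> float:
    """Information content of one action if each degree of freedom is
    discretized into `levels_per_dof` choices (an illustrative assumption)."""
    return dof * math.log2(levels_per_dof)

# A few keys plus a 2-axis mouse vs. a 30-DOF embodiment:
print(action_bits_per_step(dof=3))    # 12 bits/step
print(action_bits_per_step(dof=30))   # 120 bits/step
print(sim_steps_needed(num_agents=64, episode_len=10_000, episodes=1_000))  # 640,000,000 steps
```

Under these illustrative assumptions, moving to a high-DOF action space multiplies both the per-step action entropy and the rollouts needed to cover it, which is the argument for leaning on transferred video rather than simulation alone.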
Business model, customers & use cases
- Primary offering: an API that ingests frames and returns actions, plus custom policy models; GI can also distill models for customers to deploy (see the hypothetical client sketch after this list).
- Customers: large game studios, game engines (replace behavior trees / scripted controllers with learned policies), robotics/manufacturing teams where robots accept controller-like inputs.
- Immediate use cases:
- Better bots to keep games populated (player liquidity, night-time play).
- Simulation & training data (e.g., replaying negative events for safety-focused model fine-tuning).
- Making millions/billions of clips “playable” in a world-model sense (move from imitation → RL).
- They do not sell raw Medal data; monetization centers on models, APIs, and custom solutions integrated into customers’ stacks.
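The episode describes the core product as frames in, actions out over an API, but no concrete interface is given; the snippet below is a purely hypothetical client (the endpoint URL, payload fields, and response shape are invented) meant only to show what that contract could look like from a customer's side.

```python
import base64
import requests  # assumes the `requests` package is installed

# Hypothetical endpoint and payload shape -- not a documented GI API.
API_URL = "https://api.example.com/v1/act"

def request_action(frame_jpegs: list[bytes], api_key: str) -> dict:
    """Send the most recent frames and get back a predicted action."""
    payload = {
        "frames": [base64.b64encode(f).decode("ascii") for f in frame_jpegs],
        "format": "jpeg",
    }
    resp = requests.post(
        API_URL,
        json=payload,
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=1.0,  # real-time control loops need tight latency budgets
    )
    resp.raise_for_status()
    # Hypothetical response, e.g. {"keys": ["w"], "mouse": {"dx": 4, "dy": -1}}
    return resp.json()
```

Per the bullets above, customers who need on-device or on-prem latency could instead receive a distilled model rather than calling a hosted endpoint.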
Founder story, team & partnerships
- Founder/CEO (Pim): built Medal and grew it into a large game-clip social product; a self-taught engineer who supplemented that experience with deep learning coursework to lead GI.
- Team: co-founders include deep ML researchers; hires include core contributors from Diamond/GAIA (e.g., Anthony Hu). Team mixes infrastructure strength (video/GPU/transcoding) with research talent.
- Funding & partners: $134M seed led by Khosla Ventures; Khosla’s investment was driven by deep technical vetting of the vision. GI has partnered with Kyutai (the open-science AI lab in Paris) to enable open research collaborations.
Vision: where GI wants to be by 2030
- Goal: become the “gold standard of intelligence” for spatial-temporal sequence modeling.
- Target metrics: aim to be responsible for ~80% of atoms→atoms interactions handled by AI models and, initially, to power a much larger (~100x) simulation market.
- North-star use cases: simulation-first scientific problems (e.g., virtual biology, factory-floor training), then transfer to real-world robotics where possible via post-training using their foundation models.
Notable quotes & insights
- “Medal is the episodic memory of humanity in simulation.” — positioning clips as curated, high-signal episodic data.
- “We want to be the gold standard of intelligence — any sequence long enough is fundamentally spatial and temporal.” — the core thesis behind GI’s focus.
- Practical advice to other data owners: “You can’t value data until you model it yourself — build models with your data to see unique capabilities.”
Actionable recommendations (for different audiences)
- For researchers: consider frame-based imitation + world-model pipelines when you have abundant interactive video; test transfer to real-world video and partial-observability settings.
- For data-rich founders: do lightweight experiments (train small models) to estimate the unique capabilities of your dataset before selling/licensing; aim for equity or strong collaboration terms in any large data deal.
- For game developers & engine teams: evaluate GI-like models to replace deterministic NPC logic and improve player engagement in low-liquidity hours; consider how controller/overlay data could be used for model fine-tuning.
- For academics / universities: GI is open to collaborations — contact them for research access if your project aligns (negative-event prediction, safety-focused simulation).
Final note
GI’s approach is an end-to-end bet on action-token prediction from pixel frames, leveraging a rare dataset of curated highlights to accelerate world-model capabilities. They position themselves as a foundation-model provider for spatial-temporal agents with a roadmap from imitation learning → distilled real-time agents → RL and broader simulation/robotics applications.
