After LLMs: Spatial Intelligence and World Models — Fei-Fei Li & Justin Johnson, World Labs

by swyx + Alessio

November 25, 2025

Overview of After LLMs: Spatial Intelligence and World Models — Fei-Fei Li & Justin Johnson (World Labs)

This episode explores "what comes after LLMs": the push from language-centric AI to models that understand and generate 3D, interactive worlds. Fei‑Fei Li and Justin Johnson (co‑founders of World Labs) discuss the origins of their collaboration, the product and research goals behind Marble (their publicly released 3D world model), technical design choices (Gaussian splats, multimodal inputs, interactivity), and broader scientific, hardware and ecosystem challenges for building true spatial intelligence.

Key takeaways

  • Spatial intelligence = the capability to reason about, move through, perceive and act within 3D space. It complements (not replaces) language intelligence.
  • Marble is World Labs’ first shipped product: a generative model for 3D worlds that accepts multimodal inputs (text, images), supports interactive editing and camera‑motion recording, and exports assets for creative workflows.
  • The “next big leap” in AI is driven by scaling compute + richer spatial/visual data, but also requires rethinking data structures, representations, and potentially hardware primitives.
  • Practical work will mix open science and productized, closed efforts. Academia remains crucial but under‑resourced relative to industry.
  • Important research challenges: embedding physics/dynamics, enabling interactive/embodied agents, higher‑fidelity renderings, and better generalization (causal/world‑laws vs pattern matching).

Origins & context

  • Fei‑Fei Li and Justin Johnson have a long advisor‑student relationship (Justin joined Fei‑Fei’s Stanford lab around the AlexNet/ImageNet era) and reunited to pursue spatial world models.
  • Their trajectory: image captioning and dense captioning work evolved into interest in 3D/generative models and interactive systems—leading to World Labs and Marble.
  • They emphasize the historical pattern: deep learning advances follow compute scaling, and spatial models are a natural next sink for that compute.

What is Marble (concise)

Core idea

  • A generative model of 3D worlds: input text/images → generates a 3D scene; supports edits, camera motion recording, and export.

Capabilities highlighted

  • Multimodal inputs (text, single/multiple images).
  • Interactive editing (change objects/colors/layouts, re‑generate locally).
  • Precise camera control / scene recording (enables director‑style framing).
  • Real‑world product fit: early users span gaming, VFX, film and interior design, plus synthetic‑data generation for robotics.

Design intentions

  • Balance research toward a larger spatial intelligence vision with a usable product that people can apply today.

Technical foundations & data structures

  • Current native atomic unit: Gaussian splats (semi‑transparent Gaussian particles, each with a position, orientation/scale, opacity and color). Advantages: real‑time rendering, interactive camera control, mobile/VR friendliness.
  • Alternative architectures exist or are possible:
    • Frame‑by‑frame generation (RTFM approach).
    • Tokenized or higher‑level 3D tokens (future representations).
    • Physics‑augmented splats: attach mass, springs, or other physical properties to splats to enable dynamics (a minimal sketch follows this list).
  • Tradeoffs:
    • Splats render efficiently in real time on constrained devices, but extreme zoom/resolution is limited unless more compute and resources are available.
    • Fully regenerating scenes to model dynamics is general but computationally heavy; hybrid approaches (learned models + physics engines) are viable.
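
The following minimal Python sketch illustrates the representation bullets above: a splat record with the usual rendering attributes plus hypothetical "physics-augmented" fields (mass, velocity) and a toy dynamics step. Field and function names are assumptions for illustration, not World Labs' internal format.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Splat:
    """One Gaussian splat: the scene's atomic unit. Layout is illustrative only."""
    position: np.ndarray   # (3,) world-space center of the Gaussian
    rotation: np.ndarray   # (4,) unit quaternion orienting its covariance
    scale: np.ndarray      # (3,) per-axis extent
    opacity: float         # alpha used when compositing splats into a pixel
    color: np.ndarray      # (3,) RGB (production systems often store spherical harmonics)
    # Hypothetical physics augmentation a dynamics layer could consume:
    mass: float = 0.0
    velocity: np.ndarray = field(default_factory=lambda: np.zeros(3))

def step_dynamics(splats, dt=1.0 / 60.0, g=(0.0, -9.81, 0.0)):
    """Toy semi-implicit Euler step: splats with mass > 0 fall under gravity,
    sketching how attached physical properties would enable dynamics."""
    g = np.asarray(g)
    for s in splats:
        if s.mass > 0.0:
            s.velocity = s.velocity + g * dt
            s.position = s.position + s.velocity * dt
```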

Spatial intelligence: why it matters

  • Spatial intelligence handles embodied, high‑bandwidth information (vision + action) that language alone cannot capture efficiently.
  • Much of human intelligence evolved over millions of years for perception and spatial interaction; language is a far more recent, lower‑bandwidth abstraction.
  • Combining language and spatial models is key: people want to interact with systems via language, but high‑fidelity models must understand 3D structure, dynamics and affordances.

Research challenges & directions

  • Generalization vs causal understanding:
    • Current generative models can produce plausible scenes without learning causal physical laws; for physically critical use cases (architecture, robotics), explicit physics or causal modeling is needed.
    • Two paradigms: train models end‑to‑end and hope causal structure emerges internally, or explicitly inject physics (simulations, annotated physical properties); a toy sketch of the explicit‑injection route follows this list.
  • Data & compute:
    • Spatial models require more diverse, higher‑bandwidth data (3D scans, multi‑view sequences, simulation traces).
    • Scaling compute (and matching data) remains critical—hardware constraints and the “hardware lottery” matter.
  • Hardware & primitives:
    • Transformers and today’s deep nets are built around matrix‑multiply primitives that GPUs are tuned for; future large distributed systems might call for different computation primitives.
  • Representation design:
    • Choosing atomic units (splats, meshes, voxels, tokens, frames) affects latency, editability, fidelity and downstream use (rendering vs simulation).
  • Ecosystem & openness:
    • Open datasets and benchmarks still matter (e.g., Stanford’s BEHAVIOR benchmark for embodied AI and robotics).
    • Academia is under‑resourced; need for public compute/data repositories and diverse funding to sustain exploratory and “wacky” research.
  • Multimodality & interfaces:
    • Integrating language, vision and dynamics is both practical (user interactions) and scientifically interesting (how representations map across modalities).
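
As a concrete (and deliberately toy) illustration of the explicit‑physics paradigm mentioned above, the sketch below uses a stand‑in generator that proposes object placements and then injects a simple physical law (objects rest on the ground plane) by projection, rather than relying on end‑to‑end training to learn it. All names are hypothetical.

```python
import numpy as np

def propose_layout(num_objects, rng):
    """Stand-in for a learned scene generator: samples box centers and half-heights.
    (A real world model would predict these from text or image prompts.)"""
    centers = rng.uniform(low=[-5.0, 0.0, -5.0], high=[5.0, 3.0, 5.0],
                          size=(num_objects, 3))
    half_heights = rng.uniform(0.1, 0.5, size=num_objects)
    return centers, half_heights

def inject_ground_constraint(centers, half_heights, ground_y=0.0):
    """Explicit physical rule: every object rests on the ground plane.
    The law is imposed by projection instead of being (hopefully) learned."""
    settled = centers.copy()
    settled[:, 1] = ground_y + half_heights
    return settled

rng = np.random.default_rng(0)
centers, half_heights = propose_layout(4, rng)
settled = inject_ground_constraint(centers, half_heights)
```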

Notable insights & quotes

  • “Spatial intelligence is the capability that allows you to reason, understand, move, and interact in space.” — Fei‑Fei Li
  • “Marble is a generative model of 3D worlds.” — Justin Johnson
  • “The whole history of deep learning is in some sense the history of scaling up compute.” — Justin Johnson
  • “A transformer is natively a model of sets.” — Justin Johnson (positional encodings inject order; see the sketch after this list)
  • Practical point: Marble uses Gaussian splats so scenes can render in real time on mobile and VR devices.
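
The “model of sets” point can be checked directly: self‑attention without positional information is permutation‑equivariant, and adding positional encodings is what injects order. Below is a minimal numpy sketch, not tied to any particular transformer implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional information."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

def sinusoidal_pos_encoding(n, d):
    """Classic sinusoidal positional encoding: the part that injects order."""
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))                     # a "set" of n token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
perm = rng.permutation(n)

# Without positions, permuting the input just permutes the output:
# attention alone treats the tokens as an unordered set.
assert np.allclose(self_attention(X, Wq, Wk, Wv)[perm],
                   self_attention(X[perm], Wq, Wk, Wv))

# Adding positional encodings breaks that symmetry, so order now matters.
pe = sinusoidal_pos_encoding(n, d)
assert not np.allclose(self_attention(X + pe, Wq, Wk, Wv)[perm],
                       self_attention(X[perm] + pe, Wq, Wk, Wv))
```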

Practical uses & early examples

  • Creative industries: game world prototyping, VFX, film backgrounds, director‑style camera control.
  • Design and architecture: interior design and remodel visualizations (e.g., kitchen/countertop swaps).
  • Robotics and embodied AI: generating synthetic environments and state distributions for training agents, which addresses “data starvation” (a domain‑randomization sketch follows this list).
  • Rapid prototyping: generate scenes from photos/text and iteratively edit to refine layout, materials, lighting.
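
For the robotics/synthetic‑data use case, a common recipe is domain randomization: sample many scene configurations so trained agents see a broad state distribution. The sketch below uses hypothetical parameter names and is not a Marble API.

```python
import random

def sample_scene_config(rng):
    """Randomize the knobs a generated training environment might expose:
    object count, materials, lighting, and camera height."""
    return {
        "num_objects": rng.randint(3, 12),
        "floor_material": rng.choice(["wood", "tile", "concrete", "carpet"]),
        "light_intensity": rng.uniform(0.3, 1.5),
        "camera_height_m": rng.uniform(0.8, 1.8),
    }

def build_synthetic_dataset(n_scenes, seed=0):
    """Sample n_scenes randomized configurations; each would be handed to a
    world/scene generator to render observations for an embodied agent."""
    rng = random.Random(seed)
    return [sample_scene_config(rng) for _ in range(n_scenes)]

dataset = build_synthetic_dataset(100)
```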

Calls to action / who they’re hiring for World Labs

  • World Labs seeks:
    • Deep researchers in world/3D models, physics/dynamics, representations.
    • Engineers (training, inference systems, product engineering).
    • Product and GTM talent (design, UX, business).
  • Try Marble: enable advanced mode for finer editing (UI tip from hosts). Read World Labs’ Marble tech blog for technical details.

Practical recommendations for listeners

  • Try Marble (click “advanced mode” for fine edits) to explore 3D editing, camera recording, and export options.
  • If you’re a researcher: prioritize work on representations for 3D dynamics, hybrid learned+simulation pipelines, and datasets that provide controllable, high‑fidelity multi‑view data.
  • If you’re in academia or policy: advocate for public compute/data resources (national/institutional repositories) to close the resource gap with industry.
  • If you build products: consider Marble and similar tools for rapid prototyping in design, VFX, and synthetic data generation.

Final note

The episode positions spatial intelligence as the next major frontier in AI: it requires combining scalable compute, new data modalities, and novel architectural and systems thinking. Marble is presented as a pragmatic first step—both a public product and a research platform—toward richer, interactive world models.