What Comes After ChatGPT? The Mother of ImageNet Predicts The Future

by Andreessen Horowitz

1h 1m · December 5, 2025

Overview

This Latent Space / a16z episode features Fei‑Fei Li and Justin Johnson (co‑founders of WorldLabs) in conversation about the next frontier beyond large language models: spatial intelligence and generative world models. They introduce Marble, WorldLabs’ public model and demo that generates explorable 3D worlds from multimodal inputs, and discuss the technical, product, and ecosystem implications of building and deploying large‑scale spatial models.

Participants

  • Fei‑Fei Li — Stanford professor, co‑director of the Stanford Institute for Human‑Centered AI, creator of ImageNet, co‑founder of WorldLabs.
  • Justin Johnson — Fei‑Fei’s former PhD student, ex‑professor (UMich), ex‑Meta Research, co‑founder of WorldLabs.
  • Hosts from Latent Space / a16z.

Main topics covered

  • The motivation for moving beyond language models toward spatial/world models.
  • History: ImageNet/AlexNet era, early vision+language work (image captioning, dense captioning).
  • Marble: a practical, public step toward spatial intelligence—what it does and how it’s built.
  • Technical building blocks and architectural ideas (Gaussian splats, transformers, physics integration).
  • Compute, data, and the ecosystem tradeoffs between open academic research and industry productization.
  • Use cases (VFX, gaming, interior design, simulation for robotics) and current limitations.
  • Hiring and research directions WorldLabs is prioritizing.

What Marble (WorldLabs) is — concise description

  • Marble is a generative model/system that produces interactive, explorable 3D worlds from multimodal inputs (text, single or multiple images).
  • Users can edit scenes interactively (change colors, remove objects, reposition things), record camera motion, and export assets.
  • Native representation: Gaussian splats (many semitransparent particles with 3D position/orientation) — chosen for efficient real‑time rendering on consumer devices (phones, VR).
  • Designed to be both a research step toward spatial intelligence and a usable product for creative industries.
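To make the representation concrete: a Gaussian splat scene is essentially a large collection of semitransparent particles, each carrying a position, orientation, extent, color, and opacity. The sketch below is illustrative only; the field names are assumptions, not WorldLabs’ actual schema, and real renderers pack these into contiguous GPU arrays rather than Python objects.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplat:
    """One semitransparent particle in a splat-based 3D scene.

    Field names are illustrative; production renderers store these as
    packed arrays for GPU-friendly batch processing, and often encode
    color as spherical-harmonic coefficients rather than plain RGB.
    """
    position: np.ndarray   # (3,) center in world space
    rotation: np.ndarray   # (4,) unit quaternion, orientation
    scale: np.ndarray      # (3,) per-axis extent of the Gaussian
    color: np.ndarray      # (3,) RGB for this sketch
    opacity: float         # 0..1, alpha-blended back-to-front at render time

# A scene is simply many such particles; density drives both fidelity
# and the rendering budget on phones/VR headsets.
rng = np.random.default_rng(0)
scene = [
    GaussianSplat(
        position=rng.standard_normal(3),
        rotation=np.array([1.0, 0.0, 0.0, 0.0]),  # identity orientation
        scale=np.abs(rng.standard_normal(3)) * 0.1,
        color=rng.random(3),
        opacity=0.8,
    )
    for _ in range(1000)
]
```

This per-particle view makes the fidelity tradeoff in the limitations section tangible: zooming in requires more splats, which directly raises the device’s rendering cost.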

Technical insights and architecture

  • Transformers as set models: A transformer is permutation‑equivariant by default; order/sequence is injected via positional encodings. Thus transformers are naturally models of sets, not strictly sequences.
  • Atomic units in 3D world generation:
    • Marble uses Gaussian splats now (good for realtime rendering and camera control).
    • Other approaches: frame‑by‑frame generation, tokens representing chunks of 3D, or mesh/voxel outputs.
  • Physics and dynamics:
    • Current models primarily fit visual patterns; they may not learn causal physical laws natively.
    • Options to add dynamics: attach physical properties to splats (mass, springs), run classical physics engines on predicted properties, or regenerate scenes after interactions (trading compute for generality).
  • Hardware and primitives:
    • Modern NN stacks exploit matrix multiplication primitives because of GPU design (the “hardware lottery”).
    • Future hardware scaling could motivate different primitives/architectures beyond matrix‑multiply centric designs.
  • Compute & data: Scaling compute (orders of magnitude more than AlexNet era) and richer multimodal data make spatial models more feasible today.
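The “transformers as set models” point above can be demonstrated in a few lines: self-attention without positional encodings is permutation-equivariant, so shuffling the input tokens simply shuffles the outputs the same way. This is a minimal single-head sketch with fixed random weights, not any particular production architecture.

```python
import numpy as np

def attention(X):
    """Single-head self-attention with NO positional encoding.

    X: (n, d) array of token embeddings, treated as a set.
    Projection weights are fixed random matrices for the demo.
    """
    d = X.shape[1]
    rng = np.random.default_rng(0)  # same weights on every call
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    # Row-wise softmax
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 8))   # 5 tokens, dimension 8
perm = rng.permutation(5)

out = attention(X)
out_perm = attention(X[perm])

# Permuting the inputs permutes the outputs identically:
# attention(X[perm]) == attention(X)[perm]
assert np.allclose(out[perm], out_perm)
```

Adding positional encodings to `X` before the projections breaks this symmetry, which is exactly what turns the set model into a sequence model.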
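One of the dynamics options listed above, running a classical integrator over physical properties attached to splats, can be sketched minimally. Everything here (per-splat mass, the ground plane, explicit Euler integration) is an illustrative assumption, not WorldLabs’ method:

```python
import numpy as np

# N splats with predicted physical properties; a classical integrator
# supplies dynamics the generative model did not learn.
rng = np.random.default_rng(2)
N = 1000
pos = rng.standard_normal((N, 3))   # splat centers
vel = np.zeros((N, 3))
mass = np.full(N, 0.01)             # hypothetical per-splat mass (unused by
                                    # gravity alone; non-uniform forces need it)
gravity = np.array([0.0, -9.81, 0.0])

def step(pos, vel, dt=1.0 / 60.0):
    """One explicit-Euler step under gravity.

    A real engine would add collisions, springs, and constraints driven
    by the per-splat properties the model predicts.
    """
    vel = vel + gravity * dt
    pos = pos + vel * dt
    # Crude ground plane at y = -1: clamp position, zero downward velocity.
    below = pos[:, 1] < -1.0
    pos[below, 1] = -1.0
    vel[below, 1] = np.maximum(vel[below, 1], 0.0)
    return pos, vel

for _ in range(120):                # simulate two seconds at 60 fps
    pos, vel = step(pos, vel)
```

The alternative the episode discusses, regenerating the scene after each interaction, avoids this engineering entirely but trades it for inference compute on every state change.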

Use cases and product/market fit

  • Immediate/early product fit:
    • Creative industries: VFX, film backdrops, gaming assets, camera planning/directing (precise camera control, recording).
    • Design and architecture: interior remodeling mockups, layout exploration; early beta users are already experimenting here.
  • Simulation & robotics:
    • Marble can produce synthetic environments for embodied agent training — a middle ground between curated assets and limited real‑world robotics data.
  • Limitations: resolution & density constrained by target device rendering budgets (mobile/VR). High fidelity zoom‑in requires more splats or more expensive rendering hardware.

Research, ecosystem, and policy perspectives

  • Open science vs. industry:
    • Fei‑Fei emphasizes the importance of open datasets and benchmarks (examples: her Stanford “Behavior” dataset for robotics simulation) alongside industry product work.
    • Main concern: under‑resourcing of academia to explore “wacky,” long‑horizon ideas (theoretical, hardware, new learning paradigms).
  • Suggested research directions:
    • New architectures that align with future hardware/topologies (beyond monolithic GPU assumptions).
    • Methods to integrate causal/dynamic reasoning into world models.
    • Better simulation and synthetic data generation for embodied learning.
  • Policy note: Fei‑Fei’s involvement with proposals like the National AI Research Resource (NAIRR) to broaden public research access to compute and data.

Notable quotes / pithy insights

  • “Transformers are actually models of sets, not sequence models — positional encodings are what make them sequential.” — Justin Johnson / Fei‑Fei discussion.
  • “Spatial intelligence is the capability that allows you to reason, understand, move, and interact in space.” — Fei‑Fei Li.
  • “If a physics engine were perfect, we would have no need to build these models — classical engines don’t cover the generality we want.” — on why learned world models matter.

Limitations and open questions highlighted

  • Do learned models actually “understand” physics (causal laws) or only fit visual patterns? How far can emergent capabilities scale to causal reasoning?
  • Tradeoffs between regenerating scenes for interactions (compute heavy) versus embedding simulatable physical properties (engineering complexity).
  • How to scale splat density and rendering fidelity while retaining realtime performance on consumer devices?
  • Interplay of language and spatial intelligence: complementary modalities; language remains important as an interface but may be lossy compared to pixel/spatial signals.

Calls to action (for researchers, engineers, contributors)

  • WorldLabs is hiring across research (large-scale world models), engineering (training, inference, systems, product), and go‑to‑market/product roles.
  • Areas where contributions are helpful:
    • Architectures/algorithms for spatial/dynamic reasoning.
    • Systems engineering for large model training and efficient cross‑device inference.
    • Synthetic data and simulation pipelines for embodied agent training.
    • Human‑centered product design and tooling for creative/industrial users.

Key takeaways

  • Spatial intelligence is a distinct and complementary frontier to language models; it focuses on embodied, 3D/4D understanding and interaction.
  • Marble is a public, early step: multimodal inputs → interactive, explorable 3D scenes built from Gaussian splats, aimed at creative and simulation use cases today while serving as a research testbed.
  • Transformers, compute scaling, and available data make world models increasingly tractable — but integrating causality/physics and achieving high‑fidelity interactive dynamics remain active challenges.
  • The ecosystem needs a mix of open academic work and industry productization; resourcing academia and building public compute/data infrastructure remain important for healthy long‑term progress.

If you want to explore Marble or see technical details, WorldLabs’ Marble demo and the Marble technical blog are the public starting points mentioned in the conversation.