Even the chip makers are making LLMs

by The Stack Overflow Podcast

26 min · March 10, 2026

Overview of The Stack Overflow Podcast — "Even the chip makers are making LLMs"

This episode features Kari Briski, VP of Generative AI at NVIDIA, discussing why and how a GPU/chip company builds large language models (LLMs), and the tight hardware–software co-design that enables modern AI systems. Topics include NVIDIA's Nemotron model family, numeric-precision choices, inference and memory tradeoffs, new model architectures (hybrids and Mixture-of-Experts), disaggregated serving, context/memory engineering, and the benefits of releasing models, weights, data, and tooling as open source.

Key topics covered

  • Why NVIDIA (a chip maker) develops LLMs: to understand and optimize real workloads via extreme hardware–software co-design.
  • Nemotron model family: Nano, Super, Ultra, plus vision, speech, and embedding models in the same family.
  • Numeric precision evolution (FP16, FP8, FP4) and benefits of training in reduced precision vs post-training quantization.
  • Hybrid model architectures (state-space + transformer) and Mixture-of-Experts for token efficiency.
  • System-level concerns: disaggregated serving, communication layers, context memory engines and million-token contexts.
  • Open-source release: architectures, weights, datasets, libraries, and gym environments — and the ecosystem benefits.
  • Roadmap timing (Nano v3, Super, Ultra around GTC) and plans to open up contribution workflows.

Technical highlights and takeaways

  • Extreme co-design: NVIDIA runs models internally to feed back requirements into GPU, networking, and storage design — tight, engineer-to-engineer iterations to inform future hardware (e.g., the context memory engine announced at CES).
  • Numeric precision:
    • Models were historically trained at FP16, with inference run at FP8 on some platforms.
    • Newer GPUs/architectures (e.g., Blackwell) support FP4-like precisions; training directly at lower precision can preserve accuracy better than quantizing a high-precision trained model (post-training quantization can cost ~1–2% accuracy).
    • Lower precision reduces model memory footprint substantially (roughly up to half when moving from FP16 to int8-like representations), enabling more efficient multi-GPU/multi-node deployments and reduced latency.
  • Model architecture innovations:
    • Hybrid models combining state-space modules (efficient sequence processing) with transformers improve token efficiency and inference scaling; state-space layers avoid the quadratic attention cost of dense transformers over long sequences.
    • Mixture-of-Experts (MoE) recipes are used to scale capacity while managing compute cost.
  • Serving and systems:
    • Disaggregated serving approaches (e.g., splitting prefill, decode, and other stages across different GPU SKUs) improve utilization and efficiency for large inference deployments.
    • Context memory engineering and storage hierarchies are critical for very long contexts (million-token contexts) and agentic systems that maintain/recall memories.
  • Agents and systems-of-models:
    • Real-world, agentic AI uses multiple specialized models (ASR, TTS, embeddings, re-rankers, domain experts). GPUs are viewed as general-purpose building blocks rather than per-model-specialized chips.
    • Memory, caching, retrieval, and agent orchestration become central system-design problems (akin to microservices / object-oriented patterns applied to autonomous agents).
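The memory arithmetic behind the precision points above is simple enough to sketch. In this illustration, the 8B parameter count is a hypothetical example, not a figure from the episode:

```python
# Back-of-envelope weight-memory footprints at different numeric precisions.
# The 8B-parameter count is a hypothetical example, not from the episode.

def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Gigabytes (10^9 bytes) needed to hold the model weights alone."""
    return num_params * bits_per_param / 8 / 1e9

PARAMS = 8_000_000_000  # hypothetical 8B-parameter model

for name, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{name}: {weight_memory_gb(PARAMS, bits):.0f} GB")
# Halving precision halves the weight footprint: 16 GB -> 8 GB -> 4 GB.
```

The same halving applies to KV-cache and activation memory, which is why reduced precision also helps the million-token-context and multi-GPU serving scenarios described above.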
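The Mixture-of-Experts idea in the bullets above can be sketched as a gate that activates only the top-k experts per token. Everything here (the expert count, gate scores, and k=2) is an illustrative toy, not NVIDIA's actual recipe:

```python
import math

def top_k_route(gate_logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights.
    Only these k experts run for this token, so per-token compute stays roughly
    constant while total model capacity grows with the expert count."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Toy example: 8 experts, one token routed to the 2 most relevant of them.
routing = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
```

Because only k experts run per token, parameter count (capacity) can scale up without a proportional increase in per-token compute, which is the token-efficiency argument above.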

Open-source & ecosystem impacts

  • NVIDIA released full open-source stacks for Nemotron: model architectures, model weights, training datasets, libraries, and gym environments.
  • Benefits observed:
    • Enterprises can audit the training data, reducing legal and trust concerns (e.g., liability around unknown training data).
    • Partners and domain specialists can fine-tune, generate domain-specific data, and build verifiers/gym environments (example: ServiceNow built a domain model and gym envs).
    • Community validation and red-teaming accelerate research, bug discovery, and architecture validation.
    • Model builders value datasets and gym environments; vendors also engage around GPU optimization and scaling best practices.
  • Future: NVIDIA plans to further open contributions (eventual PR/architecture contribution workflows) so external researchers can propose changes into their model plan-of-record.

Roadmap & practical next steps

  • Nemotron releases timeline (as discussed):
    • Nano v3 — released in December
    • Super — rolling out in early February
    • Ultra — planned for around April (post-GTC)
  • Where to try and learn more:
    • Hugging Face (for model hosting and community)
    • NVIDIA Developer pages
    • NVIDIA GTC events (attend to meet researchers and developers)
  • Short-term suggestion for practitioners:
    • Use released datasets and gym environments as a trustworthy bootstrap for domain fine-tuning.
    • Consider training at lower precision when possible to retain accuracy and reduce memory demands, rather than relying only on post-training quantization.
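To make the post-training-quantization caveat concrete, here is a minimal symmetric int8 round-trip — a generic textbook sketch, not NVIDIA's quantization pipeline. The rounding noise it introduces is the kind of error that shows up as the ~1–2% accuracy loss mentioned earlier:

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor int8 quantization: map floats onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Map int8 codes back to floats; the rounding error is irrecoverable."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.003, 0.98, -0.31]  # toy weight values
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)

# Each recovered weight is off by up to scale/2; accumulated over billions of
# weights, this rounding noise is the source of post-training accuracy loss.
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
```

Training in the target precision lets the optimizer adapt to this noise during training rather than absorbing it all at once afterward, which is the rationale behind the suggestion above.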

Notable quotes and soundbites

  • “We’re not just a chip company, we’re a full stack company.”
  • “Extreme co-design” = daily, close, engineer-to-engineer feedback loops so hardware architects get actionable model-driven requirements early in the plan-of-record.
  • “One model does not rule them all” — modern AI systems are agentic stacks of multiple specialized models.
  • “We believe this is a new type of software development platform” — models and libraries should follow a software lifecycle (updates, bug fixes, PRs).

Short list of action items / links mentioned

  • Intrinsic + partners robotics competition: intrinsic.ai/stack — register by April 17 (prize pool $180,000).
  • Explore Nemotron and related resources on Hugging Face and the NVIDIA Developer pages.
  • Attend NVIDIA GTC (March/April timeframe) to see demos and talk to researchers.

Episode extras

  • Stack Overflow shoutout: “Fourth Iceman” earned a Populist badge for an answer on centering text in Pygame.

If you want a distilled one-liner summary: NVIDIA builds LLMs to drive hardware–software co-design, releases a full open-source model stack (Nemotron) to accelerate domain specialization and system-level innovation, and focuses heavily on precision, memory, and serving architectures to scale real-world, agentic AI systems.