Overview of “Do you have what it takes to run AI in production?”
In this live Stack Overflow Podcast episode from HumanX, host Ryan Donovan talks with Peter Salanki, CTO and co-founder of CoreWeave, about what it takes to run AI workloads in production. The conversation focuses on why AI infrastructure differs from traditional cloud architecture, with particular emphasis on networking, memory bandwidth, scheduling, power delivery, and the operational complexity of large-scale GPU clusters.
Key takeaways
AI infrastructure is not traditional cloud infrastructure
- Traditional hyperscalers are built around abstraction, redundancy, and general-purpose web workloads.
- AI training and inference often require tight synchronization across many GPUs, where a single failure can break the whole job.
- Instead of designing systems that “never fail,” AI infrastructure must be built to:
  - detect failures quickly,
  - isolate the broken component,
  - and restart without losing progress.
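The detect, isolate, restart loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not CoreWeave's implementation: `train_with_recovery`, `run_step`, `WorkerFailure`, the in-memory checkpoint, and the simulated failure are all hypothetical names and stand-ins.

```python
import copy


class WorkerFailure(Exception):
    """Raised when a (simulated) worker dies mid-step."""

    def __init__(self, worker_id):
        super().__init__(f"worker {worker_id} failed")
        self.worker_id = worker_id


def train_with_recovery(total_steps, workers, run_step, checkpoint_every=10):
    """Run `total_steps` steps, resuming from the last checkpoint on failure.

    `run_step(step, workers, state)` is a hypothetical callback that updates
    `state` and raises WorkerFailure if a worker breaks.
    """
    state = {"step": 0, "loss": None}
    checkpoint = copy.deepcopy(state)  # stand-in for durable storage
    healthy = list(workers)
    restarts = 0

    while state["step"] < total_steps:
        try:
            run_step(state["step"], healthy, state)
            state["step"] += 1
            if state["step"] % checkpoint_every == 0:
                checkpoint = copy.deepcopy(state)
        except WorkerFailure as err:
            # Detect: the exception is the failure signal.
            # Isolate: drop the broken worker from the pool.
            healthy = [w for w in healthy if w != err.worker_id]
            if not healthy:
                raise RuntimeError("no healthy workers left") from err
            # Restart: roll back to the last checkpoint, not to step 0.
            state = copy.deepcopy(checkpoint)
            restarts += 1
    return state, healthy, restarts


def flaky_step(step, workers, state):
    # Simulate worker 2 dying the first time step 25 is attempted.
    if step == 25 and 2 in workers:
        raise WorkerFailure(2)
    state["loss"] = 1.0 / (step + 1)


final_state, healthy, restarts = train_with_recovery(40, [0, 1, 2, 3], flaky_step)
print(final_state["step"], healthy, restarts)
```

In this run the job loses worker 2 at step 25, rolls back to the checkpoint taken at step 20, and finishes on the three remaining workers. The checkpoint interval sets how much work is lost per failure (here at most ten steps).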
Networking is the biggest bottleneck
- As GPU compute gets faster, the network must scale with it.
- The challenge is not just bandwidth, but also:
  - cable count,
  - optics vs. copper trade-offs,
  - heat,
  - operational complexity,
  - and synchronization overhead.
- At smaller scales, electrical interconnects like NVIDIA’s NVLink help within a rack or “scale-up domain.”
- At larger scales, systems still rely on optical networking and more traditional scale-out architecture.
- Salanki notes that the network is often the hardest part because it constrains everything else.
Memory bandwidth matters more than memory size
- AI workloads often keep large models in memory and repeatedly access them, making memory bandwidth a major bottleneck.
- Techniques like mixture of experts help by avoiding activation of every parameter for every request.
- Scaling memory often shifts pressure back to the network, so the bottleneck simply moves rather than disappears.
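A rough back-of-the-envelope calculation shows why bandwidth, rather than capacity, sets the ceiling. The sketch below assumes that generating one token requires reading every active parameter from memory once, and it ignores KV-cache traffic, batching, and compute time. The model sizes and the 3 TB/s figure are illustrative assumptions, not numbers from the episode.

```python
def max_decode_tokens_per_sec(active_params, bytes_per_param, mem_bandwidth_bytes_per_sec):
    """Upper bound on single-stream decode speed when weight reads dominate.

    Assumes every active parameter is read from memory once per generated
    token; real systems add KV-cache reads and use batching to amortize this.
    """
    bytes_per_token = active_params * bytes_per_param
    return mem_bandwidth_bytes_per_sec / bytes_per_token


# Illustrative numbers only (assumptions, not from the episode).
BANDWIDTH = 3e12       # 3 TB/s of accelerator memory bandwidth
BYTES_PER_PARAM = 2    # 16-bit weights

# Dense model: all 70e9 parameters are touched for every token.
dense = max_decode_tokens_per_sec(70e9, BYTES_PER_PARAM, BANDWIDTH)

# Hypothetical mixture-of-experts model: one tenth of the parameters active per token.
moe = max_decode_tokens_per_sec(7e9, BYTES_PER_PARAM, BANDWIDTH)

print(f"dense: {dense:.1f} tokens/s, MoE: {moe:.1f} tokens/s")
```

Under these assumptions the bound scales inversely with the number of active parameters, which is why activating fewer experts per request helps. The full set of experts still has to live in memory somewhere, often spread across devices, which is one way the pressure moves back to the network.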
Bigger GPU clusters help, but only when the use case justifies it
- Large clusters can speed up pre-training and some inference workloads.
- But a 100,000-GPU cluster is not automatically the right answer:
  - reliability becomes much harder,
  - scheduling gets complex,
  - and utilization becomes critical.
- The real question is often whether a team is using those GPUs efficiently, not just whether they can obtain them.
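The reliability point can be made concrete with simple probability. The sketch below assumes that GPU failures are independent and that the job is tightly synchronized, so any single failure interrupts it. The per-GPU daily failure probability is a hypothetical value chosen for illustration.

```python
def prob_job_survives(num_gpus, per_gpu_daily_failure_prob, days=1.0):
    """Probability that no GPU fails during the run.

    Assumes independent failures and a job in which any single GPU
    failure interrupts every participant.
    """
    return (1.0 - per_gpu_daily_failure_prob) ** (num_gpus * days)


P_FAIL = 1e-4  # hypothetical per-GPU daily failure probability

for n in (8, 1_000, 100_000):
    p = prob_job_survives(n, P_FAIL)
    print(f"{n:>7} GPUs: P(no failure in a day) = {p:.6f}")
```

With the same per-GPU failure rate, a failure that is rare on a single node becomes close to a daily certainty at 100,000 GPUs, which is why fast detection and cheap restarts matter more as the cluster grows.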
Scheduling and utilization are major operational challenges
- A key problem in AI infrastructure is balancing:
  - availability,
  - cost,
  - preemption,
  - priorities,
  - and utilization.
- The episode references:
  - Slurm as the legacy HPC scheduler,
  - and newer cloud-native scheduler efforts in the Kubernetes ecosystem.
- CoreWeave is building tooling around scheduling and observability so customers can focus on model work rather than infrastructure management.
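To make the trade-off between priorities, preemption, and utilization concrete, here is a toy scheduler. It is a sketch only and does not reflect how Slurm, Kubernetes schedulers, or CoreWeave's tooling work; the `Job` and `Cluster` classes and the job names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    gpus: int
    priority: int            # higher number wins
    preemptible: bool = True


@dataclass
class Cluster:
    total_gpus: int
    running: list = field(default_factory=list)

    def free_gpus(self):
        return self.total_gpus - sum(j.gpus for j in self.running)

    def utilization(self):
        return 1.0 - self.free_gpus() / self.total_gpus

    def submit(self, job):
        """Try to start `job`.

        Returns the list of jobs preempted to make room, or None if the
        job cannot be placed (in which case nothing is evicted).
        """
        # Candidate victims: preemptible jobs with strictly lower priority,
        # lowest priority first.
        victims = sorted(
            (j for j in self.running if j.preemptible and j.priority < job.priority),
            key=lambda j: j.priority,
        )
        preempted = []
        free = self.free_gpus()
        for victim in victims:
            if free >= job.gpus:
                break
            preempted.append(victim)
            free += victim.gpus
        if free < job.gpus:
            return None
        for victim in preempted:
            self.running.remove(victim)
        self.running.append(job)
        return preempted


cluster = Cluster(total_gpus=16)
cluster.submit(Job("batch-eval", gpus=8, priority=1))
cluster.submit(Job("research", gpus=8, priority=2))
evicted = cluster.submit(Job("prod-inference", gpus=8, priority=10, preemptible=False))
print([j.name for j in evicted], cluster.utilization())
```

In this example the cluster is full when the high-priority inference job arrives, so the lowest-priority preemptible job is evicted to make room. Utilization stays at 100 percent, but the evicted job loses whatever progress it had not checkpointed, which is the cost side of the balance.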
The industry is shifting toward flexible, heterogeneous data centers
- AI workloads are no longer just “buy the latest GPU and connect a bunch of them.”
- Data centers now need to support a mix of:
  - pre-training,
  - reinforcement learning (RL),
  - inference,
  - agent workloads,
  - and CPU-heavy evaluation jobs.
- CoreWeave designs for late binding and flexibility:
  - liquid cooling where useful,
  - adaptable server mixes,
  - and infrastructure that can change as hardware and workload patterns evolve.
Supply chain and power are still constrained, but in different ways
- The biggest bottleneck is not necessarily raw power generation.
- The real challenge is getting power into usable form:
  - substations,
  - transformers,
  - low-voltage delivery,
  - HVAC,
  - generators,
  - and the skilled labor to install and maintain all of it.
- Supply chain pressure shifts over time:
  - GPUs,
  - then chips and shells,
  - then NAND,
  - then DRAM and SSDs.
- There is also an active secondary market for components, though CoreWeave mostly buys directly from manufacturers at its scale.
Advice for developers building AI products
Don’t overcomplicate your infrastructure too early
- Start simple and use existing tools whenever possible.
- Avoid trying to build:
  - massive clusters on day one,
  - custom inference stacks before you have users,
  - or overly complex serving architectures too early.
- The space evolves so fast that today’s architecture may need rewriting within months anyway.
Focus on your model and product, not infrastructure mechanics
- If you don’t need to build your own inference stack, use a provider or platform that already handles the complexity.
- The goal should be to build the model or application, not fight the underlying infrastructure.
Be intentional with AI coding tools
- AI assistants like coding copilots are useful productivity tools, but engineers should still understand every line they ship.
- Salanki stresses that teams should not rely on AI-generated code without understanding the system architecture.
Vet infrastructure providers carefully
- Because AI infrastructure is crowded and supply-constrained, it’s important to verify:
  - security,
  - reliability,
  - data handling practices,
  - and provider expertise.
- This is especially important when sensitive or personal data is involved.
Notable themes and insights
“The problem never really goes away”
The episode repeatedly returns to a central idea: in AI infrastructure, solving one bottleneck usually reveals the next one. Network, memory, power, supply chain, and scheduling all compete to become the limiting factor.
Flexibility beats rigid planning
Because models, frameworks, and workload patterns change so quickly, infrastructure must be adaptable. Salanki argues that rigid, over-optimized systems tend to become obsolete quickly.
AI infrastructure is a full-stack engineering challenge
Running AI in production requires much more than GPUs:
- networking,
- cooling,
- power delivery,
- scheduling,
- observability,
- supply chain management,
- and software adaptability all matter.
Final recommendation from the episode
If you’re building AI applications:
- start small,
- use proven infrastructure where possible,
- scale only when the use case justifies it,
- and make sure your team understands both the model and the system underneath it.
The overall message: AI production is less about “having more compute” and more about orchestrating the entire stack efficiently and flexibly.
