Kaizen! Let it crash (Friends)

Summary of Kaizen! Let it crash (Friends)

by Changelog Media

1h 41m · January 17, 2026

Overview of Kaizen! Let it crash (Friends)

This episode (Kaizen 22) of Changelog & Friends features Gerhard Lazu leading a deep-dive into the “let it crash” philosophy (Erlang/BEAM) and a real-world incident response story: debugging repeated Varnish/edge crashes and odd traffic patterns for the Changelog CDN (Pipe Dream). The conversation mixes operational detail (what broke, why, and how it was fixed), observability lessons, configuration pitfalls on Fly.io, and follow-up mitigations for harmful traffic patterns that were discovered.

Key topics discussed

  • "Let it crash" philosophy (Erlang/Beam supervision trees) and tradeoffs vs defensive coding (Go-style error handling)
  • Varnish cache behavior and causes of out-of-memory (OOM) events
  • How Fly.io surfaces OOM events and kills offending processes without rebooting VMs
  • Investigation of numerous out-of-memory / thread kills impacting the CDN
  • Root cause analysis: large MP3s, memory fragmentation, aborted/slow client downloads, and cache storage behavior
  • Implemented fixes: Varnish file-backed cache, tuning thread pools/limits, Fly deployment sizing, tests (VTC, mocking), and CI checks
  • Follow-up problem: intermittent hanging in specific Fly regions due to Fly proxy misconfiguration (connections vs requests) and HTTP/2 quirks
  • Abuse / heavy-download problem: one podcast episode (episode 456) being repeatedly downloaded by thousands of IPs across regions, generating huge bandwidth and cost
  • Operational lessons on observability, throttling, and defensive platform configuration

Technical findings & root causes

  • Symptom: repeated varnish child/thread kills (OOM events). Between October and December the system saw ~43 OOM events for the service discussed.
  • Memory behavior: instance memory usage spiked from ~4 GB to ~16 GB; CPU and memory contention followed, then the offending thread/process was killed and Varnish quickly restarted the worker.
  • Primary root cause:
    • Thousands of large MP3 files (30–100+ MB each) are cached; many requests trigger large backend fetches.
    • Memory fragmentation and “holes” mean large objects sometimes cannot fit, producing forced evictions (LRU nukes). Varnish exposes "n_lru_nuked".
    • Clients requesting large files often abort (client disconnects or range requests): Varnish fetches the backend data but the client disappears, leaving Varnish to buffer partial objects and adding memory pressure.
    • Default behaviors such as beresp.do_stream and "uncacheable" objects complicate buffering.
  • Region/traffic characteristics:
    • Hot regions: SJC (San Jose) and NRT (Tokyo) show high load.
    • Observed peak incoming to Varnish: ~2.29 Gbps in, but only ~145 Mbps out (many clients aborting or downloading only part of the file; see the quick arithmetic after this list).
    • One instance served ~3 TB in 5 days; over 60 days that popular episode produced tens of terabytes (e.g., SJC ~30 TB, Tokyo ~15 TB reported for the hot episode window).
  • Disk/storage issues:
    • The file cache was turned on, but disk allocator failures and fragmentation were noted; the disk reported ~97% full (~48 GB used in the example).
    • Varnish fell back to the in-RAM (SMA/malloc) store when disk preallocation failed, causing thrash.
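
A quick back-of-the-envelope check on that in/out asymmetry (the ratio is arithmetic on the figures above, not a number quoted in the episode): 2.29 Gbps in divided by ~145 Mbps out is roughly 16, i.e. about sixteen bytes fetched from the backend for every byte actually delivered to a listener, which is consistent with clients aborting large MP3 downloads while Varnish continues the backend fetch.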

Fixes implemented (PRs, configuration & tests)

  • PR #44: introduced storing large MP3s in a Varnish file-backed cache instead of keeping everything in RAM; a refactor and many tests were added (VCL splits, asset/backend tests). See the VCL sketch after this list.
  • Tuned Varnish internals and runtime parameters:
    • thread pool sizing (min/max), workspace and backend workspace sizes, memory limits
    • file cache size tied to the instance disk (provisioned disk → allocated file cache)
    • adjusted Varnish to use a reasonable percentage of available memory and disk
  • Fly.io deployment changes:
    • Use appropriately sized instances across regions (deploy then scale down cold regions)
    • Disabled some HA patterns when needed, set per-region instance sizing, and configured environment variables for Varnish sizing
  • Fix for hanging requests in specific regions (e.g., EWR/Newark):
    • Found that the Fly config had its concurrency type set to "connections" (long-lived) instead of "requests", which caused thousands of long-lived connections to block new traffic
    • Corrected it to "requests", added idle timeouts, and removed the conflicting combination of "services" and "http_service" blocks in the Fly config
    • Added an hourly CI check (using hurl) that exercises all regions and detects hangs/timeouts
  • Observability: expanded Varnish stats, Honeycomb instrumentation, and dashboards to detect LRU nukes, disk allocation failures, CPU/memory spikes, and region anomalies
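
To make the file-cache approach concrete, here is a minimal VCL sketch of the general technique behind PR #44, assuming Varnish 4.1+ started with two stevedores; the stevedore names, 10 MB threshold, and backend are illustrative assumptions, not the actual Pipe Dream configuration:

```vcl
vcl 4.1;

import std;

# Illustrative backend, not the real origin.
backend origin {
    .host = "origin.example.internal";
    .port = "80";
}

sub vcl_backend_response {
    # Route large audio objects to the file-backed store so they do not
    # evict small, hot objects from RAM. Assumes varnishd was started
    # with two stevedores, e.g.:
    #   varnishd -s mem=malloc,2G -s disk=file,/var/lib/varnish/cache.bin,40G
    if (beresp.http.Content-Type ~ "^audio/" &&
        std.integer(beresp.http.Content-Length, 0) > 10485760) {  # > 10 MB
        set beresp.storage = storage.disk;
    }
}
```

In the real setup the file cache size is tied to the provisioned instance disk, as noted above.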

Results & metrics after changes

  • One busy Varnish instance: stable uptime ~5+ days with zero child panics or thread failures after tuning
  • Hit ratio ~90%+ (good cache efficiency); example: 93% hit ratio reported on one instance
  • Reduced LRU nukes and better memory behavior for many regions after file cache and tuning
  • Some storage/disk fragmentation issues remained and required further tuning or bigger disk file preallocation
  • CI hourly check identifies region-specific problems; running a global "check-all" can use substantial bandwidth (it downloads MP3s)

Outstanding issues and recommended action items

  1. Throttling / rate-limiting for heavy downloaders
    • Implement per-resource and per-IP throttling (a Varnish throttle VMOD such as vsthrottle, or similar; see the VCL sketch after this list) that can:
      • limit downloads for large MP3s
      • rate-limit per-IP or per-URL signature (e.g., the heavily requested episode)
      • ensure fair usage without unduly penalizing legitimate users
  2. Consider targeted mitigations for single hot episodes
    • Quick pragmatic fix: redirect the problematic episode(s) to origin (R2), or issue a temporary signed CDN redirect, so the Varnish/CDN layer is bypassed (also sketched after this list)
    • Longer term: per-episode toggles (turn off CDN caching for a given episode, or serve it from origin)
  3. Disk capacity and allocator tuning
    • Increase the file cache size, allocate more disk, or pre-allocate a contiguous file region to avoid disk allocation failures
    • Monitor SMF c_fail (file-storage allocation failures) and disk fragmentation stats, and scale accordingly
  4. Fly.io config hardening
    • Add (or request) better validation/warnings in the Fly CLI for common misconfigurations (e.g., mixing services and http_service blocks, or concurrency type "connections" vs "requests")
    • Use a region-aware sizing strategy, but be aware that Fly requires the same instance size across regions at deploy time → use a rolling replace to tune hot regions
  5. Monitoring & observability
    • Continue hourly region checks (but be mindful of bandwidth/costs)
    • Group and analyze logs by user agent and client IP to identify scraping clients and trends (keeping in mind that UAs and IPs can be spoofed)
  6. Policy and enforcement
    • If throttling proves insufficient, consider blocking or netblock-level mitigation as a last resort (recognizing the risk of collateral damage)
    • Communicate with heavy users if they are identifiable (a polite request to stop, if the abuse turns out to be an odd automated job)
  7. Test/CI considerations
    • Keep tests (VCL, VTC, mocked backends) and pre-deploy checks to avoid regressions
    • Ensure benchmarking/diagnostic routines exclude internal test IPs from production metrics
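
As a sketch of what items 1 and 2 above could look like in VCL: the example below assumes the vsthrottle VMOD from varnish-modules is available, and the rate limits, key scheme, episode path, and origin hostname are illustrative assumptions rather than the actual Pipe Dream setup.

```vcl
vcl 4.1;

import vsthrottle;

# Illustrative backend, not the real origin.
backend origin {
    .host = "origin.example.internal";
    .port = "80";
}

sub vcl_recv {
    # Item 1: per-client, per-URL rate limiting for MP3 requests.
    # Allow at most 10 requests per client+URL in a 60s window, then
    # deny that key for a further 5 minutes.
    if (req.url ~ "\.mp3$" &&
        vsthrottle.is_denied(client.identity + ":" + req.url, 10, 60s, 300s)) {
        return (synth(429, "Too Many Requests"));
    }

    # Item 2: bypass the CDN for one hot episode by redirecting the
    # client straight to origin (e.g. R2). The path is hypothetical.
    if (req.url ~ "^/podcast/456/") {
        return (synth(752));
    }
}

sub vcl_synth {
    if (resp.status == 752) {
        set resp.status = 302;
        set resp.http.Location = "https://origin.example.com" + req.url;
        return (deliver);
    }
}
```

Keying the throttle on client plus URL limits repeat downloads of the hot episode without penalizing listeners who fetch other episodes normally, which matches the fair-usage goal in item 1.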

Notable insights & quotes

  • “Let it crash — but in a controlled way.” The team embraces letting parts fail (supervision) while ensuring the whole system remains stable.
  • Because OOM events killed individual worker threads/processes rather than crashing whole VMs, the repeated failures stayed tolerable: fast restarts with minimal impact.
  • Observability is decisive: being able to see per-request, per-region, per-IP behavior (Honeycomb + varnishstat + Fly metrics) made diagnosing nuanced problems feasible.
  • The internet now shows new traffic patterns (LLM crawlers, bots, automated scrapers) that can produce very high-bandwidth bursts that didn’t exist years ago.

Resources & sponsors mentioned

  • Erlang book / Let it Crash: Fred Hebert (Ferd) — recommended reading for the philosophy and approach
  • Fly.io — platform used to run the instances; Fly surfaced process OOM events
  • Namespace.so & Depot.dev — sponsors discussed (build/CI speedups)
  • Squarespace — sponsor for website builder
  • Tools mentioned: Varnish (file backend, VMODs), Honeycomb (observability), hurl (CI checks), AppCleaner/Mole tools for macOS (light personal bits), Flyctl CLI

Bottom line

  • The team fixed the primary cause (memory pressure from many large MP3s) by moving large objects to a file-backed cache and tuning the Varnish and Fly config. That greatly reduced thread kills and improved stability.
  • New classes of problems were exposed: Fly proxy misconfigurations leading to region-specific hangs and deliberate/accidental heavy downloads of a single episode by thousands of IPs producing large bandwidth/cost impacts.
  • Observability, CI checks, and defensive edge protections (throttling/rate-limiting) are now the key next steps to keep the cache healthy and equitable for real users while mitigating abusive clients.

Actionable short checklist for teams facing the same problems:

  • Add file-backed caching for very large cacheable objects (audio/video).
  • Tune Varnish thread pools, workspace sizes, and memory/disk usage based on instance size.
  • Ensure edge proxy concurrency is set to requests (not connections) for HTTP apps, and set reasonable idle timeouts.
  • Add hourly (or otherwise regular) region checks for end-to-end behavior but watch bandwidth costs.
  • Implement per-resource throttling (a throttle VMOD) or edge rate limits; consider per-episode redirects to origin if needed.
  • Use observability (group by UA/IP/URL) to find scrapers and abusive patterns; act pragmatically (throttle first, block only if necessary).

If you want a quick pointer to the most relevant PRs and dashboards mentioned: look at PR #44 (file cache + Varnish tuning), PR #49 (Fly/timeout/config fixes), and the hourly hurl check in the Pipe Dream repo referenced in the show notes.