RAG That Survives Production


by Massive Studios

January 14, 2026 · 22 min

Overview of RAG That Survives Production (The Cloudcast)

This episode of The Cloudcast (hosts Aaron Delp and Brian Gracely) features Adam Kamor, co-founder and head of engineering at Tonic.ai. The discussion covers the practical paths for customizing LLMs (prompting, RAG, and fine-tuning) and the real-world challenges of getting Retrieval-Augmented Generation (RAG) systems into production: data quality, privacy, evaluation, ROI, and operational constraints. Adam also describes Tonic's work on data de-identification and the company's open-source tool, Tonic Validate, for tracking LLM output quality.

Key takeaways

  • Start simple: try prompting first, then RAG, then fine‑tuning if needed. “Do the simplest thing that works.”
  • RAG can yield quick, mediocre gains, but production-grade RAG is hard due to data prep, chunking, retrieval design, and pipelines.
  • Data quality (cleaning, removing PII/PHI) is often the dominant engineering challenge and a gating factor for RAG and fine-tuning.
  • Evaluating LLM/RAG output is difficult—automated metrics are imperfect and often rely on other LLMs; human-labeled samples remain the most reliable baseline.
  • ROI / risk calculus matters: some use cases (back‑office automation) show clear ROI; others (web chat widgets) are harder to justify.
  • LLMs are not reliably good at detecting PII; specialized tools or NER models are often required at scale.
  • Track trends over time (not just absolute scores) when monitoring model quality; this is the main value of frameworks like Tonic Validate (a minimal trend-tracking sketch follows this list).
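
The trend-tracking takeaway is easy to operationalize even before adopting a framework. A minimal sketch in plain Python (not Tonic Validate's API; the scores and regression threshold are illustrative placeholders):

```python
# Minimal sketch of trend monitoring for LLM/RAG output quality.
# The scores would come from a human-labeled comparison or an evaluation
# framework such as Tonic Validate; the numbers here are placeholders.
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class EvalRun:
    run_at: datetime
    scores: list[float]  # one score per evaluated question/answer pair

    @property
    def mean_score(self) -> float:
        return mean(self.scores)


def detect_regression(history: list[EvalRun], window: int = 3, tolerance: float = 0.05) -> bool:
    """Flag a regression when the latest run drops noticeably below the
    rolling mean of the previous `window` runs (trend, not absolute score)."""
    if len(history) <= window:
        return False
    baseline = mean(r.mean_score for r in history[-window - 1:-1])
    return history[-1].mean_score < baseline - tolerance


# Usage: append a run after each evaluation pass and check the trend.
history = [
    EvalRun(datetime(2026, 1, 1), [0.82, 0.79, 0.85]),
    EvalRun(datetime(2026, 1, 8), [0.81, 0.80, 0.84]),
    EvalRun(datetime(2026, 1, 15), [0.83, 0.78, 0.86]),
    EvalRun(datetime(2026, 1, 22), [0.70, 0.66, 0.71]),
]
print(detect_regression(history))  # True: the latest run fell below the trend
```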

Topics discussed

  • Paths to customize LLMs
    • Prompting (zero/few-shot)
    • RAG (documents → embeddings → retrieval → LLM; see the pipeline sketch after this list)
    • Fine-tuning (additional training on your data)
  • RAG evolution and the “trough of disillusionment”
  • Data preparation and privacy concerns (PII vs PHI, HIPAA)
  • Evaluation and validation of LLM outputs (human sampling vs automated evaluators)
  • Practical operational considerations (deployment, GPUs, costs)
  • Synthetic data and de‑identification strategies
  • Tonic.ai’s product positioning and Tonic Validate (open source)
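
The documents → embeddings → retrieval → LLM flow listed above can be sketched end to end in a few lines. This toy pipeline substitutes TF-IDF retrieval for a real embedding model and stubs out the LLM call, so every name in it is illustrative rather than anything discussed in the episode:

```python
# Toy RAG pipeline: documents -> vectors -> retrieval -> prompt for an LLM.
# TF-IDF stands in for a real embedding model, and call_llm() is a placeholder
# for whichever model/provider you actually use.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Invoices are processed nightly by the back-office automation service.",
    "PHI must be de-identified before it leaves the clinical data store.",
    "GPU capacity is reserved per team via the internal scheduler.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)


def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [documents[i] for i in top]


def call_llm(prompt: str) -> str:
    # Placeholder: substitute your model/provider call here.
    return f"[LLM answer based on prompt of {len(prompt)} chars]"


question = "When are invoices processed?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(call_llm(prompt))
```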

Notable insights & quotes

  • “Do the simplest thing that works.” — advice for teams choosing between prompting, RAG, and fine‑tuning.
  • “If you want mediocre results, you can get them with RAG relatively pain‑free… but production is harder.” — on RAG’s limits.
  • “You’re going to have to use another LLM to evaluate an LLM… turtles all the way down.” — on the pitfalls of automated LLM evaluation.
  • LLMs are surprisingly poor at reliably identifying PII at scale; specialized NER/de‑identification tends to be better and cheaper for large datasets (see the NER sketch after this list).
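
To make the NER point concrete: a general-purpose spaCy pipeline already surfaces entity types that overlap with common PII (names, dates, locations). A quick sketch, assuming the en_core_web_sm model is installed; this is a starting point, not a complete de-identification solution:

```python
# Quick look at what off-the-shelf NER finds in a sentence containing PII-like entities.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Jane Doe visited the Boston clinic on March 3, 2025 and paid with card ending 4242."
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. PERSON, GPE, DATE entities

# Note the card number is not flagged: general NER is not a full PII detector,
# which is why purpose-built de-identification tools exist.
```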

Practical recommendations / action checklist

  • Choose the simplest approach that meets requirements:
    • Prompt first → if insufficient, try RAG → if still insufficient, invest in fine‑tuning.
  • Before RAG or fine‑tuning, invest in data hygiene:
    • Sample your data and annotate ground truth for evaluation.
    • Remove or de‑identify PII/PHI where required (use NER tools or synthesis).
  • Evaluation strategy:
    • Create a human-labeled evaluation set from your own data to compute accuracy/F1 (see the scoring sketch after this checklist).
    • Use frameworks (e.g., Tonic Validate) to run custom scoring functions and track trends over time.
    • Prefer trend monitoring over single absolute automated scores.
  • Cost & deployment:
    • Consider SaaS vs self-hosted (air‑gapped) based on compliance.
    • Plan for GPU access and cloud compatibility; GPU availability can be a practical bottleneck.
  • Start open-source for exploration:
    • Try Presidio, spaCy, and other NER tools for identifying sensitive entities before evaluating vendor solutions.
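
For the human-labeled evaluation set recommended above, computing accuracy/F1 is straightforward once judgments are binarized (e.g., "answer acceptable" yes/no). A minimal sketch with scikit-learn; the labels are invented for illustration:

```python
# Scoring model outputs against a human-labeled evaluation set.
# 1 = human judged the answer acceptable, 0 = not acceptable.
from sklearn.metrics import accuracy_score, f1_score

human_labels = [1, 0, 1, 1, 0, 1, 0, 1]      # ground truth from your annotators
model_judgments = [1, 0, 1, 0, 0, 1, 1, 1]   # e.g. an automated evaluator's verdicts

print("accuracy:", accuracy_score(human_labels, model_judgments))
print("f1:", f1_score(human_labels, model_judgments))
```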

When to prefer each approach

  • Prompting: fast experiments, low cost, works for many simple/short-context tasks (see the prompt-template sketch after this list).
  • RAG: when you need larger context than a prompt can hold, but expect heavier infra and data engineering.
  • Fine‑tuning: when you need consistently high accuracy and are willing to invest in training infrastructure and evaluation.
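
Because prompting is the recommended first step, a provider-agnostic few-shot template is often all the tooling needed at that stage. A sketch with an invented classification task; swap in your own examples and model call:

```python
# Building a few-shot prompt as plain text; send the result to whichever
# LLM/provider you are testing. The task and examples below are illustrative.
EXAMPLES = [
    ("Reset my password", "account_access"),
    ("Where is my invoice?", "billing"),
]


def few_shot_prompt(query: str) -> str:
    shots = "\n".join(f"Ticket: {t}\nCategory: {c}" for t, c in EXAMPLES)
    return (
        "Classify the support ticket into a category.\n\n"
        f"{shots}\n\n"
        f"Ticket: {query}\nCategory:"
    )


print(few_shot_prompt("I was charged twice this month"))
```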

Tools & resources mentioned

  • Tonic.ai — de-identification, synthetic data, data preparation (commercial product)
  • Tonic Validate — open-source framework for scoring and tracking LLM outputs
  • Open-source NER / de-id options: Presidio, spaCy (see the Presidio sketch below)
  • General note: many LLMs/solutions are available; test multiple base models because strengths vary by use case.
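
For exploring the open-source route, Microsoft Presidio pairs an analyzer (detection) with an anonymizer (redaction). The snippet below follows Presidio's documented quick-start pattern; verify entity coverage on your own data before relying on it:

```python
# Detect and redact PII with Presidio.
# Requires: pip install presidio-analyzer presidio-anonymizer
# (plus a spaCy model for the default NLP engine).
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "Contact Jane Doe at 555-010-2222 about the claim."

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")  # detected PII spans with types/scores

anonymizer = AnonymizerEngine()
redacted = anonymizer.anonymize(text=text, analyzer_results=results)
print(redacted.text)  # e.g. "Contact <PERSON> at <PHONE_NUMBER> about the claim."
```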

Final practical next steps (for teams)

  1. Define success metrics for your use case (accuracy, latency, compliance, ROI).
  2. Sample and label a representative subset of your data for evaluation.
  3. Try prompt engineering across a few LLMs; measure outcomes.
  4. If prompts fail, prototype a RAG pipeline with careful chunking and retrieval, and measure again (a baseline chunking sketch follows these steps).
  5. If you still need better results, plan a fine‑tuning path (with cleaned/de‑identified training data).
  6. Use an evaluation framework (or human sampling) to track model quality and regressions over time.
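
Step 4's "careful chunking" can start from a simple fixed-size, overlapping-window baseline and only get smarter if evaluation shows it is the bottleneck. A sketch; the chunk size and overlap are arbitrary starting points:

```python
# Baseline fixed-size chunking with overlap (character-based for simplicity;
# token-based splitting is usually preferable in practice).
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


document = "..." * 1000  # placeholder for a real document
for i, chunk in enumerate(chunk_text(document)):
    print(i, len(chunk))
```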

For teams dealing with sensitive/unstructured data, prioritize de‑identification and compliance early—this is often the true blocker to using data for RAG or fine‑tuning.