Overview of Chasing Away Repetitive LLM Responses with Verbalized Sampling
Hosts Ben Jaffe and Katie Malone explain why large language models (LLMs) often produce repetitive, “safe” outputs for creative tasks and present a simple prompt-based mitigation called verbalized sampling. The episode summarizes a paper that (1) diagnoses mode collapse as a consequence of typicality bias in the human preference data used for alignment, (2) shows mathematically how that bias can drive collapse, and (3) demonstrates that asking models to verbalize samples and their probabilities recovers the richer pre-training distribution with little or no quality loss.
Key takeaways
- “Mode collapse”: After alignment (RLHF / human preference tuning), LLM outputs tend to concentrate on a small set of typical, safe responses, reducing diversity.
- Typicality bias: Human annotators prefer more familiar (“typical”) responses, and those aggregated preferences push the model toward safe, middle-of-the-road outputs.
- Verbalized sampling: Asking the model to output multiple candidates plus numeric probabilities (i.e., verbalize a sampled distribution) restores diversity by forcing the model to expose its underlying pre-training distribution.
- Practical and simple: A small prompt/system-prompt change produces much more diverse outputs and often maintains — sometimes improves — quality. The effect is stronger on larger, more capable models.
Why repetitiveness (mode collapse) happens
- Pre-training produces a rich internal probability distribution over many valid outputs.
- Alignment (human feedback / RLHF) trains the model to prefer typical answers humans select, narrowing the effective output distribution.
- Typicality bias (a cognitive phenomenon) causes annotators to favor familiar examples; when used as the basis for alignment, this produces a systematic bias toward common outputs.
- The paper provides empirical analysis of preference datasets and a mathematical argument showing that typicality-biased preference data can cause mode collapse.
What verbalized sampling is and why it works
- Idea: Ask the model to generate multiple candidate outputs and include a numeric probability for each candidate, i.e., make it “verbalize” its sampling distribution rather than returning a single top answer.
- Why it works: The model still encodes the richer pre-training distribution; verbalized sampling nudges it to reveal and sample from that fuller distribution rather than only the alignment-preferred mode.
- Result: Outputs match the pre-training distribution much more closely (more variety, including long-tail items) while largely preserving quality.
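The sampling step can be sketched in a few lines. This is not code from the paper; the candidates and probabilities below are invented for illustration. Once a model has verbalized candidates with numeric probabilities, a client can renormalize them and draw from the fuller distribution instead of always taking the top answer:

```python
import random

# Hypothetical verbalized output for "Name a U.S. state":
# candidate texts paired with the model's self-reported probabilities
# (all numbers here are illustrative, not from the paper).
candidates = [
    ("California", 0.30),
    ("Texas",      0.25),
    ("Oregon",     0.15),
    ("Vermont",    0.05),
    ("Wyoming",    0.02),
]

def sample_verbalized(cands, rng=random):
    """Draw one candidate in proportion to its verbalized probability.

    The reported probabilities need not sum to 1, so renormalize first.
    """
    texts, probs = zip(*cands)
    total = sum(probs)
    weights = [p / total for p in probs]
    return rng.choices(texts, weights=weights, k=1)[0]

print(sample_verbalized(candidates))
```

A greedy client would always return "California"; sampling against the verbalized weights surfaces long-tail items like "Wyoming" some of the time, which is exactly the diversity the technique aims to recover.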
Concrete prompt / system suggestion
Example system-style instruction from the paper (paraphrased): You are a helpful assistant. For each query, generate a set of five possible responses (each in a separate response tag). For each response include text and a numeric probability. Please sample at random from the full distribution, or optionally sample specifically from the tails (e.g., probabilities < 0.1).
Example user prompt patterns:
- “Give me five jokes about coffee and their estimated probabilities.”
- “List five possible image captions for this prompt and the probability for each.”
Adjust the number of responses (the paper uses five) and whether to encourage tail sampling, depending on how much diversity you want.
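The instruction above can be templated so the candidate count and tail threshold are easy to tune. This is a sketch, not the paper's exact wording; the function name and the `<response>` tag format are assumptions:

```python
def verbalized_sampling_prompt(task, k=5, tail_below=None):
    """Build a system-style instruction in the spirit of the paper's prompt.

    task       -- the user's query
    k          -- number of candidate responses to request
    tail_below -- if set (e.g. 0.1), ask the model to sample only
                  low-probability (tail) candidates
    """
    lines = [
        "You are a helpful assistant.",
        f"For the query below, generate {k} possible responses, "
        "each in its own <response> tag.",
        "Inside each <response>, include the response text "
        "and a numeric probability.",
    ]
    if tail_below is None:
        lines.append("Sample the responses at random from the full distribution.")
    else:
        lines.append(
            f"Sample only from the tails of the distribution "
            f"(probability < {tail_below})."
        )
    lines.append(f"Query: {task}")
    return "\n".join(lines)

print(verbalized_sampling_prompt("Tell me a joke about coffee.", k=5, tail_below=0.1))
```

Dropping this into a system prompt (or prepending it to the user message) is the entire intervention; no fine-tuning or decoding changes are needed.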
Evidence and examples
- Live demo in episode: repeatedly asking “Name a U.S. state” produced only a few states (e.g., Oregon, Colorado, Texas). Asking for 5 states + probabilities produced a distribution closer to the pre-training frequency (more diverse states).
- Paper examples cover story generation, image-captioning, and other creative tasks — verbalized sampling increases diversity and often aligns with pre-training distributions.
- Quantitative finding: diversity increases without a corresponding drop in judged quality; in some cases quality improves.
- Effect size is larger on stronger models (e.g., GPT-4, Claude 3.5 at the time of writing).
Practical recommendations
- Use verbalized sampling when you want creative diversity, brainstorming, or multiple alternative outputs (jokes, dialogue, story ideas, image prompts).
- For tasks requiring a single best/accurate answer (e.g., factual question with one correct response), request a single top answer instead.
- Tune:
- Number of candidates (more candidates -> broader coverage).
- Whether to sample from full distribution or tail (to encourage novelty).
- Try it on your favorite model — larger/more capable models tend to show bigger improvements.
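The tail-sampling knob can also be applied client-side after the model responds. A minimal sketch, with invented candidate data (the jokes and probabilities are illustrative, not from the paper):

```python
# Hypothetical parsed output from a verbalized-sampling prompt:
# (candidate text, self-reported probability). Numbers are illustrative.
candidates = [
    ("pun about espresso",  0.40),
    ("dark-roast wordplay", 0.30),
    ("latte-art limerick",  0.08),
    ("decaf haiku",         0.03),
]

def tail_candidates(cands, threshold=0.1):
    """Keep only long-tail candidates whose verbalized probability
    falls below the threshold (novelty over typicality)."""
    return [(text, p) for text, p in cands if p < threshold]

print(tail_candidates(candidates))
# -> [('latte-art limerick', 0.08), ('decaf haiku', 0.03)]
```

Raising the threshold admits more typical candidates; lowering it pushes harder toward novelty, at the cost of the quality risks noted under limitations below.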
Limitations and caveats
- Tail outputs can include lower-quality or less relevant items; however, the paper found quality didn’t fall substantially and sometimes rose.
- This technique addresses diversity at inference time; it doesn’t remove the underlying typicality bias in alignment data.
- Might need calibration for different models and tasks; effects reported stronger on more capable models.
Notable quotes / insights
- “Alignment is not making models less capable; it’s teaching them to favor typical outputs — the knowledge is still there, just hidden behind a collapsed distribution.”
- Simple prompt changes can recover creative diversity that was present in pre-training.
Bottom line
If you want LLM creativity and diversity back, ask the model to show (and sample from) its distribution: generate multiple candidate answers with probabilities. It’s an inexpensive, practical way to recover generative richness while keeping or improving answer quality, and it is especially effective on larger models. Links to the paper and examples are provided in the episode show notes.