Mode collapse happens through typicality bias in preference data, which gets amplified by alignment training.
The mechanism (Section 3):
- Human annotators have cognitive biases: they prefer text that's familiar, fluent, and predictable (mere-exposure effect, processing fluency, availability heuristic).
- This creates biased preference data: when choosing between two responses of equal quality, annotators systematically favor the more "typical" one, i.e., the response that sounds more like what they've seen before. The paper models this as r(x,y) = r_true(x,y) + α·log π_ref(y|x), with α = 0.57 ± 0.07.
- Alignment amplifies the bias: RLHF optimizes toward higher rewards. Since typical responses get higher rewards (even controlling for quality), the model learns to concentrate probability mass on stereotypical outputs.
- Result: a base model might assign roughly equal probability to 100 different jokes. After alignment, it assigns 95% probability to one joke ("Why did the coffee file a police report?") and 5% to everything else. The distribution has collapsed to a narrow mode.
Critical insight: Even with perfect optimization algorithms, the bias in the preference data itself drives collapse.
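To see why perfect optimization doesn't help, here's a minimal numeric sketch (the five candidate jokes, their base probabilities, and the KL weight β are made up; α is the paper's estimate). The exact KL-regularized optimum is π*(y) ∝ π_ref(y)·exp(r(y)/β), and with r = r_true + α·log π_ref this becomes π_ref^(1+α/β)·exp(r_true/β), a sharpened copy of the base distribution:

```python
import numpy as np

# Toy setup: 5 candidate jokes, all equally good (same true reward),
# but with slightly different base-model (pi_ref) probabilities.
pi_ref = np.array([0.26, 0.22, 0.20, 0.17, 0.15])  # made-up values
r_true = np.zeros(5)          # equal quality: typicality is the only signal
alpha, beta = 0.57, 0.1       # alpha from the paper; beta is an assumed KL weight

# Exact optimum of max_pi E[r] - beta * KL(pi || pi_ref):
#   pi*(y) ∝ pi_ref(y) * exp(r(y) / beta)
# With r = r_true + alpha * log pi_ref this becomes
#   pi*(y) ∝ pi_ref(y)^(1 + alpha/beta) * exp(r_true(y) / beta)
r = r_true + alpha * np.log(pi_ref)
pi_star = pi_ref * np.exp(r / beta)
pi_star /= pi_star.sum()

print("base   :", np.round(pi_ref, 3))   # broad distribution
print("aligned:", np.round(pi_star, 3))  # peaked on the most typical joke
```

Even with the optimizer finding this exact optimum, the aligned distribution concentrates on the most typical joke purely because the reward inherited the α·log π_ref term from the data.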
Mode collapse occurs when alignment training narrows the model's output distribution from many possible responses to a few stereotypical ones.
Before alignment (base model):
- Model trained on diverse internet text
- Ask "tell me a coffee joke" → many jokes possible with roughly similar probabilities
- Outputs reflect the actual diversity of jokes in the training data
During preference collection:
- Annotators compare responses: joke A vs joke B
- Even if both jokes are equally good, annotators pick the more familiar-sounding one
- This bias contaminates the preference dataset
After alignment (RLHF/DPO):
- Model optimized to match human preferences
- Learns that typical = preferred = high reward
- Probability distribution sharpens around stereotypical responses
- Ask "tell me a coffee joke" 5 times → same joke 5 times
The paper shows this on HELPSTEER: when controlling for correctness, annotators still prefer responses with higher base model likelihood (α=0.57). This preference for typicality, not quality differences, causes the collapse.
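As a sanity check on where a number like α = 0.57 could come from, here's a small simulation (the data and the no-intercept logistic fit are illustrative assumptions, not the paper's actual HELPSTEER pipeline): generate tied-quality pairs whose preferences depend only on the base-model log-likelihood gap, then recover the coefficient.

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
alpha_true = 0.57                      # ground-truth bias used to simulate annotators
n_pairs = 6874                         # same order as the paper's tied-correctness pairs

# For each pair, the difference in base-model log-likelihood between response A and B.
dlogp = rng.normal(0.0, 1.0, n_pairs)  # hypothetical values

# Bradley-Terry-style annotator: with equal quality, preference depends only on typicality.
prefers_A = rng.random(n_pairs) < expit(alpha_true * dlogp)

# Recover alpha with a no-intercept logistic regression of choice on the log-likelihood gap.
model = LogisticRegression(fit_intercept=False, C=1e6)
model.fit(dlogp.reshape(-1, 1), prefers_A.astype(int))
print("estimated alpha:", round(model.coef_[0][0], 2))  # ≈ 0.57 up to sampling noise
```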
Mode collapse is a two-stage process: bias introduction during data collection, then bias amplification during training.
Stage 1: Typicality bias enters preference data (Section 3.1)
Human psychology creates systematic preferences unrelated to task quality:
- Mere-exposure effect: People like what they've seen before
- Processing fluency: Easy-to-process text feels more correct
- Availability heuristic: Common phrases come to mind faster, feel more probable
When annotators rate responses, they unconsciously favor conventional text. The paper verified this: in 6,874 response pairs with identical correctness scores, annotators still preferred higher-likelihood responses (p < 10^-14).
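A quick back-of-the-envelope check on that significance level (the 55% win rate below is an assumed, illustrative number; only the pair count comes from the paper):

```python
from scipy.stats import binomtest

n_pairs = 6874          # tied-correctness pairs reported in the paper
win_rate = 0.55         # hypothetical share preferring the higher-likelihood response
wins = int(n_pairs * win_rate)

# Null hypothesis: with quality held equal, annotators pick either response 50/50.
result = binomtest(wins, n_pairs, p=0.5, alternative='greater')
print(f"wins={wins}/{n_pairs}, p-value={result.pvalue:.2e}")
# Even this modest skew gives p around 1e-16, below the 1e-14 threshold quoted above.
```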
Stage 2: Alignment amplifies the bias (Section 3.2)
RLHF maximizes expected reward under a KL penalty: max_π E_{y~π}[r(x,y)] - β·KL(π || π_ref)
Since r includes typicality bias (r = r_true + α·log π_ref), the optimization pushes probability mass toward typical completions. In fact, the KL-regularized optimum is π*(y|x) ∝ π_ref(y|x)^(1+α/β)·exp(r_true(x,y)/β), a sharpened copy of the base distribution whenever α > 0. The KL regularization term limits how far you can move from the base model, but it doesn't prevent sharpening the distribution within that constraint.
Result: When multiple high-quality responses exist (creative tasks), typicality becomes the tie-breaker. The model learns "safe = typical," collapsing to narrow modes.
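To make "limits but doesn't prevent" concrete, here is a toy calculation reusing the made-up five-response setup from the earlier sketch: the aligned optimum sits at a modest KL distance from the base model even as probability mass piles onto the most typical response.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def entropy(p):
    return float(-np.sum(p * np.log(p)))

pi_ref = np.array([0.26, 0.22, 0.20, 0.17, 0.15])   # made-up broad base distribution
alpha, beta = 0.57, 0.1                              # alpha from the paper; beta assumed

# KL-regularized optimum when the reward is purely typicality (equal true quality):
pi_star = pi_ref ** (1 + alpha / beta)
pi_star /= pi_star.sum()

print("KL(pi* || pi_ref) :", round(kl(pi_star, pi_ref), 2))  # finite, permitted by the penalty
print("entropy base      :", round(entropy(pi_ref), 2))
print("entropy aligned   :", round(entropy(pi_star), 2))     # lower: mass has concentrated
```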
The paper's key insight: mode collapse isn't primarily an algorithmic failure—it's a data problem.
Traditional explanation (wrong): RLHF's optimization or KL regularization causes mode collapse by favoring majority responses.
This paper's explanation (correct): The preference data itself contains systematic bias before any algorithm sees it.
Here's how it works:
Step 1: Annotators evaluate responses. For creative tasks like joke generation, many responses are roughly equally good. But annotators don't rate them equally—they prefer familiar-sounding responses due to cognitive biases.
Step 2: This creates preference data where r(x,y) = quality + typicality, not just quality alone. The paper quantifies this: α=0.57 means typicality contributes significantly to perceived reward.
Step 3: Any alignment method (RLHF, DPO, etc.) trained on this biased data will learn to maximize biased rewards. The model doesn't know α should be zero—it just learns "typical responses get higher ratings."
Step 4: During inference, the aligned model generates typical responses because that's what maximized reward during training. Distribution collapses to high-typicality modes.
Why this matters: You can't fix mode collapse by improving RLHF algorithms alone. The contamination is in the data. You need either (1) bias-corrected preference collection or (2) inference-time workarounds like VS.
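Here is a toy sketch of Steps 2-4 for the DPO case (the five responses, base probabilities, annotator model, and hyperparameters are all assumptions for illustration): preferences generated purely by typicality are enough to make the trained policy collapse.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy world: 5 candidate responses, all equally good, with a broad base policy.
pi_ref = torch.tensor([0.26, 0.22, 0.20, 0.17, 0.15])   # made-up base probabilities
log_pi_ref = pi_ref.log()
alpha, beta = 0.57, 0.1                                  # typicality bias; DPO temperature

# Simulate biased annotators: quality is tied, so preference follows typicality alone.
n_pairs = 20_000
a = torch.randint(0, 5, (n_pairs,))
b = torch.randint(0, 5, (n_pairs,))
p_a_wins = torch.sigmoid(alpha * (log_pi_ref[a] - log_pi_ref[b]))
a_wins = torch.rand(n_pairs) < p_a_wins
y_w = torch.where(a_wins, a, b)   # "chosen" responses
y_l = torch.where(a_wins, b, a)   # "rejected" responses

# Policy to train: just a vector of logits over the 5 responses.
logits = torch.nn.Parameter(log_pi_ref.clone())
opt = torch.optim.Adam([logits], lr=0.05)

for step in range(2000):
    log_pi = F.log_softmax(logits, dim=0)
    # DPO loss: -log sigmoid(beta * (margin of chosen over rejected, relative to pi_ref))
    margin = (log_pi[y_w] - log_pi_ref[y_w]) - (log_pi[y_l] - log_pi_ref[y_l])
    loss = -F.logsigmoid(beta * margin).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("base   :", pi_ref.numpy().round(3))
print("aligned:", F.softmax(logits, dim=0).detach().numpy().round(3))
# The trained policy piles probability onto the most typical response,
# even though every response had identical "true" quality.
```

The point of the sketch: the DPO objective did nothing wrong; it faithfully fit the preferences it was given, and those preferences already encoded typicality.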
Mode collapse is prompt-dependent: different prompts have different modal responses, and for a standard instance-level prompt the mode is a single stereotypical output.
Think about what happens when you prompt a model:
Instance-level prompt ("Tell me a joke"):
- Model generates text token by token
- At each step, picks high-probability tokens
- In an aligned model, the highest-probability completion is the most stereotypical one
- The modal response to this prompt is one specific joke
- Sampling doesn't help much—you're sampling from a sharply peaked distribution
Why the distribution is peaked:
- Base model: broad distribution over many jokes
- Preference data: annotators preferred typical jokes (α=0.57)
- RLHF: optimized to match preferences
- Result: typical joke has much higher probability than creative alternatives
The collapse:
- Base model might have P(joke_typical) = 0.05, P(joke_creative) = 0.04 (small difference)
- After alignment: P(joke_typical) = 0.60, P(joke_creative) = 0.02 (large difference)
- Most samples land in the narrow "typical" mode
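Plugging in those illustrative numbers shows how quickly repeats take over a batch of 5 samples:

```python
from scipy.stats import binom

# Illustrative probabilities from the bullets above (not measured values).
p_typical_base, p_typical_aligned = 0.05, 0.60
n_samples = 5

for name, p in [("base", p_typical_base), ("aligned", p_typical_aligned)]:
    expected = n_samples * p                     # expected count of the typical joke
    majority = 1 - binom.cdf(2, n_samples, p)    # P(typical joke in >= 3 of 5 samples)
    all_same = p ** n_samples                    # P(all 5 samples are the typical joke)
    print(f"{name:7s} E[count]={expected:.2f}  P(majority)={majority:.3f}  P(all 5)={all_same:.4f}")
```

For the base model the typical joke shows up in a majority of the 5 samples about 0.1% of the time; for the aligned model it does so about 68% of the time, and all 5 samples are identical roughly 8% of the time.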
Why VS works:
- Distribution-level prompt ("Generate 5 jokes with probabilities")
- The modal response to THIS prompt is a diverse set of jokes
- Different prompt structure → different mode
- This mode happens to approximate the base model's diversity
The paper's Figure 1 illustrates this: direct prompting always produces "Why did the coffee file a police report?", while VS produces diverse jokes with stated probabilities.
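For contrast, here is roughly what the two prompting styles look like (the wording below is a paraphrase for illustration, not the paper's verbatim VS template):

```python
# Instance-level prompt: its modal response is one stereotypical joke.
direct_prompt = "Tell me a joke about coffee."

# Distribution-level (VS-style) prompt: its modal response is itself a diverse set.
vs_prompt = (
    "Generate 5 different jokes about coffee. For each joke, also state the "
    "probability you would assign to producing it, so the set reflects your "
    "full distribution over possible jokes."
)

print(direct_prompt)
print(vs_prompt)
```

The change is entirely in the prompt: asking for a distribution in one shot moves the model to a mode that approximates the base model's diversity, without retraining anything.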