A practical guide based on migrating 18 production AI operations (~175 test cases) from GPT-4.1-mini to Mercury 2, a diffusion-based LLM. Every rule below was learned from a real failure and validated with automated tests.
Autoregressive models (GPT, Claude, Gemini) generate one token at a time, left to right. Each token sees everything before it. They follow instructions well because they process them sequentially while generating.
Diffusion models (Mercury, etc.) generate all tokens roughly in parallel and iteratively refine them. They are pattern completers first, instruction followers second. They match the structural shape of your prompt's output example before they reason about content.
This single difference explains every rule below.
Diffusion models pattern-match from your output example more than from your written instructions. If you describe the desired output in prose, you'll get prose back.
Before (autoregressive — works fine):

```
Return a JSON object with the matches you found, how many items you cleared, and a brief summary.
```

After (diffusion — required):

```
OUTPUT FORMAT (JSON only):
{
  "matches": [
    {"name": "matched item", "category": "which category", "confidence": "high|moderate|low", "reasoning": "why this matched"}
  ],
  "cleared_count": 0,
  "summary": "brief analysis"
}
```
Without an explicit JSON template, diffusion models return markdown, prose, or JSON with invented field names. This was the single most common migration failure.
Autoregressive models infer valid values from surrounding context. Diffusion models cannot — when they see "..." as a placeholder, they often output null because they don't know what to generate in that position.
Before:

```
"density": {"value": "...", "reasoning": "..."}
```

After:

```
"density": {"value": "very_low|low|moderate|high|very_high", "reasoning": "Why this level"}
```

Every field that has a constrained set of values needs those values listed inline in the output format. No exceptions, no shortcuts.
Autoregressive models can resolve minor naming mismatches — they understand that confidence_reasoning and reasoning mean the same thing. Diffusion models follow the prompt literally. If your output template says confidence_reasoning, that exact string becomes the JSON key, and your downstream parser expecting reasoning silently fails.
Audit every key in your OUTPUT FORMAT against your actual parsing schema. They must be identical, character for character.
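That audit is easy to automate. A minimal sketch in Python — the regex heuristic is an assumption (it treats any quoted word directly followed by a colon as a JSON key), and `audit_prompt_keys` is a name invented here for illustration:

```python
import re

def audit_prompt_keys(prompt: str, parser_keys: set) -> set:
    """Return keys that appear in the prompt's OUTPUT FORMAT block
    but are absent from the downstream parser's schema."""
    # Heuristic: any quoted word directly followed by a colon is a key.
    prompt_keys = set(re.findall(r'"(\w+)"\s*:', prompt))
    return prompt_keys - parser_keys

prompt = '{"matches": [{"name": "...", "confidence_reasoning": "..."}], "summary": "..."}'
# Parser expects "reasoning"; the prompt says "confidence_reasoning".
mismatched = audit_prompt_keys(prompt, {"matches", "name", "reasoning", "summary"})
```

Run this as a unit test in CI so a prompt edit can never silently drift away from the parser.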
Diffusion models sometimes return arrays where strings are expected, or objects where you wanted a primitive. The output example alone isn't always enough — add explicit rules.
```
FIELD RULES:
- "source" MUST be a single string (the primary source), never an array.
  If multiple sources contribute, name only the dominant one.
- "value" MUST be a number.
- "unit" MUST be one of: "mg", "mcg", "g".
```
This matters most for fields where the model might reasonably produce either a string or an array (e.g., a "source" that could be one ingredient or several).
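Even with explicit FIELD RULES, the parser should degrade gracefully when the model slips. A hedged sketch of a lenient fallback for the "source" field — keeping the first array entry as the dominant source is an assumption of this example, not something the rules above mandate:

```python
def coerce_source(value):
    """Lenient fallback: if the model returns an array for "source"
    despite the FIELD RULES, keep the first entry (assumed dominant)
    instead of failing the whole parse."""
    if isinstance(value, list):
        return str(value[0]) if value else ""
    return str(value)
```

Coercions like this belong in the parsing layer, not the prompt — the prompt states the rule, the parser absorbs the ~rare violations.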
This is the most important migration insight.
Autoregressive models reliably follow lookup rules: "If the input is X, output Y." Diffusion models follow reasoning processes more reliably than memorized mappings. Under concurrent load, a hardcoded lookup rule gets dropped ~10% of the time. A step-by-step reasoning framework is followed consistently.
Before (lookup table — flakes under load):

```
RULES:
- Tuna → always flag mercury contamination
- Rice → always flag arsenic contamination
- Shark → always flag mercury contamination
```

After (reasoning framework — stable):

```
CONTAMINATION ASSESSMENT — apply for every item:

Step 1: Does this item have a BIOLOGICAL ACCUMULATION pathway?
(The substance concentrates up the food chain or in the growth medium.)
If yes → flag the relevant contaminant.

Step 2: Does processing this item CREATE harmful byproducts?
(e.g., high-heat processing creates carcinogens)
If yes → flag the relevant byproduct.

Step 3: If neither step applies → no contamination concern.
```
The lookup table tells the model WHAT to output. The framework teaches it HOW to think. Diffusion models execute reasoning steps more reliably than they recall specific mappings, because the step-by-step structure gives them a generation pattern to follow at every position simultaneously.
Measured impact: A lookup rule for a specific contamination flag went from ~90% reliable to 100% reliable under parallel load after switching to a reasoning framework.
Scales like very_negative | negative | neutral | positive | very_positive are semantically ambiguous. Does "positive" for an inflammation field mean "inflammation is present (bad)" or "health outcome is positive (good)"?
Autoregressive models usually figure this out from context. Diffusion models interpret it inconsistently — sometimes one way, sometimes the other, even within the same response.
Before (ambiguous):

```
"inflammation": "very_negative|negative|neutral|positive|very_positive"
```

After (unambiguous):

```
SCALE DIRECTION:
All scales measure HEALTH OUTCOME for the person, not presence of a trait.
- "very_positive" = strongly beneficial
- "positive" = beneficial
- "neutral" = no meaningful effect
- "negative" = harmful
- "very_negative" = strongly harmful

Example logic: if something REDUCES inflammation, that's a POSITIVE health outcome.
If it PROMOTES inflammation, that's NEGATIVE.
```
State the principle once, clearly, and let the model apply it. Don't list per-field examples — that's spoon-feeding and doesn't generalize.
Most diffusion model APIs expose a reasoning effort parameter. Higher reasoning = more internal "thinking" tokens = better logical accuracy but slower and more expensive.
Don't default to high. Test empirically:
- Run your test suite on medium reasoning
- Run it again (diffusion models are non-deterministic)
- If failures appear on the second run that weren't there on the first → bump to high
- If both runs pass → medium is sufficient
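The double-run check is easy to wire into CI. A minimal sketch — `stable_on_medium` is a name invented here, and how you pass the reasoning-effort setting to your suite is up to your harness:

```python
import subprocess

def stable_on_medium(test_cmd) -> bool:
    """Run the suite twice; one passing run proves little for a
    non-deterministic model. Any failure across the two runs means
    the operation should be bumped to high reasoning effort."""
    return all(subprocess.run(test_cmd).returncode == 0 for _ in range(2))
```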
In practice, ~80% of operations work fine on medium reasoning. The ~20% that need high are operations requiring:
- Multi-step logical reasoning (evaluating whether something meets multiple criteria)
- Interpreting ambiguous directional scales
- Safety-critical classifications where a wrong answer has real consequences
This is counterintuitive. OpenAI's response_format: { type: "json_schema" } guarantees schema compliance on autoregressive models. On diffusion models, it destroys reasoning quality.
In controlled A/B testing:
| Configuration | Accuracy |
|---|---|
| No schema + medium reasoning | 4/4 |
| json_schema + medium reasoning | 0/4 |
The schema enforcement consumes the model's reasoning token budget, leaving nothing for actual classification logic. The model produces structurally perfect JSON that's semantically wrong.
Instead: Rely on prompt-based JSON instruction (your OUTPUT FORMAT block) and parse the output with a lenient-then-strict validation pipeline. The diffusion model produces valid JSON naturally when the temperature is at its minimum and the output format is explicit.
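A minimal sketch of such a lenient-then-strict pipeline, using the field names from the earlier OUTPUT FORMAT example — the fence-stripping and brace-extraction heuristics are illustrative choices, not the only reasonable ones:

```python
import json

def parse_model_json(raw: str) -> dict:
    """Lenient-then-strict: try progressively more forgiving
    extractions first, then validate required keys strictly."""
    candidates = [raw]
    # Lenient: strip a markdown fence if the model added one.
    if "```" in raw:
        candidates.append(raw.split("```")[1].removeprefix("json"))
    # Lenient: grab the outermost {...} span if prose surrounds the JSON.
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        candidates.append(raw[start : end + 1])
    data = None
    for text in candidates:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            data = parsed
            break
    if data is None:
        raise ValueError("no parseable JSON object found")
    # Strict: every key from the OUTPUT FORMAT block must be present.
    missing = {"matches", "cleared_count", "summary"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data
```

The lenient passes absorb cosmetic variation; the strict pass at the end is what protects your downstream consumers.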
When converting an existing prompt for a diffusion model:
- Add an explicit `OUTPUT FORMAT (JSON):` block with the exact structure
- Replace all `"..."` placeholders with actual enum values
- Verify every field name matches your parsing schema exactly
- Add `FIELD RULES:` for any field where the type could be ambiguous
- Convert lookup-table rules to step-by-step reasoning frameworks
- Define any directional/ordinal scales explicitly with what the direction means
- Remove `json_schema`/`response_format` enforcement — rely on prompt-based JSON instruction
- Set temperature to the model's minimum (e.g., 0.5 for Mercury)
- Run your test suite twice to catch non-deterministic failures
- Bump reasoning effort to high only for operations that fail on the second run
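Put together, the request side of the checklist can be sketched as a plain payload — the model id and payload shape here are illustrative, assuming an OpenAI-compatible chat API; check your provider's documentation for the real values:

```python
def build_request(system_prompt: str, user_input: str) -> dict:
    """Assemble a chat-completion payload per the checklist:
    prompt-based JSON instruction only, minimum temperature, and
    deliberately NO response_format / json_schema field."""
    return {
        "model": "mercury",  # illustrative model id
        "temperature": 0.5,  # the model's minimum, per the checklist
        "messages": [
            # The system prompt carries the OUTPUT FORMAT block.
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
    }
```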
Without schema enforcement, diffusion models occasionally (~1% of calls) produce minor JSON syntax issues:
- JS-style comments (`// ...`) inside JSON
- Raw control characters (literal newlines, tabs) inside string values
A simple sanitizer handles both. The key subtlety: only escape control characters that are inside JSON string values, not structural whitespace between keys. A state machine that tracks whether you're inside a "quoted string" handles this correctly.
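A minimal sketch of that state machine — it handles the two failure modes above (literal newlines/tabs inside strings, `//` comments outside them) and is deliberately not a full JSON repair tool:

```python
def sanitize_json(raw: str) -> str:
    """Escape raw control characters inside JSON string values and
    drop //-style comments outside them; structural whitespace
    between keys is left untouched."""
    out, in_string, escaped, i = [], False, False, 0
    while i < len(raw):
        ch = raw[i]
        if in_string:
            if escaped:
                escaped = False
                out.append(ch)
            elif ch == "\\":
                escaped = True
                out.append(ch)
            elif ch == '"':
                in_string = False
                out.append(ch)
            elif ch == "\n":
                out.append("\\n")  # escape literal newline inside a string
            elif ch == "\t":
                out.append("\\t")  # escape literal tab inside a string
            else:
                out.append(ch)
            i += 1
        else:
            if ch == '"':
                in_string = True
                out.append(ch)
                i += 1
            elif raw.startswith("//", i):
                # Skip a JS-style comment through to end of line.
                while i < len(raw) and raw[i] != "\n":
                    i += 1
            else:
                out.append(ch)
                i += 1
    return "".join(out)
```

Run it only as a fallback after a straight `json.loads` fails, so the ~99% of clean responses skip it entirely.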
Diffusion models are inherently more non-deterministic than autoregressive models, especially under concurrent load. A prompt that passes 100% of the time when calls are made one at a time may fail on 5-10% of calls when 5 hit the API simultaneously.
This is not a bug — it's the architecture. Mitigation strategies:
- Flywheel caching — Cache correct results so the same input doesn't need to be re-evaluated
- Retry on parse failure — A single retry almost always succeeds
- Sequential critical paths — Don't parallelize safety-critical operations
- Reasoning effort — Higher reasoning reduces variance on logical operations
- Acceptance ranges — For subjective classifications, accept a range of valid answers rather than a single exact value
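The retry strategy above is a few lines of code. A hedged sketch — `call_with_retry` is a name invented here, and `call`/`parse` stand in for your API call and your validation pipeline:

```python
import json

def call_with_retry(call, parse, retries: int = 1):
    """Retry on parse failure: re-issue the (non-deterministic) call
    when its output fails validation. A single retry almost always
    succeeds, per the mitigation list above."""
    last_err = None
    for _ in range(retries + 1):
        raw = call()
        try:
            return parse(raw)
        except (json.JSONDecodeError, ValueError) as err:
            last_err = err
    raise last_err
```

Keep `retries` low; if an operation needs more than one retry routinely, that's a signal to raise reasoning effort or fix the prompt instead.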
Keep an operation on an autoregressive model if:
- It requires extended multi-turn reasoning (the diffusion model can't "think out loud" across multiple generation passes)
- It needs json_schema strict mode for downstream consumers that can't handle any structural variation
- The prompt is heavily few-shot dependent (diffusion models benefit less from examples than from clear structure)
- Latency doesn't matter and you'd rather have deterministic output than faster output
For everything else — especially structured JSON extraction, classification, and scoring — diffusion models are faster, cheaper, and (with proper prompting) equally accurate.