A practical guide based on migrating 18 production AI operations (~175 test cases) from GPT-4.1-mini to Mercury 2, a diffusion-based LLM. Every rule below was learned from a real failure and validated with automated tests.
Autoregressive models (GPT, Claude, Gemini) generate one token at a time, left to right. Each token sees everything before it. They follow instructions well because they process them sequentially while generating.
Diffusion models (Mercury, etc.) generate all tokens roughly in parallel and iteratively refine them. They are pattern completers first, instruction followers second. They match the structural shape of your prompt's output example before they reason about content.
This single difference explains every rule below.
Diffusion models pattern-match from your output example more than from your written instructions. If you describe the desired output in prose, you'll get prose back.
Before (autoregressive — works fine):

```
Return a JSON object with the matches you found, how many items you cleared, and a brief summary.
```

After (diffusion — required):

```
OUTPUT FORMAT (JSON only):
{
  "matches": [
    {"name": "matched item", "category": "which category", "confidence": "high|moderate|low", "reasoning": "why this matched"}
  ],
  "cleared_count": 0,
  "summary": "brief analysis"
}
```
Without an explicit JSON template, diffusion models return markdown, prose, or JSON with invented field names. This was the single most common migration failure.
Autoregressive models infer valid values from surrounding context. Diffusion models cannot — when they see "..." as a placeholder, they often output null because they don't know what to generate in that position.
Before:

```
"density": {"value": "...", "reasoning": "..."}
```

After:

```
"density": {"value": "very_low|low|moderate|high|very_high", "reasoning": "Why this level"}
```

Every field that has a constrained set of values needs those values listed inline in the output format. No exceptions, no shortcuts.
Autoregressive models can resolve minor naming mismatches — they understand that confidence_reasoning and reasoning mean the same thing. Diffusion models follow the prompt literally. If your output template says confidence_reasoning, that exact string becomes the JSON key, and your downstream parser expecting reasoning silently fails.
Audit every key in your OUTPUT FORMAT against your actual parsing schema. They must be identical, character for character.
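That audit is easy to automate. A minimal sketch in Python — the regex heuristic is an assumption (it treats any quoted word directly followed by a colon as a JSON key), and `audit_prompt_keys` is a name invented here for illustration:

```python
import re

def audit_prompt_keys(prompt: str, parser_keys: set) -> set:
    """Return keys that appear in the prompt's OUTPUT FORMAT block
    but are absent from the downstream parser's schema."""
    # Heuristic: any quoted word directly followed by a colon is a key.
    prompt_keys = set(re.findall(r'"(\w+)"\s*:', prompt))
    return prompt_keys - parser_keys

prompt = '{"matches": [{"name": "...", "confidence_reasoning": "..."}], "summary": "..."}'
# Parser expects "reasoning"; the prompt says "confidence_reasoning".
mismatched = audit_prompt_keys(prompt, {"matches", "name", "reasoning", "summary"})
```

Run this as a unit test in CI so a prompt edit can never silently drift away from the parser.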
Diffusion models sometimes return arrays where strings are expected, or objects where you wanted a primitive. The output example alone isn't always enough — add explicit rules.
```
FIELD RULES:
- "source" MUST be a single string (the primary source), never an array.
  If multiple sources contribute, name only the dominant one.
- "value" MUST be a number.
- "unit" MUST be one of: "mg", "mcg", "g".
```
This matters most for fields where the model might reasonably produce either a string or an array (e.g., a "source" that could be one ingredient or several).
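Even with explicit FIELD RULES, the parser should degrade gracefully when the model slips. A hedged sketch of a lenient fallback for the "source" field — keeping the first array entry as the dominant source is an assumption of this example, not something the rules above mandate:

```python
def coerce_source(value):
    """Lenient fallback: if the model returns an array for "source"
    despite the FIELD RULES, keep the first entry (assumed dominant)
    instead of failing the whole parse."""
    if isinstance(value, list):
        return str(value[0]) if value else ""
    return str(value)
```

Coercions like this belong in the parsing layer, not the prompt — the prompt states the rule, the parser absorbs the ~rare violations.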
This is the most important migration insight.
Autoregressive models reliably follow lookup rules: "If the input is X, output Y." Diffusion models follow reasoning processes more reliably than memorized mappings. Under concurrent load, a hardcoded lookup rule gets dropped ~10% of the time. A step-by-step reasoning framework is followed consistently.
Before (lookup table — flakes under load):

```
RULES:
- Tuna → always flag mercury contamination
- Rice → always flag arsenic contamination
- Shark → always flag mercury contamination
```

After (reasoning framework — stable):

```
CONTAMINATION ASSESSMENT — apply for every item:

Step 1: Does this item have a BIOLOGICAL ACCUMULATION pathway?
(The substance concentrates up the food chain or in the growth medium.)
If yes → flag the relevant contaminant.

Step 2: Does processing this item CREATE harmful byproducts?
(e.g., high-heat processing creates carcinogens)
If yes → flag the relevant byproduct.

Step 3: If neither step applies → no contamination concern.
```
The lookup table tells the model WHAT to output. The framework teaches it HOW to think. Diffusion models execute reasoning steps more reliably than they recall specific mappings, because the step-by-step structure gives them a generation pattern to follow at every position simultaneously.
Measured impact: A lookup rule for a specific contamination flag went from ~90% reliable to 100% reliable under parallel load after switching to a reasoning framework.
Scales like very_negative | negative | neutral | positive | very_positive are semantically ambiguous. Does "positive" for an inflammation field mean "inflammation is present (bad)" or "health outcome is positive (good)"?
Autoregressive models usually figure this out from context. Diffusion models interpret it inconsistently — sometimes one way, sometimes the other, even within the same response.
Before (ambiguous):

```
"inflammation": "very_negative|negative|neutral|positive|very_positive"
```

After (unambiguous):

```
SCALE DIRECTION:
All scales measure HEALTH OUTCOME for the person, not presence of a trait.
- "very_positive" = strongly beneficial
- "positive" = beneficial
- "neutral" = no meaningful effect
- "negative" = harmful
- "very_negative" = strongly harmful

Example logic: if something REDUCES inflammation, that's a POSITIVE health outcome.
If it PROMOTES inflammation, that's NEGATIVE.
```
State the principle once, clearly, and let the model apply it. Don't list per-field examples — that's spoon-feeding and doesn't generalize.
Most diffusion model APIs expose a reasoning effort parameter. Higher reasoning = more internal "thinking" tokens = better logical accuracy but slower and more expensive.
Don't default to high. Test empirically:
- Run your test suite on medium reasoning
- Run it again (diffusion models are non-deterministic)
- If failures appear on the second run that weren't there on the first → bump to high
- If both runs pass → medium is sufficient
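The double-run check is easy to wire into CI. A minimal sketch — `stable_on_medium` is a name invented here, and how you pass the reasoning-effort setting to your suite is up to your harness:

```python
import subprocess

def stable_on_medium(test_cmd) -> bool:
    """Run the suite twice; one passing run proves little for a
    non-deterministic model. Any failure across the two runs means
    the operation should be bumped to high reasoning effort."""
    return all(subprocess.run(test_cmd).returncode == 0 for _ in range(2))
```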
In practice, ~80% of operations work fine on medium reasoning. The ~20% that need high are operations requiring:
- Multi-step logical reasoning (evaluating whether something meets multiple criteria)
- Interpreting ambiguous directional scales
- Safety-critical classifications where a wrong answer has real consequences
This is counterintuitive. OpenAI's response_format: { type: "json_schema" } guarantees schema compliance on autoregressive models. On diffusion models, it destroys reasoning quality.
In controlled A/B testing:
| Configuration | Accuracy |
|---|---|
| No schema + medium reasoning | 4/4 |
| json_schema + medium reasoning | 0/4 |
The schema enforcement consumes the model's reasoning token budget, leaving nothing for actual classification logic. The model produces structurally perfect JSON that's semantically wrong.
Instead: Rely on prompt-based JSON instruction (your OUTPUT FORMAT block) and parse the output with a lenient-then-strict validation pipeline. The diffusion model produces valid JSON naturally when the temperature is at its minimum and the output format is explicit.
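A minimal sketch of such a lenient-then-strict pipeline, using the field names from the earlier OUTPUT FORMAT example — the fence-stripping and brace-extraction heuristics are illustrative choices, not the only reasonable ones:

```python
import json

def parse_model_json(raw: str) -> dict:
    """Lenient-then-strict: try progressively more forgiving
    extractions first, then validate required keys strictly."""
    candidates = [raw]
    # Lenient: strip a markdown fence if the model added one.
    if "```" in raw:
        candidates.append(raw.split("```")[1].removeprefix("json"))
    # Lenient: grab the outermost {...} span if prose surrounds the JSON.
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        candidates.append(raw[start : end + 1])
    data = None
    for text in candidates:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            data = parsed
            break
    if data is None:
        raise ValueError("no parseable JSON object found")
    # Strict: every key from the OUTPUT FORMAT block must be present.
    missing = {"matches", "cleared_count", "summary"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data
```

The lenient passes absorb cosmetic variation; the strict pass at the end is what protects your downstream consumers.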
When converting an existing prompt for a diffusion model:
- Add an explicit `OUTPUT FORMAT (JSON):` block with the exact structure
- Replace all `"..."` placeholders with actual enum values
- Verify every field name matches your parsing schema exactly
- Add `FIELD RULES:` for any field where the type could be ambiguous
- Convert lookup-table rules to step-by-step reasoning frameworks
- Define any directional/ordinal scales explicitly with what the direction means
- Remove `json_schema`/`response_format` enforcement — rely on prompt-based JSON instruction
- Set temperature to the model's minimum (e.g., 0.5 for Mercury)
- Run your test suite twice to catch non-deterministic failures
- Bump reasoning effort to high only for operations that fail on the second run
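Put together, the request side of the checklist can be sketched as a plain payload — the model id and payload shape here are illustrative, assuming an OpenAI-compatible chat API; check your provider's documentation for the real values:

```python
def build_request(system_prompt: str, user_input: str) -> dict:
    """Assemble a chat-completion payload per the checklist:
    prompt-based JSON instruction only, minimum temperature, and
    deliberately NO response_format / json_schema field."""
    return {
        "model": "mercury",  # illustrative model id
        "temperature": 0.5,  # the model's minimum, per the checklist
        "messages": [
            # The system prompt carries the OUTPUT FORMAT block.
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
    }
```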
Without schema enforcement, diffusion models occasionally (~1% of calls) produce minor JSON syntax issues:
- JS-style comments (`// ...`) inside JSON
- Raw control characters (literal newlines, tabs) inside string values
A simple sanitizer handles both. The key subtlety: only escape control characters that are inside JSON string values, not structural whitespace between keys. A state machine that tracks whether you're inside a "quoted string" handles this correctly.
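A minimal sketch of that state machine — it handles the two failure modes above (literal newlines/tabs inside strings, `//` comments outside them) and is deliberately not a full JSON repair tool:

```python
def sanitize_json(raw: str) -> str:
    """Escape raw control characters inside JSON string values and
    drop //-style comments outside them; structural whitespace
    between keys is left untouched."""
    out, in_string, escaped, i = [], False, False, 0
    while i < len(raw):
        ch = raw[i]
        if in_string:
            if escaped:
                escaped = False
                out.append(ch)
            elif ch == "\\":
                escaped = True
                out.append(ch)
            elif ch == '"':
                in_string = False
                out.append(ch)
            elif ch == "\n":
                out.append("\\n")  # escape literal newline inside a string
            elif ch == "\t":
                out.append("\\t")  # escape literal tab inside a string
            else:
                out.append(ch)
            i += 1
        else:
            if ch == '"':
                in_string = True
                out.append(ch)
                i += 1
            elif raw.startswith("//", i):
                # Skip a JS-style comment through to end of line.
                while i < len(raw) and raw[i] != "\n":
                    i += 1
            else:
                out.append(ch)
                i += 1
    return "".join(out)
```

Run it only as a fallback after a straight `json.loads` fails, so the ~99% of clean responses skip it entirely.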
Diffusion models are inherently more non-deterministic than autoregressive models, especially under concurrent load. A prompt that passes 100% of the time when calls are made one at a time may fail on 5-10% of calls when 5 hit the API simultaneously.
This is not a bug — it's the architecture. Mitigation strategies:
- Flywheel caching — Cache correct results so the same input doesn't need to be re-evaluated
- Retry on parse failure — A single retry almost always succeeds
- Sequential critical paths — Don't parallelize safety-critical operations
- Reasoning effort — Higher reasoning reduces variance on logical operations
- Acceptance ranges — For subjective classifications, accept a range of valid answers rather than a single exact value
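The retry strategy above is a few lines of code. A hedged sketch — `call_with_retry` is a name invented here, and `call`/`parse` stand in for your API call and your validation pipeline:

```python
import json

def call_with_retry(call, parse, retries: int = 1):
    """Retry on parse failure: re-issue the (non-deterministic) call
    when its output fails validation. A single retry almost always
    succeeds, per the mitigation list above."""
    last_err = None
    for _ in range(retries + 1):
        raw = call()
        try:
            return parse(raw)
        except (json.JSONDecodeError, ValueError) as err:
            last_err = err
    raise last_err
```

Keep `retries` low; if an operation needs more than one retry routinely, that's a signal to raise reasoning effort or fix the prompt instead.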
Keep an operation on an autoregressive model if:
- It requires extended multi-turn reasoning (the diffusion model can't "think out loud" across multiple generation passes)
- It needs json_schema strict mode for downstream consumers that can't handle any structural variation
- The prompt is heavily few-shot dependent (diffusion models benefit less from examples than from clear structure)
- Latency doesn't matter and you'd rather have deterministic output than faster output
For everything else — especially structured JSON extraction, classification, and scoring — diffusion models are faster, cheaper, and (with proper prompting) equally accurate.