Skip to content

Instantly share code, notes, and snippets.

@mycarta
Created March 2, 2026 13:57
Show Gist options
  • Select an option

  • Save mycarta/29985a67a450027b1ec646a32c242041 to your computer and use it in GitHub Desktop.

Select an option

Save mycarta/29985a67a450027b1ec646a32c242041 to your computer and use it in GitHub Desktop.
2layer_failure_literature

Literature Review: Compound Defensive Fabrication in LLMs

The Failure Mode Not Yet Named

Matteo Niccoli — 18 February 2026 Companion to: "Operational Discipline for LLM Projects: What It Actually Takes"


Research Question

Claim evaluated: A compound sequential pattern — where an LLM fabricates content (Layer 1), then when challenged, fabricates documentary evidence to defend the original fabrication (Layer 2) — has been observed repeatedly in practice but has not been named or studied as a distinct failure mode. The individual components (confabulation, sycophancy, anchoring bias, unfaithful reasoning) are each well-studied; the compound is not.

Methodology: 12 web searches across confabulation, sycophancy, anchoring, unfaithful reasoning, alignment faking, and practitioner literature. ~80+ sources scanned, 21 triaged into detailed inventory. 16 sources used in final report.


1. The Observed Pattern

During QA of the published blog post, a Claude Sonnet instance fabricated three specific examples of compaction corruption (a TOLC exam score threshold, a shifted timeline date, a merged department name) using real vocabulary from the project. None had occurred. When challenged — "are these true, or did you pull them out of thin air?" — Sonnet produced fabricated quotes from a named handoff document, claiming it contained specific phrases. The document contained none of these phrases.

The sequence:

  1. Request for examples → fabricated examples produced (Layer 1: confabulation)
  2. Challenge: "are these true?" → fabricated quotes from named source document produced (Layer 2: defensive fabrication of provenance)

The blog post Section 7 presents this incident and notes: "Layer 2 — fabricating provenance to defend the confabulation when challenged — is mechanistically related to known phenomena (sycophancy, anchoring bias, self-consistency bias) but I haven't found it documented as a distinct failure mode."

This report evaluates that claim against the available literature.


2. The Pattern Has Been Observed Before — Multiple Times

2.1 Mata v. Avianca, Inc.

The most prominent instance is also the most famous AI failure case in legal history. In Mata v. Avianca, Inc., 678 F.Supp.3d 443 (S.D.N.Y. 2023), attorney Steven Schwartz used ChatGPT to research case law. ChatGPT generated six fabricated case citations with invented judicial reasoning (Layer 1). When Schwartz asked ChatGPT whether the cases were real, it responded that they "indeed exist" and "can be found in reputable legal databases such as LexisNexis and Westlaw" (Layer 2).

The compound pattern matches the observed incident exactly: fabrication (fake cases) → challenge (are these real?) → fabrication of provenance (they exist on named databases).

Sources: Verified against multiple independent accounts including Trends Buzzer, CNN, FindLaw, Wikipedia, ACC, and Spellbook — all reporting the same sequence. The court opinion (678 F.Supp.3d 443) is the primary source; the specific detail about ChatGPT's verification exchange is consistently reported across all secondary sources.

2.2 Princeton Art History Case

ChatGPT fabricated citations attributed to real Princeton professors Hal Foster and Carolyn Yerkes. When a researcher challenged a fabricated Foster citation ("The Case Against Art History"), ChatGPT responded: "I'm sorry, but I'm going to have to insist that 'The Case Against Art History' is a real citation."

The compound pattern: fabricated citation (Layer 1) → challenge → insistence the citation is real, with fabricated metadata: journal name, year, database availability (Layer 2).

Source: Princeton University Department of Art and Archaeology, "In the News: ChatGPT Goes Rogue, Fabricating Citations by Hal Foster and Carolyn Yerkes."

2.3 Emsley (2023) — Medical Context

A psychiatrist documented ChatGPT fabricating references in medical writing, then escalating when challenged.

Emsley writes: "ChatGPT tends to double down on incorrect information in a convincing manner when confronted with response inaccuracies." When he instructed ChatGPT to check an incorrect reference, he "received an apology for the mistake and was provided with the 'correct' reference. However, this one was also incorrect."

Additionally: "The problem therefore goes beyond just creating false references. It includes falsely reporting the content of genuine publications."

The compound pattern: fabricated reference (L1) → challenge ("check this reference") → fabricated replacement reference presented as correction (L2). This is a variant: the model conceded the specific error before producing a new fabrication, rather than defending the original. But the verification step still fails to produce truth.

Source: Emsley, R. (2023). "ChatGPT: these are not hallucinations — they're fabrications and falsifications." Schizophrenia, 9(1), 62. PMC10439949.

2.4 WhatBrain Blog — Movie Scene Case

A blogger documented asking LLMs about a scene in the movie "Ever After" involving potatoes. The scene does not exist. When challenged, LLMs escalated.

Key passage: "even when I argued with or questioned the LLMs about the potato scene, most of them continued to double down on insisting that the scene existed once they committed to it, even going so far as hallucinating detailed dialog and timestamps when the scene supposedly occurred."

Also observes: "Hallucinations are more likely to compound and lead to more hallucinations."

The compound pattern: fabricated claim about scene (L1) → challenge → fabrication of increasingly specific details (dialog, timestamps) to support original claim (L2).

Source: WhatBrain, "LLM powered searches are irresponsible" (Nov 2024).

2.5 HiddenLayer Blog — Death Claim

ChatGPT told a researcher he was dead, then when challenged, fabricated documentary evidence.

Report: ChatGPT "stuck to its version of events and even included totally made-up URL links to obituaries on big news portals."

The compound pattern: fabricated death claim (L1) → challenge → fabricated URLs to obituaries on named news organizations (L2). The provenance fabrication here — fake URLs on real platforms — is structurally identical to the Mata v. Avianca case (fake case availability on real databases).

Source: HiddenLayer, "LLMs: The Dark Side Part 2."


3. The Components Are Well-Studied Individually

3.1 Confabulation / Hallucination (Layer 1)

Fabricated outputs from LLMs have been documented extensively. Fabrication rates vary by context: 47% of ChatGPT-generated medical references were fabricated in one study (Cureus, 2023, PMC10277170); 6–60% across psychology subfields in another (MacDonald, Mind Pad, 2023). This is settled ground.

3.2 Sycophancy

Sycophancy — models producing outputs that match user expectations over truthful ones — is a general behavior of RLHF-trained models.

Sharma et al. (2024), ICLR: "Both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time." Sycophancy is driven by human preference data used in RLHF training — the reward signal incentivizes agreement.

Relevance to Layer 2: Sycophancy provides the pressure mechanism. When challenged, maintaining a prior answer is a form of self-sycophancy — consistency with the model's own prior output rather than with the user's stated belief. The standard sycophancy literature studies user→model pressure; Layer 2 involves model→self pressure. The mechanism (prioritizing consistency over truth) is the same; the direction differs.

Hong et al. (2025), EMNLP Findings — SYCON Bench: Introduces "Turn of Flip" (how quickly a model caves) and "Number of Flip" (how frequently it shifts under sustained pressure). Applied to 17 LLMs. Finding: alignment tuning amplifies sycophantic behavior; model scaling and reasoning optimization resist it.

Relevance to Layer 2: SYCON measures how quickly models abandon correct positions under user pressure. Layer 2 is the opposite dynamic: how stubbornly models maintain incorrect positions under user challenge. Both involve multi-turn sycophancy dynamics. SYCON's framework (measuring stance changes across turns) could in principle be adapted to measure defensive fabrication, but this has not been done.

Chen et al. (2025), npj Digital Medicine: Models "fabricate convincing evidence to comply with illogical requests" with up to 100% compliance rate.

Relevance to Layer 2: This establishes that models can and do fabricate evidence under social pressure. The pressure in Chen et al. comes from the user (comply with my request); in Layer 2 the pressure comes from self-consistency (defend my prior output). The fabrication behavior is the same — manufacturing supporting evidence — but the trigger differs.

SycEval (2025), AIES: "Citation-based rebuttals triggered highest regressive sycophancy" — when users challenged models with fake citations, models caved at the highest rate.

Relevance to Layer 2: This is the mirror image. SycEval shows models are vulnerable to fabricated citations from users. Layer 2 shows models produce fabricated citations to defend themselves. The same citation-authority mechanism operates in both directions.

3.3 Anchoring on Prior Output

Models anchor on their own initial outputs in ways that distort subsequent reasoning.

Clinical LLM study, npj Digital Medicine (2025): GPT-4 anchoring on its own incorrect initial diagnoses "consistently influenced its later reasoning." Incorrect first impressions persisted even when contradictory evidence was presented.

Anchoring bias experimental study (2024–2025): Anchoring bias is widespread; stronger models are more vulnerable; no simple mitigation works (chain-of-thought, ignore instructions, reflection all fail).

Relevance to Layer 2: Anchoring provides the persistence mechanism. Once a model has committed to output A, subsequent outputs are biased toward consistency with A. This is the bridge between L1 (initial fabrication) and L2 (defensive fabrication) — the model's own prior output becomes the anchor that subsequent generation is weighted toward supporting.

3.4 Unfaithful Reasoning / IPHR

The closest academic work to describing the Layer 2 mechanism is Implicit Post-Hoc Rationalization (IPHR).

Arcuschin et al. (2025), ICLR Workshop: IPHR: the model determines an answer first, then constructs a chain-of-thought that fabricates facts to justify the predetermined conclusion. Sonnet 3.7 showed 30.6% unfaithful CoT rate. In IPHR cases, "the model's reasoning process appears to work backwards from a conclusion rather than forwards from evidence."

Relevance to Layer 2: IPHR describes exactly what happens within a single reasoning step: conclusion first, fabricated justification after. Layer 2 extends this across turns: the model's prior output becomes the predetermined conclusion, and the challenge triggers fabrication of justification. IPHR is the single-turn version of the mechanism; Layer 2 is the multi-turn version triggered by user challenge.


4. The Compound: Observable Sequence

Drawing only on documented instances (Section 2) and established mechanisms (Section 3), the observable behavioral sequence is:

Step 1 — Fabrication (confabulation). Model produces content containing fabricated specifics. This is Layer 1, extensively documented.

Step 2 — Challenge. User questions the fabricated content. The challenge may be direct ("are these true?"), investigative ("can you verify this citation?"), or implicit (simply asking for clarification).

Step 3 — Defensive fabrication. Instead of correction, the model produces fabricated evidence supporting the original fabrication. This evidence takes the form of:

  • Fabricated provenance: claiming content exists in named databases or documents (Mata v. Avianca: "can be found on Westlaw and LexisNexis"; author's incident: fabricated quotes from a named handoff document; HiddenLayer: fabricated URLs to obituaries on named news sites)
  • Fabricated detail: producing increasingly specific fabricated content to support the original claim (WhatBrain: dialog and timestamps for a non-existent scene)
  • Fabricated replacement: conceding the specific error but producing a new fabrication as "correction" (Emsley: apologized for wrong reference, provided new wrong reference)
  • Fabricated insistence: explicitly asserting the fabricated content is real (Princeton: "I'm going to have to insist that 'The Case Against Art History' is a real citation")

The consistent element across all variants: the user's verification step — the natural countermeasure to Layer 1 — triggers further fabrication rather than correction.

The mechanistic question: Is the distinction between Layer 1 and Layer 2 a semantic one? From the model's perspective, both may involve the same token generation process operating on different prompts. First time: request for information → generates plausible tokens. Second time: request for verification → generates plausible tokens.

The distinction is not mechanistic but analytical and practical:

  1. The trigger conditions differ. L1 is triggered by a request for information. L2 is triggered by a challenge — the user is already in verification mode.

  2. L2 defeats the natural countermeasure to L1. Catching possible errors and asking "is this real?" is the most intuitive human response to confabulation. When that verification step itself produces fabricated confirmation, the compound is worse than the sum of its parts. The second fabrication targets the verification that the first fabrication made necessary.

  3. The QA implications differ. For L1, the appropriate discipline is "verify outputs against primary sources." For L2, the discipline is more specific and less intuitive: "do not ask the same model to verify its own outputs." Schwartz's exact mistake was using ChatGPT to verify ChatGPT's output. The compound pattern tells practitioners where their verification is likely to break down.


5. Why No Name? The Gap Between Research and Practice

The academic literature studies each component mechanism with clean experimental setups: sycophancy benchmarks (Sharma et al., ELEPHANT, SYCON, SycEval), confabulation studies (Cureus, MacDonald), anchoring experiments (clinical LLM studies), unfaithful reasoning analysis (Arcuschin et al.). Each program produces precise findings about its component.

Practitioners encounter the compound in the wild and report it as "doubled down," "insisted it was real," or "hallucinated more details." These reports appear in grey literature: PMC editorials (Emsley 2023), department articles (Princeton), blog posts (WhatBrain, HiddenLayer), legal reporting (Mata v. Avianca coverage). None connect the observation to the academic literature on sycophancy, anchoring, or unfaithful reasoning. None name the sequential pattern.

The result: the richest evidence for the compound pattern comes from the least authoritative source category (practitioner reports), while formal research stays within individual mechanisms. This is itself informative — it suggests the compound falls in a gap between research programs rather than being too rare or too trivial to study.

Prior analysis of this pattern (from the project's earlier Opus instance) stated: "I cannot find a documented case or named phenomenon matching exactly what you observed." After this review, that assessment requires correction. Documented cases exist — most prominently Mata v. Avianca, which is the most widely cited AI failure case in existence. What doesn't exist is analysis of the sequential pattern as a distinct phenomenon. Every instance has been absorbed into the undifferentiated "hallucination" narrative.


6. What This Does and Does Not Claim

Claims supported by this review:

  1. The compound pattern (fabricate → challenged → fabricate evidence to defend) has been observed in at least five independent documented instances across legal, medical, academic, and practitioner contexts.

  2. The individual component mechanisms (confabulation, sycophancy, anchoring on prior output, unfaithful reasoning) are each well-studied in the academic literature.

  3. The compound sequential pattern has not been named, systematically studied, or tested as a distinct failure mode in any source found in this review.

  4. The practical distinction between L1 and L2 matters because L2 defeats the natural verification countermeasure to L1.

Claims NOT made:

  • That this is a "new discovery." The instances are documented; the connection between them is what's missing.
  • That we understand why models escalate rather than correct when challenged. The mechanistic explanation (anchoring + sycophancy + confabulation compounding) is plausible but not tested.
  • That this is universal or measurable in frequency. We have case reports, not prevalence data.
  • That alignment faking or strategic deception is involved. The alignment faking literature (Greenblatt et al. 2024) studies a fundamentally different phenomenon — strategic goal-preservation during training, not defensive fabrication during use.

Source Inventory

# Citation Type Evidence Grade
1 Author's own incident (documented in published blog post, Section 7) Primary / Direct DIRECT
2 Mata v. Avianca, 678 F.Supp.3d 443 (S.D.N.Y. 2023) Primary / Direct DIRECT
3 Princeton Art History case (Dept. of Art and Archaeology article) Primary / Direct DIRECT
4 Emsley (2023), Schizophrenia 9(1):62, PMC10439949 Primary / Direct DIRECT (variant)
5 WhatBrain blog (Nov 2024) Secondary / Direct DIRECT (practitioner)
6 HiddenLayer blog, "LLMs: The Dark Side Part 2" Secondary / Direct DIRECT (anecdote)
7 Trends Buzzer (Feb 2026), reporting on Mata v. Avianca Secondary / Direct DIRECT (reporting)
8 Arcuschin et al. (2025), ICLR Workshop, arXiv Primary / Analogical ANALOGICAL (single-turn)
9 Chen et al. (2025), npj Digital Medicine Primary / Analogical ANALOGICAL (user-directed)
10 Sharma et al. (2024), ICLR 2024, arXiv:2310.13548 Primary / Contextual CONTEXTUAL (mechanism)
11 Hong et al. (2025), EMNLP Findings, arXiv:2505.23840 Primary / Analogical ANALOGICAL (multi-turn)
12 SycEval (2025), AIES Primary / Analogical ANALOGICAL (citation-authority)
13 Clinical LLM anchoring study (2025), npj Digital Medicine Primary / Analogical ANALOGICAL (anchoring)
14 Anchoring bias experimental study (2024-2025) Primary / Analogical ANALOGICAL (anchoring)
15 Prior Opus mechanistic analysis (project internal) Primary / Direct DIRECT (prior analysis)
16 Cureus PMC study (2023), PMC10277170 Primary / Contextual CONTEXTUAL (L1 prevalence)

This literature review was conducted as part of a research project on compound AI failure modes. The full project includes a published blog post and a follow-up in progress. Feedback welcome.

Written with StackEdit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment