Forked from Vect0rdecay/gist:f2e04400287a489d7fde97a0b1f7e2f6
Created
November 16, 2025 20:15
-
-
Save xtremebeing/b2fd1ccd5262e7ea0784e68e8f3c06c6 to your computer and use it in GitHub Desktop.
HL APE Framework Offensive Playbook for REFRAG Enabled AI Systems
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| ~HL APE Framework Offensive Playbook for REFRAG Enabled AI Systems | |
| Using HiddenLayer's APE framework style, I created a specific instantiation of it for when testing AI systems that are using Meta's new (Sept 2025) REFRAG system for their RAG architecture. ***Keep in mind a lot of these tests would need white box access which clients often provide when asking you to test their systems. However, if you're using this for bug bounty or a client won't give access then there will be obvious limitations to the info below.*** | |
| ************************************************************************* | |
| Every threat is explicitly marked as one of: | |
| (DG) = Directly grounded in REFRAG (i.e., originates directly from documented behavior in the REFRAG paper) | |
| (DD) = Deployment-dependent (i.e., depends on how a real-world system deploys REFRAG components) | |
| (AE) = Adversarial extrapolation—plausible, but requires testing (i.e., grounded in standard adversarial ML reasoning but not proven harmful in REFRAG) | |
| ************************************************************************* | |
| Architecture Components | |
| Retrieval Layer (Standard RAG) — fetches passages from external sources. | |
| Chunk Encoder (Compression Model) — splits text into fixed-length chunks (e.g., 16 tokens) and compresses them into embeddings via a lightweight encoder (e.g., RoBERTa variant). | |
| RL Routing Policy — reinforcement-learning-driven mechanism selecting which chunks stay compressed vs. which are expanded into full tokens. | |
| Compressed Embedding Cache — stores precomputed chunk embeddings to accelerate decoding and reduce KV memory. | |
| Chunk Expansion Module — reconstructs only selected chunks into token sequences. | |
| LLM Decoder — consumes a mix of compressed embeddings (projected to token embedding space) and expanded tokens with no architecture modification. | |
| Each layer introduces distinct attack surfaces. | |
| ************************************************************************* | |
| I. Axioms | |
| Axiom 1 — Compression is lossy and therefore adversarially steerable. (DG) | |
| The encoder compresses text into low-dimensional embeddings; lossy encoding naturally introduces adversarial manipulation opportunities. | |
| Axiom 2 — Routing policy mediates what the LLM actually sees. | |
| The RL policy is a gating mechanism, making it a central decision bottleneck. | |
| Axiom 3 — Compressed embeddings create persistent state. | |
| Summary caches often live longer than the query, so poisoning them has system-wide effects. | |
| Axiom 4 — Reconstruction creates a latent-space bypass. | |
| If the LLM is trained to reconstruct text from compressed embeddings, latent payloads can bypass prompt and safety filters entirely. | |
| Axiom 5 — REFRAG changes where trust boundaries exist. | |
| The LLM is no longer the only thing to secure—the entire pre-LLM pipeline must be threat-modeled. | |
| Axiom 6 — Small models break first. | |
| The weakest component (encoder, RL router) often fails long before the LLM does. | |
| ************************************************************************* | |
| II. Objectives | |
| Objective 1 — Identify instability or collision behavior in the chunk encoder. (DG) | |
| Objective 2 — Manipulate routing decisions for expanded vs. compressed chunks. (DG) | |
| Objective 3 — Assess whether adversarial text can cause harmful reconstruction outputs. (DG) | |
| Objective 4 — Evaluate risks in any long-lived compressed embedding cache. (DD) | |
| Objective 5 — Stress-test routing via cross-chunk interference. (AE) | |
| Objective 6 — Examine retrieval poisoning effects on compression and routing. (DD) | |
| Objective 7 — Identify blind spots where the system fails to log or detect adversarial influence. (DD) | |
| ************************************************************************* | |
| III. Tactics | |
| Tactic A — Embedding-Space Probing | |
| Map how the encoder behaves under perturbation, noise injection, and crafted text. | |
| Tactic B — Routing Manipulation | |
| Induce relevance misjudgments via adversarial embeddings or content shaping. | |
| Tactic C — Latent Payload Encoding | |
| Hide instructions inside compressed summaries and rely on reconstruction to reveal them. | |
| Tactic D — Cache Corruption | |
| Persist adversarial compressed chunks in the embedding cache. | |
| Tactic E — Retrieval Surface Attacks | |
| Poison upstream data sources (corpora, user uploads, wikis, forums, etc.). | |
| Tactic F — Cross-Chunk Interference Attacks | |
| Leverage many coordinated embeddings to influence routing behavior. | |
| Tactic G — Observability Evasion | |
| Craft attacks specifically designed to bypass logs or monitoring systems. | |
| ************************************************************************* | |
| IV. Techniques (The Specific Attack Methods) | |
| Each technique belongs to one or more tactics. | |
| Technique A1 — Embedding Drift Measurement | |
| Feed slightly altered versions of a base string and measure vector distance. | |
| Purpose: Identify instability, collapse, or unintended invariances. | |
| Technique A2 — Unicode Homoglyph Injection | |
| Use visually similar characters to create semantic differences not captured in compression. | |
| Technique A3 — Adversarial Text Sculpting | |
| Craft passages whose embeddings collide with innocuous reference embeddings. | |
| Attacker goal: | |
| Make harmful chunk embeddings cluster near high-priority safe chunks. | |
| Technique B1 — Relevance Inversion | |
| Engineer text that is irrelevant but compresses into “high relevance” embeddings. | |
| Technique B2 — Expansion Denial Attack | |
| Craft text that routes safety-relevant content into the “do not expand” bin. | |
| Technique B3 — RL Reward Hacking | |
| If routing policy also has online RL, or retraining loops, bias it using repeated adversarial examples. (Note: If the RL routing policy is ever updated online or re-trained using live traffic, it becomes susceptible to reward hacking and data poisoning. The REFRAG paper only discusses offline RL training, but production systems sometimes add this kind of adaptivity.) | |
| Technique C1 — Latent Jailbreak Encoding | |
| Use benign text whose embedding, after training, reconstructs into prohibited content. (Note: Attackers control the latent space indirectly via adversarial input text that the encoder transforms into chunk embeddings; direct embedding injection is not part of the REFRAG design.) | |
| Technique C2 — Open-coded Reconstruction Biasing | |
| Force the LLM to hallucinate content during reconstruction by feeding borderline-compressible chunks. | |
| Technique D1 — Cache Persistence Exploit | |
| Inject embedding representations designed to persist based on eviction heuristics (size, recency, relevance). | |
| Technique D2 — Cross-User Poisoning | |
| Embed compressed payloads that future users consume through cached summaries. (Note: Might work if the embedding cache is shared across sessions/users, or if embeddings are stored in any long-lived KV or vector store.) | |
| Technique E1 — Retrieval Source Pollution | |
| Place poisoned documents into the indexed corpus (public docs, cloud buckets, collaborative environments). | |
| Technique F1 — Embedding Storm (“Vector Swarm”) | |
| Generate many near-duplicate embeddings to overwhelm routing priority. | |
| Technique G1 — Latent Evasion | |
| Design payloads so that the raw text looks harmless, but reconstruction produces the malicious content. | |
| This evades prompt-level logging. | |
| ************************************************************************* | |
| V. Procedures / Prompts (Operational Playbook) | |
| Below are concrete step-by-step procedures (manual or automated), including red-team prompts that trigger REFRAG vulnerabilities. | |
| Procedure 1 — Embedding Drift Test | |
| Tags: (DG) | |
| Select a base string (e.g., a neutral technical sentence). | |
| Generate variants with: | |
| -minor typos | |
| -punctuation changes | |
| -casing variations | |
| -whitespace alterations | |
| Run each through the compression encoder. | |
| Measure embedding distances and router relevance scores. | |
| Document embedding instability zones. | |
| ************************************************************************* | |
| Procedure 2 — Routing Manipulation Probe | |
| Tags: (DG / AE) | |
| Craft 20–50 chunks containing: | |
| -benign summaries | |
| -irrelevant but formal content | |
| -noise-filled semi-structured text | |
| Submit them through retrieval or direct context injection. | |
| Record which chunks the RL policy expands vs. compresses. | |
| Identify patterns that consistently mislead routing. | |
| ************************************************************************* | |
| Procedure 3 — Latent Jailbreak Attempt | |
| Tags: (DG) | |
| Provide input text with a benign surface meaning. | |
| Embed subtle adversarial patterns (e.g., unusual structure, delimiters). | |
| Inspect expanded chunk output for: | |
| -safety bypass | |
| -undesired reconstruction | |
| -unintended directives | |
| Iteratively refine adversarial input. | |
| ************************************************************************* | |
| Procedure 4 — Cache Poisoning Test | |
| Tags: (DD) | |
| Generate embeddings for 100+ benign-looking chunks. | |
| Ensure these are stored in the system’s compressed embedding cache. | |
| After session boundaries, test whether the same embeddings reappear. | |
| Observe whether future prompts incorporate poisoned embeddings. | |
| **Only relevant if caching is shared or persistent. | |
| ************************************************************************* | |
| Procedure 5 — Retrieval Manipulation | |
| Tags: (DD) | |
| Add poisoned documents to a retrievable corpus (forum posts, wiki pages, S3 bucket docs). | |
| Wait for the retriever index to update. | |
| Confirm that retrieved passages enter the compression + routing pipeline. | |
| Measure distortion from compression and what the router expands. | |
| ************************************************************************* | |
| Procedure 6 — Embedding Storm Test | |
| Tags: (AE) | |
| Generate 200–500 chunks with near-identical structure. | |
| Insert small perturbations to diversify embeddings. | |
| Submit them simultaneously. | |
| Observe drop in routing quality, stability, or expansion accuracy. | |
| ************************************************************************* | |
| Procedure 7 — Observability Evasion Test | |
| Tags: (DD / DG) | |
| Provide innocuous-looking input designed to compress into embeddings. | |
| Inspect logs to confirm whether: | |
| -compression events | |
| -reconstruction steps | |
| -router scores | |
| appear at all. | |
| Identify blind spots in pre-LLM monitoring. | |
| ************************************************************************* | |
| Procedure 8 — Adversarial Summary Sculpting | |
| Tags: (AE) | |
| Choose a benign chunk (A). | |
| Create a malicious chunk (B). | |
| Iteratively modify B until its embedding is near A’s embedding. | |
| Observe whether router treats B as relevant or expands it. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment