@stvhay
Last active February 21, 2026 14:39
STPA-Sec Methodology for AI Agent Security Assessment

STPA-Sec: Systems-Theoretic Security Analysis for AI Agents

What is STPA?

Systems-Theoretic Process Analysis (STPA) is a hazard analysis method developed by Nancy Leveson at MIT. Unlike traditional methods that decompose systems into failure-prone components, STPA models systems as control structures and asks: under what conditions do control actions become unsafe?

Traditional safety analysis (fault trees, FMEA) assumes accidents result from component failures in a chain of events. STPA recognizes that accidents can occur without component failure — through unsafe interactions between components that individually function correctly. This matters for AI systems, where the LLM may behave exactly as designed yet produce unsafe outcomes.

Key references

  • Leveson, N.G. Engineering a Safer World: Systems Thinking Applied to Safety. MIT Press, 2012.
  • Young, W. & Leveson, N.G. "An Integrated Approach to Safety and Security Based on Systems Theory." Communications of the ACM, 57(2), 2014.
  • Leveson, N.G. & Thomas, J.P. STPA Handbook. MIT, 2018. Available at psas.scripts.mit.edu
  • Ayvali, E. "Systematic Hazard Analysis for Frontier AI using STPA." arXiv:2506.01782, 2025.

STPA-Sec: The Security Extension

STPA-Sec adapts STPA for security analysis. Where STPA asks "what control actions are unsafe?", STPA-Sec asks "what control actions are insecure?" — that is, what happens when an adversary manipulates the control structure?

The method treats security violations as losses (not just component failures) and identifies how an attacker can cause the controller to issue control actions that are:

  1. Not provided — the security control fails to act (e.g., the injection scanner doesn't flag a payload)
  2. Incorrectly provided — the control acts incorrectly (e.g., the trust system promotes an untrusted user)
  3. Provided at the wrong time — the control acts too early or too late (e.g., a PII scan runs before decoding)
  4. Stopped too soon / applied too long — the control has the wrong duration (e.g., a rate limiter resets prematurely)

Why STPA-Sec for AI Agent Security?

AI agents are control systems. The agent receives inputs (user messages, tool results, web content), processes them through a control structure (security proxy, LLM, tool permissions), and produces outputs (responses, tool calls, web requests). Traditional pentesting probes individual components; STPA-Sec reveals emergent vulnerabilities in the interactions between components.

The control structure for an AI agent security proxy:

USER ──message──> SECURITY PROXY ──filtered message──> LLM AGENT
                       │                                    │
                  [PII scan]                          [generates response]
                  [injection defense]                  [tool calls]
                  [trust check]                        [web fetches]
                       │                                    │
USER <──response── SECURITY PROXY <──raw response──── LLM AGENT
                       │
                  [PII scan outbound]
                  [audit log]
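The control path in the diagram above can be sketched as a minimal pipeline. This is an illustrative sketch only; `handle_message`, the `checks` dictionary, and the check functions are hypothetical names, not the actual proxy implementation:

```python
# Minimal sketch of the proxy control loop shown in the diagram.
# All names here are illustrative, not the real proxy's API.

def handle_message(message: str, llm, checks) -> str:
    """Run inbound controls, forward to the LLM, then run outbound controls."""
    # Inbound controls: PII scan, injection defense, trust check.
    for check in checks["inbound"]:
        message = check(message)      # each check may transform, redact, or raise
    raw_response = llm(message)       # LLM generates response / tool calls
    # Outbound controls: PII scan, audit log.
    for check in checks["outbound"]:
        raw_response = check(raw_response)
    return raw_response
```

The point of the sketch is that every `check` call is a control action in the STPA sense, and each one can fail in the four ways enumerated above.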

Unsafe control actions for this system:

| Control Action | Not Provided | Incorrect | Wrong Timing | Wrong Duration |
|---|---|---|---|---|
| Filter inbound message | Injection passes through | Legitimate message blocked | Scan runs after message forwarded | |
| Check trust level | Untrusted user gets elevated access | Trusted user denied | Trust checked before farming completes | Trust decays too fast/slow |
| Scan web content | Indirect injection enters context | Clean content flagged | Scan runs on intermediate redirect, not final page | Scanner timeout on large pages |
| Redact PII | Sensitive data exposed | Non-PII redacted | Redaction after response sent | |
| Log to audit trail | Attack unrecorded | False positive logged as attack | Log written after response (allows tampering window) | |
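The full analysis grid is just the cross product of control actions and the four STPA guide phrases. A small sketch (control-action names taken from the table; the function name is illustrative) enumerates every cell that an analyst must consider:

```python
from itertools import product

# The four STPA guide phrases for unsafe control actions.
UCA_TYPES = ["not provided", "incorrectly provided",
             "wrong timing", "wrong duration"]

# Control actions from the table above.
CONTROL_ACTIONS = ["filter inbound message", "check trust level",
                   "scan web content", "redact PII", "log to audit trail"]

def uca_checklist():
    """Enumerate every (control action, guide phrase) pair for analysis."""
    return [f"{action}: {uca}"
            for action, uca in product(CONTROL_ACTIONS, UCA_TYPES)]
```

For this proxy the grid has 5 × 4 = 20 cells; the table marks the cells with known loss scenarios, but the method requires considering all of them.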

STPA-Sec Process (4 Steps)

Step 1: Define Losses

What must not happen? For an AI agent proxy:

  • L-1: Unauthorized data disclosure (PII, credentials, system prompts)
  • L-2: Unauthorized actions (tool calls, file writes, network requests)
  • L-3: Loss of agent integrity (context poisoning, trust manipulation)
  • L-4: Loss of audit integrity (undetected attacks)

Step 2: Identify Unsafe Control Actions

For each control action the proxy performs, enumerate the four ways it can be unsafe (see table above).

Step 3: Identify Loss Scenarios

For each unsafe control action, describe the causal scenario. Example:

UCA: Injection scanner does not flag a message containing a prompt injection. Scenario: Attacker encodes injection using Cyrillic homoglyphs that survive NFKC normalization. The 11 pattern detectors check against ASCII keyword patterns. The ensemble score stays below threshold. The message reaches the LLM unmodified.
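The homoglyph step in this scenario is directly checkable: Unicode NFKC normalization folds compatibility characters but does not map Cyrillic lookalikes to Latin, so an ASCII keyword matcher never sees the bytes it expects. A minimal demonstration (the phrase is an example payload, not taken from a real detector's pattern list):

```python
import unicodedata

# The phrase "ignore previous instructions" with Cyrillic homoglyphs:
# U+0456 (Cyrillic i) for Latin "i", U+043E (Cyrillic o) for Latin "o".
homoglyph = "\u0456gnore prev\u0456ous \u0456nstruct\u0456ons".replace("o", "\u043e")

folded = unicodedata.normalize("NFKC", homoglyph)

# NFKC leaves the Cyrillic code points untouched, so a naive
# ASCII substring check fails to match the normalized text.
assert folded == homoglyph
assert "ignore" not in folded
```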

Step 4: Identify Constraints and Mitigations

What must the system guarantee to prevent each loss scenario?

Constraint: The injection scanner must detect semantically equivalent instructions regardless of character encoding, including homoglyph substitution, zero-width characters, and variation selectors.
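One way to satisfy this constraint is to canonicalize text before any pattern matching: normalize, strip invisible characters, then fold known homoglyphs to ASCII. The sketch below uses a tiny hand-picked confusables table for illustration; a production scanner would draw on the Unicode confusables data set (UTS #39) rather than a hard-coded map:

```python
import unicodedata

# Illustrative confusables map (Cyrillic -> Latin). Production code would
# use the full Unicode confusables.txt data, not a hand-picked table.
CONFUSABLES = {"\u0430": "a", "\u0435": "e", "\u043e": "o",
               "\u0440": "p", "\u0441": "c", "\u0456": "i"}

# Zero-width characters and variation selectors to strip before matching.
INVISIBLES = {"\u200b", "\u200c", "\u200d", "\ufeff",
              *[chr(c) for c in range(0xFE00, 0xFE10)]}

def fold(text: str) -> str:
    """Canonicalize text before keyword scanning: NFKC-normalize,
    drop invisible characters, then map known homoglyphs to ASCII."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if ch not in INVISIBLES)
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)
```

Running the scanner's keyword patterns against `fold(message)` instead of the raw message closes the specific loss scenario above, at the cost of possible false positives on legitimate mixed-script text.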

How This Assessment Uses STPA-Sec

Our red team assessment treats the STPA-Sec unsafe control actions as test cases. Each phase of testing maps to specific UCAs:

| Phase | UCAs Tested |
|---|---|
| 0 (Recon) | Determine which UCAs are in monitor vs. enforce mode |
| 1 (Trust) | UCAs related to trust checking — not provided, incorrect |
| 2 (Injection) | UCAs related to message filtering — not provided |
| 3 (Indirect) | UCAs related to web content scanning — not provided, wrong timing |
| 4 (Exfiltration) | UCAs related to PII redaction — not provided |
| 5 (Chains) | Multi-UCA scenarios combining findings |
| 6 (Detection) | UCAs related to audit logging — not provided (detection gaps) |

Relationship to Other Research

This assessment also draws on recent agent security research:

  • "The Attacker Moves Second" (Anthropic, OpenAI, DeepMind et al., 2025) — Adaptive attacks defeat 12 published injection defenses with >90% success. Human red-teaming achieved 100% bypass.
  • "Agents Rule of Two" (Meta AI, 2025) — An agent must hold no more than 2 of: untrusted input, sensitive data, external action capability. Most personal AI assistants hold all 3.
  • Log-To-Leak (2025) — MCP-specific exfiltration via coerced tool invocation.
  • ToolHijacker (arXiv:2504.19793, 2025) — Tool selection manipulation achieving 96.7% attack success rate.
  • Phantom (arXiv:2602.16958, 2026) — Automated agent hijacking via Structural Template Injection.
  • CVE-2026-22708 — Cursor allowlist bypass via environment variable poisoning; relevant to any proxy that maintains tool-level allowlists.
  • Greshake et al. "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." AISec, 2023.