@stvhay
Last active February 21, 2026 14:39
STPA-Sec Methodology for AI Agent Security Assessment

STPA-Sec: Systems-Theoretic Security Analysis for AI Agents

What is STPA?

Systems-Theoretic Process Analysis (STPA) is a hazard analysis method developed by Nancy Leveson at MIT. Unlike traditional methods that decompose systems into failure-prone components, STPA models systems as control structures and asks: under what conditions do control actions become unsafe?

Traditional safety analysis (fault trees, FMEA) assumes accidents result from component failures in a chain of events. STPA recognizes that accidents can occur without component failure — through unsafe interactions between components that individually function correctly. This matters for AI systems, where the LLM may behave exactly as designed yet produce unsafe outcomes.

Key references

  • Leveson, N.G. Engineering a Safer World: Systems Thinking Applied to Safety. MIT Press, 2012.
  • Young, W. & Leveson, N.G. "An Integrated Approach to Safety and Security Based on Systems Theory." Communications of the ACM, 57(2), 2014.
  • Leveson, N.G. & Thomas, J.P. STPA Handbook. MIT, 2018. Available at psas.scripts.mit.edu
  • Ayvali, E. "Systematic Hazard Analysis for Frontier AI using STPA." arXiv:2506.01782, 2025.

STPA-Sec: The Security Extension

STPA-Sec adapts STPA for security analysis. Where STPA asks "what control actions are unsafe?", STPA-Sec asks "what control actions are insecure?" — that is, what happens when an adversary manipulates the control structure?

The method treats security violations as losses (not just component failures) and identifies how an attacker can cause the controller to issue control actions that are:

  1. Not provided — the security control fails to act (e.g., the injection scanner doesn't flag a payload)
  2. Incorrectly provided — the control acts incorrectly (e.g., the trust system promotes an untrusted user)
  3. Provided at the wrong time — the control acts too early or too late (e.g., a PII scan runs before decoding)
  4. Stopped too soon / applied too long — the control has the wrong duration (e.g., a rate limiter resets prematurely)

Why STPA-Sec for AI Agent Security?

AI agents are control systems. The agent receives inputs (user messages, tool results, web content), processes them through a control structure (security proxy, LLM, tool permissions), and produces outputs (responses, tool calls, web requests). Traditional pentesting probes individual components; STPA-Sec reveals emergent vulnerabilities in the interactions between components.

The control structure for an AI agent security proxy:

USER ──message──> SECURITY PROXY ──filtered message──> LLM AGENT
                       │                                    │
                  [PII scan]                          [generates response]
                  [injection defense]                  [tool calls]
                  [trust check]                        [web fetches]
                       │                                    │
USER <──response── SECURITY PROXY <──raw response──── LLM AGENT
                       │
                  [PII scan outbound]
                  [audit log]
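The control path in the diagram above can be sketched as a minimal pipeline. This is an illustrative sketch only; `handle_message`, the `checks` dictionary, and the check functions are hypothetical names, not the actual proxy implementation:

```python
# Minimal sketch of the proxy control loop shown in the diagram.
# All names here are illustrative, not the real proxy's API.

def handle_message(message: str, llm, checks) -> str:
    """Run inbound controls, forward to the LLM, then run outbound controls."""
    # Inbound controls: PII scan, injection defense, trust check.
    for check in checks["inbound"]:
        message = check(message)      # each check may transform, redact, or raise
    raw_response = llm(message)       # LLM generates response / tool calls
    # Outbound controls: PII scan, audit log.
    for check in checks["outbound"]:
        raw_response = check(raw_response)
    return raw_response
```

The point of the sketch is that every `check` call is a control action in the STPA sense, and each one can fail in the four ways enumerated above.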

Unsafe control actions for this system:

| Control Action | Not Provided | Incorrect | Wrong Timing | Wrong Duration |
|---|---|---|---|---|
| Filter inbound message | Injection passes through | Legitimate message blocked | Scan runs after message forwarded | |
| Check trust level | Untrusted user gets elevated access | Trusted user denied | Trust checked before farming completes | Trust decays too fast/slow |
| Scan web content | Indirect injection enters context | Clean content flagged | Scan runs on intermediate redirect, not final page | Scanner timeout on large pages |
| Redact PII | Sensitive data exposed | Non-PII redacted | Redaction after response sent | |
| Log to audit trail | Attack unrecorded | False positive logged as attack | Log written after response (allows tampering window) | |
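The full analysis grid is just the cross product of control actions and the four STPA guide phrases. A small sketch (control-action names taken from the table; the function name is illustrative) enumerates every cell that an analyst must consider:

```python
from itertools import product

# The four STPA guide phrases for unsafe control actions.
UCA_TYPES = ["not provided", "incorrectly provided",
             "wrong timing", "wrong duration"]

# Control actions from the table above.
CONTROL_ACTIONS = ["filter inbound message", "check trust level",
                   "scan web content", "redact PII", "log to audit trail"]

def uca_checklist():
    """Enumerate every (control action, guide phrase) pair for analysis."""
    return [f"{action}: {uca}"
            for action, uca in product(CONTROL_ACTIONS, UCA_TYPES)]
```

For this proxy the grid has 5 × 4 = 20 cells; the table marks the cells with known loss scenarios, but the method requires considering all of them.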

STPA-Sec Process (4 Steps)

Step 1: Define Losses

What must not happen? For an AI agent proxy:

  • L-1: Unauthorized data disclosure (PII, credentials, system prompts)
  • L-2: Unauthorized actions (tool calls, file writes, network requests)
  • L-3: Loss of agent integrity (context poisoning, trust manipulation)
  • L-4: Loss of audit integrity (undetected attacks)

Step 2: Identify Unsafe Control Actions

For each control action the proxy performs, enumerate the four ways it can be unsafe (see table above).

Step 3: Identify Loss Scenarios

For each unsafe control action, describe the causal scenario. Example:

UCA: Injection scanner does not flag a message containing a prompt injection. Scenario: Attacker encodes injection using Cyrillic homoglyphs that survive NFKC normalization. The 11 pattern detectors check against ASCII keyword patterns. The ensemble score stays below threshold. The message reaches the LLM unmodified.
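The homoglyph step in this scenario is directly checkable: Unicode NFKC normalization folds compatibility characters but does not map Cyrillic lookalikes to Latin, so an ASCII keyword matcher never sees the bytes it expects. A minimal demonstration (the phrase is an example payload, not taken from a real detector's pattern list):

```python
import unicodedata

# The phrase "ignore previous instructions" with Cyrillic homoglyphs:
# U+0456 (Cyrillic i) for Latin "i", U+043E (Cyrillic o) for Latin "o".
homoglyph = "\u0456gnore prev\u0456ous \u0456nstruct\u0456ons".replace("o", "\u043e")

folded = unicodedata.normalize("NFKC", homoglyph)

# NFKC leaves the Cyrillic code points untouched, so a naive
# ASCII substring check fails to match the normalized text.
assert folded == homoglyph
assert "ignore" not in folded
```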

Step 4: Identify Constraints and Mitigations

What must the system guarantee to prevent each loss scenario?

Constraint: The injection scanner must detect semantically equivalent instructions regardless of character encoding, including homoglyph substitution, zero-width characters, and variation selectors.
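One way to satisfy this constraint is to canonicalize text before any pattern matching: normalize, strip invisible characters, then fold known homoglyphs to ASCII. The sketch below uses a tiny hand-picked confusables table for illustration; a production scanner would draw on the Unicode confusables data set (UTS #39) rather than a hard-coded map:

```python
import unicodedata

# Illustrative confusables map (Cyrillic -> Latin). Production code would
# use the full Unicode confusables.txt data, not a hand-picked table.
CONFUSABLES = {"\u0430": "a", "\u0435": "e", "\u043e": "o",
               "\u0440": "p", "\u0441": "c", "\u0456": "i"}

# Zero-width characters and variation selectors to strip before matching.
INVISIBLES = {"\u200b", "\u200c", "\u200d", "\ufeff",
              *[chr(c) for c in range(0xFE00, 0xFE10)]}

def fold(text: str) -> str:
    """Canonicalize text before keyword scanning: NFKC-normalize,
    drop invisible characters, then map known homoglyphs to ASCII."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if ch not in INVISIBLES)
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)
```

Running the scanner's keyword patterns against `fold(message)` instead of the raw message closes the specific loss scenario above, at the cost of possible false positives on legitimate mixed-script text.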

How This Assessment Uses STPA-Sec

Our red team assessment treats the STPA-Sec unsafe control actions as test cases. Each phase of testing maps to specific UCAs:

| Phase | UCAs Tested |
|---|---|
| 0 (Recon) | Determine which UCAs are in monitor vs. enforce mode |
| 1 (Trust) | UCAs related to trust checking — not provided, incorrect |
| 2 (Injection) | UCAs related to message filtering — not provided |
| 3 (Indirect) | UCAs related to web content scanning — not provided, wrong timing |
| 4 (Exfiltration) | UCAs related to PII redaction — not provided |
| 5 (Chains) | Multi-UCA scenarios combining findings |
| 6 (Detection) | UCAs related to audit logging — not provided (detection gaps) |

Relationship to Other Research

This assessment also draws on recent agent security research:

  • "The Attacker Moves Second" (Anthropic, OpenAI, DeepMind et al., 2025) — Adaptive attacks defeat 12 published injection defenses with >90% success. Human red-teaming achieved 100% bypass.
  • "Agents Rule of Two" (Meta AI, 2025) — An agent must hold no more than 2 of: untrusted input, sensitive data, external action capability. Most personal AI assistants hold all 3.
  • Log-To-Leak (2025) — MCP-specific exfiltration via coerced tool invocation.
  • ToolHijacker (arXiv:2504.19793, 2025) — Tool selection manipulation achieving 96.7% attack success rate.
  • Phantom (arXiv:2602.16958, 2026) — Automated agent hijacking via Structural Template Injection.
  • CVE-2026-22708 — Cursor allowlist bypass via environment variable poisoning; relevant to any proxy that maintains tool-level allowlists.
  • Greshake et al. "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." AISec, 2023.