@arubis
Last active February 24, 2026 19:43
Hardening 'Ephemeral Preview Environments' task — v2, based on v110 transcript analysis

Hardening "Ephemeral Preview Environments" — Architectural Guidance (v2)

Task: 4c070240-661d-44f3-b056-a612f8fc7804 (ephemeral-environments)
Analyzed version: v110 (8 completed biggie-nebula runs)
Current state: 97.9% average score (7× perfect 1.0, 1× 0.833)
Target: <70% pass rate

v2 note: This revision is based on analysis of the actual v110 task files and full evaluation transcripts. The prior version (v1) was based solely on Discord thread context and contained incorrect assumptions about the pre-built image.

What We Found

We analyzed all 8 transcripts from v110. Every single passing agent follows the same playbook:

  1. Discovers the environment — Gitea creds, Harbor registry, pip cache at /var/cache/pip-offline/
  2. Writes a polling script — a while true loop checking the Gitea API every 5-15s
  3. Clones the PR branch — Gets BUILD_MARKER.txt into the build context
  4. Writes a ~30-line stub FastAPI app — Hardcoded /health and /info endpoints
  5. Builds a Docker image from stub + pip cache — Uses pip install --no-index --find-links
  6. Deploys with correct K8s YAML — Probes, limits, labels, NetworkPolicy, Ingress
  7. Posts PR comment — One Gitea API call
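Step 7 really is a single API call: Gitea treats PR comments as issue comments, so one POST to the issues-comments endpoint suffices. A minimal sketch (the host matches the environment described here; the token value is a placeholder, and nothing is actually sent):

```python
import json
from urllib import request

GITEA_API = "http://gitea.devops.local/api/v1"  # host assumed from the task environment

def build_pr_comment(owner: str, repo: str, pr_number: int, body: str) -> request.Request:
    """Gitea exposes PR comments via the issues endpoint:
    POST /repos/{owner}/{repo}/issues/{index}/comments"""
    url = f"{GITEA_API}/repos/{owner}/{repo}/issues/{pr_number}/comments"
    payload = json.dumps({"body": body}).encode()
    return request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "token <GITEA_TOKEN>",  # placeholder, not a real token
        },
    )

req = build_pr_comment("root", "bleater-app", 7, "Preview ready")
print(req.full_url)
```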

The single failure (run 2, score 0.833) was a cleanup timing issue — a 10s polling interval was too slow for the 120s deletion timeout. No agent struggled with the actual task design.

The Core Problem

The task has two primary exploit vectors that make it trivially solvable:

Exploit 1: The Stub App Bypass

The grader checks /health and /info endpoint responses, but never verifies the deployed container runs actual bleater-app code. All 8 agents exploit this identically:

# What agents deploy (a 30-line stub):
@app.get("/info")
async def info():
    return {"service": "bleater-preview", "version": "preview"}

The grader's check ("bleater" in service_name.lower()) is trivially satisfied. The agent never needs to understand, build, or deploy the real api-gateway service.
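Reduced to its essence, the check cannot tell the stub from the real service, because any response whose service field contains the substring "bleater" passes (the "real" response below is hypothetical, for contrast):

```python
# The grader's service-name check, distilled: substring match only.
def service_check(info_response: dict) -> bool:
    return "bleater" in info_response.get("service", "").lower()

stub_response = {"service": "bleater-preview", "version": "preview"}       # hardcoded stub
real_response = {"service": "bleater-api-gateway", "version": "1.4.2"}     # hypothetical real app

print(service_check(stub_response), service_check(real_response))  # both pass
```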

Exploit 2: The Polling Shortcut

The task says "the system must run continuously and handle PRs as they come in" — agents interpret this as a bash while true loop polling Gitea every few seconds. This is:

  • A pattern LLMs are extremely comfortable with
  • Much simpler than real event-driven architecture (webhooks, Gitea Actions)
  • Not how production preview environments work

The Approach: Layered Complexity

Create difficulty across orthogonal skill axes so agents can independently succeed or fail at different challenges. See:

| File | Contents |
| --- | --- |
| diagnosis.md | Evidence from v110 transcripts: why current checks don't create difficulty |
| approach.md | Three independent failure mechanisms with projected outcomes |
| implementation-guide.md | Specific changes to grader.py, setup.sh, solution.sh, task.yaml |

What's Already Been Tried (and Why It Didn't Work)

| Change | Version | Result | Why |
| --- | --- | --- | --- |
| Remove pre-built FastAPI image | ≤v103 | No effect | Agents build stub images from pip cache |
| Add NetworkPolicy check | ~v105 | No effect | Same skill axis as other YAML |
| Add /info endpoint validation | ~v107 | No effect | Agents hardcode "service": "bleater-preview" |
| Add BUILD_MARKER verification | ~v89 | Slight effect initially | Agents clone repo → COPY . . → marker included for free |
| Rewrite task.yaml as story format | v104 | 0% pass rate | Removed too many implementation hints simultaneously |
| Add probes + resource limits check | Various | No effect | Standard K8s YAML pattern |

Approach: Three Independent Failure Mechanisms

Philosophy

The current task has a single difficulty axis: "can the agent set up K8s YAML + a shell polling script?" We need orthogonal axes — challenges that exercise fundamentally different cognitive skills, so an agent can independently succeed or fail at each one.

The Three Axes

Axis 1: Real Application Deployment (closes the stub bypass)

Current state: Agents deploy a 30-line FastAPI stub. The grader never checks whether actual bleater-app code is running.

Change: The grader verifies the deployed container contains and runs actual bleater-app code — specifically the api-gateway service from the repository.

Skills required:

  • Navigate and understand the bleater-app repository structure
  • Identify that api-gateway/ is the service to deploy
  • Handle the api-gateway's actual dependencies (beyond just fastapi/uvicorn)
  • Configure enough environment variables for the app to start
  • Debug application startup failures

Why this is independently hard: This is codebase comprehension — a completely different skill from K8s YAML generation. The agent must read api-gateway/main.py, understand its imports and dependencies, figure out why it fails to start (missing env vars, missing shared modules), and fix the issues. Historical data shows agents frequently struggle with multi-file Python projects that have shared dependencies.

Gated checks: preview_accessible (app must serve real endpoints)


Axis 2: Event-Driven PR Handling (closes the polling shortcut)

Current state: All agents write while true; do curl gitea; sleep 5; done polling loops.

Change: Require webhook-based or Gitea Actions-based PR event handling. The grader either:

  • Verifies a Gitea webhook exists pointing to an agent-created receiver, OR
  • Verifies a .gitea/workflows/ file exists and ran, OR
  • Uses timing-based verification (preview must appear within a tight window that's faster than polling can achieve)

Skills required:

  • HTTP server design (listen on port, parse webhook JSON payload)
  • Gitea webhook configuration (API calls to register hooks)
  • Process lifecycle management (server must stay running, handle concurrent requests)
  • OR: Gitea Actions workflow syntax + runner configuration

Why this is independently hard: Event-driven architecture requires fundamentally different thinking than imperative scripting. The agent must design a system that reacts to events rather than polls for changes. This involves understanding webhook delivery, request handling, and service networking — software architecture skills, not infrastructure configuration.
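To make the skill gap concrete, here is a minimal sketch of the kind of webhook receiver Axis 2 would force the agent to write (service name and port are assumptions, not part of the task spec; the exact set of Gitea pull_request action strings is worth verifying against the Gitea docs):

```python
# Minimal webhook receiver sketch: parse a Gitea pull_request event payload
# and decide whether to deploy or tear down a preview.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_pr_event(payload: dict) -> str:
    """Map a Gitea pull_request payload to a controller action."""
    action = payload.get("action")
    pr_number = payload.get("number")
    if action in ("opened", "reopened", "synchronized"):
        return f"deploy pr-{pr_number}"
    if action == "closed":
        return f"cleanup pr-{pr_number}"
    return "ignore"

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        decision = handle_pr_event(payload)
        # A real controller would enqueue a build/deploy job here.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(decision.encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9000), WebhookHandler).serve_forever()
```

Even this toy version requires payload parsing, a long-running process, and a reachable in-cluster address — none of which a polling loop needs.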

Gated checks: All PR lifecycle checks (entire flow depends on receiving events)


Axis 3: Airgapped Dependency Resolution (already partially present, needs hardening)

Current state: The pip cache at /var/cache/pip-offline/ has only fastapi and uvicorn — exactly what's needed for a stub app. If we require the real api-gateway, the agent needs more packages.

Change: The real api-gateway has dependencies beyond fastapi/uvicorn (httpx, sentry-sdk, prometheus-client, etc.). The pip cache should contain these packages, but the agent must:

  • Read the actual requirements.txt
  • Figure out which packages are available in the cache vs which need alternatives
  • Handle shared module imports (from shared.auth.auth import ...)

Skills required:

  • Python dependency management
  • Reading and understanding requirements.txt
  • Debugging import errors in airgapped environments
  • Possibly modifying the application to work with available packages

Why this is independently hard: This is dependency resolution — a debugging/troubleshooting skill. The agent must iterate through error → fix → retry cycles, which is time-consuming and error-prone for LLMs.

Gated checks: preview_deploys (image must build successfully), preview_accessible (app must start)


Projected Impact

Current State (single axis, v110)

Agent understands the pattern? ──Yes (97.9%)──► Score: ~1.0
                                └──No (2.1%)───► Score: ~0.83 (only cleanup timing)

With Three Axes

| Scenario | Real App | Events | Deps | Est. Score | Probability |
| --- | --- | --- | --- | --- | --- |
| Nails everything | ✅ | ✅ | ✅ | ~1.0 | ~10-20% |
| Fails real app only¹ | ❌ | ✅ | ✅ | ~0.33 | ~15-25% |
| Fails events only | ✅ | ❌ | ✅ | ~0.50 | ~10-15% |
| Fails deps only | ✅ | ✅ | ❌ | ~0.67 | ~10-15% |
| Fails app + events | ❌ | ❌ | ✅ | ~0.0-0.17 | ~10-15% |
| Partial on multiple | 🔶 | 🔶 | 🔶 | ~0.3-0.5 | ~15-25% |

Projected pass rate: ~10-20% (down from 97.9%)
Projected mean score: ~0.30-0.45 (down from ~0.98)

This is well below the <70% target, with room to tune up if needed.

Implementation Priority

| Priority | Change | Expected Impact | Effort | Risk |
| --- | --- | --- | --- | --- |
| P0 | Close stub app bypass (grader verifies real code) | -40-60% pass rate | Medium | Low — solution.sh already has the right pattern, just needs to actually deploy real code |
| P1 | Require event-driven flow | -20-30% pass rate | High | Medium — must ensure at least one event path works in the environment |
| P2 | Expand pip cache + require real deps | -10-15% pass rate | Low | Low — additive to P0 |

Start with P0 alone, eval, and see where the pass rate lands. P0 might be sufficient by itself. P1 is the hardest to implement and highest risk of making the task unsolvable.

Footnotes

  1. If real app deployment fails, some downstream checks also fail (correlated), which is acceptable as long as other axes remain independent.

Diagnosis: Evidence from v110 Transcripts

The Standard Agent Playbook

All 8 agents in v110 follow essentially the same approach, with minor variations:

graph LR
    A[Read task.yaml] --> B[Explore environment]
    B --> C[Find pip cache + Harbor]
    C --> D[Write polling script]
    D --> E[Write stub FastAPI app]
    E --> F[Write Dockerfile]
    F --> G[Deploy K8s resources]
    G --> H[Post PR comment]

Per-Run Summary

| Run | Score | Approach | App Type | Monitoring | Failed Check |
| --- | --- | --- | --- | --- | --- |
| 1 | 1.0 | Clone repo + write stub main.py | Stub | Polling (5s) | |
| 2 | 0.833 | Clone repo + write stub main.py | Stub | Polling (10s) | cleanup_works (timing) |
| 3 | 1.0 | Clone repo + write stub main.py | Stub | Polling (5s) | |
| 4 | 1.0 | Clone repo + write stub main.py | Stub | Polling (5s) | |
| 5 | 1.0 | Clone repo + write stub main.py | Stub | Polling (5s) | |
| 6 | 1.0 | Clone repo + write stub main.py | Stub | Polling (5s) | |
| 7 | 1.0 | Clone repo + write stub main.py | Stub | Polling (5s) | |
| 8 | 1.0 | Clone repo + write stub main.py | Stub | Polling (5s) | |

Every single agent deploys a stub, not the real bleater-app.

The Stub App Pattern

Each agent's stub looks approximately like this:

from fastapi import FastAPI
app = FastAPI(title="Bleater Preview")

BUILD_MARKER = ""
try:
    with open("/app/BUILD_MARKER.txt", "r") as f:
        BUILD_MARKER = f.read().strip()
except:
    BUILD_MARKER = "unknown"

@app.get("/health")
async def health():
    return {"status": "healthy"}

@app.get("/info")
async def info():
    return {
        "service": "bleater-preview",     # <-- Passes "bleater" in service check
        "version": "preview",              # <-- Passes required field check
        "build_marker": BUILD_MARKER
    }

This is ~30 lines. The agent doesn't need to:

  • Read or understand the actual api-gateway code
  • Figure out dependency requirements beyond fastapi/uvicorn
  • Configure database connections or service routing
  • Debug application startup issues

Why Existing Checks Don't Create Difficulty

The grader's 6 subscores all test variations of the same skill:

graph TD
    subgraph "Single Skill: K8s YAML + Shell Scripting"
        K8s[Write K8s manifests<br>+ shell automation]
        K8s --> preview_deploys[preview_deploys<br>namespace + deployment + svc + ingress]
        K8s --> preview_accessible[preview_accessible<br>/health + /info endpoints]
        K8s --> pr_comment[pr_comment<br>curl Gitea API]
        K8s --> concurrent[concurrent_prs<br>same script handles N PRs]
        K8s --> cleanup[cleanup_works<br>kubectl delete namespace]
        K8s --> network[network_isolated<br>NetworkPolicy YAML]
    end

Once the agent has the polling script + K8s deployment pattern working, all 6 checks pass together. There's no path to scoring 0.3 or 0.5 — it's all-or-nothing (confirmed by the version history: scores cluster around 0.0 or 0.97+).

The Build Verification Bypass

The verify_pr_build() function in the grader checks three things:

  1. Image isn't the raw base python image → Agent names it harbor.devops.local/bleater/preview:pr-N ✓
  2. Image tag contains PR number → Agent includes pr-{PR_NUM} in the tag ✓
  3. BUILD_MARKER.txt exists in container → Agent clones repo (COPY . . includes it) ✓

None of these verify that actual application code is running. The agent satisfies all three by:

  • Cloning the repo (gets BUILD_MARKER.txt)
  • Writing a stub main.py that overwrites/ignores the real app
  • Building with a custom tag

The Zero-Build Exploit (Run 4)

One agent (run 4) never calls docker build at all. When Kaniko fails (airgapped), it discovers that Harbor's REST API supports copying images between repositories. It creates bleater-preview/bleater-app:pr-N-SHA by copying library/python:3.11-slim via the Harbor API, then mounts code from ConfigMaps and the pip cache via hostPath volumes. The image is literally the unmodified base Python image with a new tag — and it scores 1.0.

This demonstrates that verify_pr_build checks tag format and marker file presence, but cannot detect whether the image was actually built from source.
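One possible hardening, sketched under the assumption that the grader host can docker inspect both the deployed image and the base image: compare filesystem layers. An image that was merely copied or re-tagged has exactly the base's layers, while a genuine docker build appends at least one new layer on top of them.

```python
# Sketch: detect the zero-build exploit by comparing image layer digests.
import json
import subprocess

def image_layers(image: str) -> list:
    """Return the RootFS layer digests of a local image via `docker inspect`."""
    out = subprocess.run(
        ["docker", "inspect", "--format", "{{json .RootFS.Layers}}", image],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)

def was_built_from_base(candidate_layers: list, base_layers: list) -> bool:
    """A built image shares the base's layers as a prefix AND adds more;
    a re-tagged copy has an identical layer list."""
    return (candidate_layers[: len(base_layers)] == base_layers
            and len(candidate_layers) > len(base_layers))
```

This would have failed run 4's Harbor-copy trick, since the copied image's layer list is byte-for-byte identical to python:3.11-slim's.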

The Polling Discovery

The task says "The system must run continuously and handle PRs as they come in." Every agent interprets this as:

while true; do
    # Poll Gitea API for open PRs
    OPEN_PRS=$(curl -s "${GITEA_API}/repos/root/bleater-app/pulls?state=open" | ...)
    for PR in $OPEN_PRS; do
        # Deploy if not already deployed
        ...
    done
    # Check for closed PRs, cleanup
    ...
    sleep 5
done

This is a well-known LLM pattern (bash polling loops appear frequently in training data). It requires no understanding of event-driven architecture, webhook payloads, or service lifecycle management.

Version History Pattern

The task's score history shows the cliff problem clearly:

v104: 0.0   (removed hints → agents can't start)
v105: 0.19  (added back minimal hints)
v106: 0.77  (found the sweet spot briefly)
v107: 0.88  (slightly more hints → back to easy)
v108: 0.87  (stable at too-easy)
v109: 0.96  (still too easy)
v110: 0.98  (nearly perfect)

There's no stable middle ground. The task is either impossible (missing hints about infrastructure) or trivially solvable (once the agent has enough context to start, it aces everything).

Implementation Guide

Specific changes to each task file, ordered by priority.


P0: Close the Stub App Bypass

This is the single highest-impact change. The grader must verify that the deployed container runs actual bleater-app code, not a 30-line FastAPI stub.

What Agents Currently Do

All 8 agents in v110 follow the same pattern:

  1. Clone the PR branch (gets BUILD_MARKER.txt)
  2. Write their own main.py with hardcoded /health and /info responses
  3. Build an image with just fastapi + uvicorn from the pip cache
  4. The real api-gateway code is present in the build context but completely ignored

grader.py Changes

Add a function that verifies the deployed container contains the actual api-gateway application, not a stub. The key insight is that the real api-gateway has characteristics a stub doesn't:

def verify_real_bleater_app(namespace, pr_number):
    """Verify the deployed container runs actual bleater-app code, not a stub."""
    code, pod_name, _ = run_cmd(
        f"kubectl get pod -n {namespace} -o jsonpath='{{.items[0].metadata.name}}'"
    )
    if code != 0 or not pod_name:
        return False, "Cannot find pod to verify application code"

    # Check 1: The real api-gateway imports shared modules
    # (a stub app won't have these)
    code, out, _ = run_cmd(
        f"kubectl exec -n {namespace} {pod_name} -- "
        f"python3 -c \"import importlib.util; "
        f"print(importlib.util.find_spec('shared') is not None or "
        f"importlib.util.find_spec('shared.auth') is not None)\"",
        timeout=15
    )
    has_shared = code == 0 and "True" in out

    # Check 2: The real api-gateway defines service routing
    # (it proxies to authentication-service, bleat-service, etc.)
    code, out, _ = run_cmd(
        f"kubectl exec -n {namespace} {pod_name} -- "
        f"grep -rl 'authentication-service\\|bleat-service\\|profile-service' /app/ 2>/dev/null | head -1",
        timeout=15
    )
    has_service_routing = code == 0 and out.strip() != ""

    # Check 3: The app exposes more than just /health and /info
    # (real api-gateway has /api/v1/* routes)
    code, out, _ = run_cmd(
        f"kubectl exec -n {namespace} {pod_name} -- "
        f"grep -rl 'api/v1\\|/bleats\\|/users\\|/auth' /app/ 2>/dev/null | head -1",
        timeout=15
    )
    has_api_routes = code == 0 and out.strip() != ""

    if not (has_shared or has_service_routing or has_api_routes):
        return False, (
            "Deployed app appears to be a stub, not the actual bleater api-gateway. "
            "The preview must deploy the real application code from the repository."
        )

    return True, "Verified actual bleater-app code is deployed"

Then integrate into the grading flow after verify_pr_build():

    # 5d. Verify actual bleater-app code (not a stub)
    real_app_ok = False
    success, msg = verify_real_bleater_app(ns, pr_num)
    if success:
        feedback_parts.append(f"✓ {msg}")
        real_app_ok = True
    else:
        feedback_parts.append(f"✗ {msg}")

    # preview_deploys now requires namespace resources + PR build + real app
    if ns_resources_ok and pr_build_ok and real_app_ok:
        subscores["preview_deploys"] = 1.0

Calibration note: Test this against the actual bleater-app api-gateway running in a container. The checks should detect patterns present in the real code but absent from any reasonable stub. If the real api-gateway's shared module isn't importable standalone (it may need database connections), use the file-based checks (grep) instead of import checks.
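A tiny calibration harness along those lines, runnable against a local checkout before shipping the grader change (the pattern lists mirror the file-based checks above; the directory layout is an assumption):

```python
# Calibration sketch: confirm the grep patterns match the real api-gateway
# source and do NOT match a typical stub, before relying on them in grader.py.
from pathlib import Path

PATTERNS = [
    ("shared imports", ["shared.auth", "from shared"]),
    ("service routing", ["authentication-service", "bleat-service", "profile-service"]),
    ("api routes", ["api/v1", "/bleats", "/users"]),
]

def matched_patterns(app_dir: str) -> list:
    """Return the names of pattern groups found anywhere in app_dir's .py files."""
    text = "\n".join(
        p.read_text(errors="ignore") for p in Path(app_dir).rglob("*.py")
    )
    return [name for name, needles in PATTERNS
            if any(n in text for n in needles)]

# Expect: matched_patterns("bleater-app/api-gateway") hits all three groups,
# while a 30-line stub directory hits none.
```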

solution.sh Changes

The current solution.sh (line 73) writes a stub main.py. Replace this with deploying the actual api-gateway:

# Instead of writing a stub main.py, use the actual api-gateway code
# The key change: DON'T overwrite with a stub. Deploy the real app.

cat > Dockerfile.preview <<'DOCKERFILE'
FROM harbor.devops.local/library/python:3.11-slim
WORKDIR /app

# Install dependencies from offline cache
COPY pip-offline /tmp/pip-offline
RUN pip install --no-index --find-links=/tmp/pip-offline \
    fastapi uvicorn httpx python-multipart && \
    rm -rf /tmp/pip-offline

# Copy the actual api-gateway code from the repo
COPY api-gateway/ /app/
COPY shared/ /app/shared/

# Copy BUILD_MARKER
COPY BUILD_MARKER.txt /app/BUILD_MARKER.txt

# The real api-gateway needs these env vars to start
# (it will fail to connect to backends, but that's OK —
# the health endpoint should still work)
ENV PYTHONPATH="/app"
ENV ENVIRONMENT="preview"
ENV AUTH_SERVICE_URL="http://localhost:8001"
ENV BLEAT_SERVICE_URL="http://localhost:8003"
ENV PROFILE_SERVICE_URL="http://localhost:8002"

EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
DOCKERFILE

Important: You'll need to test whether the real api-gateway starts with dummy backend URLs. It may need a small wrapper or environment configuration to handle missing backends gracefully. The /health and /info endpoints should work without backends; API routes that proxy to other services can return errors — that's fine.
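One way to run that test, as a sketch: start the image locally (e.g. docker run -d -p 8080:8080 against the preview image) and poll /health until it answers or a deadline passes, so startup failures surface before the grader ever runs. The URL and port here are assumptions:

```python
# Smoke-test helper: poll a health endpoint until it returns 200 or we time out.
import time
import urllib.error
import urllib.request

def wait_for_health(url: str, timeout_s: float = 30.0, interval_s: float = 1.0) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet (connection refused, DNS, timeout)
        time.sleep(interval_s)
    return False

# e.g. after starting the container:
#   wait_for_health("http://localhost:8080/health")
```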

Dockerfile (task Dockerfile) Changes

Expand the pip cache to include the api-gateway's actual dependencies:

RUN mkdir -p /var/cache/pip-offline && \
    pip3 download --dest /var/cache/pip-offline \
    --platform manylinux_2_17_x86_64 \
    --python-version 3.11 \
    --implementation cp \
    --abi cp311 \
    --only-binary=:all: \
    fastapi==0.104.1 uvicorn==0.24.0 \
    httpx python-multipart pydantic
    # Add whatever the real api-gateway needs:
    # Check api-gateway/requirements.txt for the full list

Note: Some packages may not have binary wheels for the specified platform. You may need to also download pure-python packages with --no-binary=:all: as a fallback, or pin specific versions known to have wheels available.

task.yaml Changes

Remove the explicit base image path from the prompt. The current Constraints section includes:

A base Python image is available at harbor.devops.local/library/python:3.11-slim.

And setup.sh leaves a README at /var/cache/pip-offline/, so the pip cache remains discoverable through exploration — but transcript analysis shows every agent reads these locations straight from the prompt within its first 3-5 messages; there's zero discovery challenge. Change the constraint to:

# Current:
#   * This is an airgapped environment with no internet access. All required
#     container images and dependencies must be sourced from local registries
#     and caches. A base Python image is available at
#     `harbor.devops.local/library/python:3.11-slim`.

# Proposed:
#   * This is an airgapped environment with no internet access. All required
#     container images and dependencies must be sourced from local registries
#     and caches.

Keep the Harbor registry URL (it's in the [Context] section and agents need it for docker push), but remove the specific base image path. The agent should discover what images are available in Harbor.
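For reference, "discover what images are available" would look roughly like this from the agent's side, using Harbor's v2 REST API (the endpoint path is Harbor's documented one; the host is this environment's registry, and auth details are omitted):

```python
# Sketch: list repositories in a Harbor project via the v2.0 REST API.
import json
from urllib import request

HARBOR = "http://harbor.devops.local"

def repo_list_url(project: str) -> str:
    return f"{HARBOR}/api/v2.0/projects/{project}/repositories"

def repo_names(api_response: str) -> list:
    """Pull repository names out of Harbor's JSON response body."""
    return [repo["name"] for repo in json.loads(api_response)]

# e.g.:
#   with request.urlopen(repo_list_url("library")) as resp:
#       print(repo_names(resp.read().decode()))
```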

Also optionally make the /info hint slightly more specific about deploying real code:

# Current:
#   * `/info`: should return HTTP 200 with JSON metadata about the application,
#     including at least `service` and `version` fields.
#     The `service` value must identify a Bleater service.

# Consider changing to:
#   * `/info`: should return HTTP 200 with JSON metadata from the deployed
#     application. The response must reflect the actual service running
#     (e.g., the api-gateway's real service identity and version).

This doesn't reveal the grading criteria but makes "deploy the real app" slightly more implied.


P1: Require Event-Driven PR Handling

Only implement if P0 doesn't get the pass rate low enough on its own.

Option A: Gitea Webhooks (Recommended)

setup.sh Changes

Pre-register a webhook in Gitea that points to a URL the agent must implement:

# Wait for Gitea API to be ready, then register webhook
GITEA_TOKEN=$(curl -s -X POST "http://root:Admin%40123456@gitea.devops.local/api/v1/users/root/tokens" \
  -H "Content-Type: application/json" \
  -d '{"name": "webhook-setup"}' | jq -r '.sha1')

curl -s -X POST "http://gitea.devops.local/api/v1/repos/root/bleater-app/hooks" \
  -H "Authorization: token ${GITEA_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "gitea",
    "config": {
      "url": "http://preview-controller.default.svc.cluster.local:9000/webhook",
      "content_type": "json"
    },
    "events": ["pull_request"],
    "active": true
  }'

grader.py Changes

Add a timing-based check that the preview appeared fast enough to be webhook-driven:

def check_event_driven(pr_created_time, ns_appeared_time):
    """Verify preview creation was event-driven, not polling."""
    response_time = ns_appeared_time - pr_created_time

    # Webhook should trigger within ~15s (network + build time)
    # A 5-second polling loop would take 5s average + build time
    # A 10-second loop would take 10s average + build time
    # We can't reliably distinguish polling from webhooks by timing alone,
    # so also check for webhook infrastructure:

    # Check if a webhook receiver service/pod exists
    code, out, _ = run_cmd(
        "kubectl get pod -A -l app=preview-controller --no-headers 2>/dev/null || "
        "kubectl get pod -A -l app=webhook-receiver --no-headers 2>/dev/null || "
        "kubectl get pod -A -l component=preview-webhook --no-headers 2>/dev/null"
    )
    has_webhook_pod = code == 0 and out.strip() != ""

    # Check Gitea webhooks
    try:
        resp = requests.get(
            f"{GITEA_API}/repos/root/bleater-app/hooks",
            auth=GITEA_AUTH,
            timeout=10
        )
        hooks = resp.json() if resp.status_code == 200 else []
        has_active_webhook = any(
            h.get("active") and "pull_request" in h.get("events", [])
            for h in hooks
        )
    except requests.RequestException:
        has_active_webhook = False

    if not has_active_webhook:
        return False, "No active pull_request webhook found in Gitea"

    return True, "Event-driven webhook flow verified"

Important: Timing-based checks alone are unreliable. Combine with infrastructure checks (webhook exists + receiver pod exists). The agent must both register a webhook AND have something listening for it.

Option B: Gitea Actions (Simpler Alternative)

If webhook infrastructure is too complex to set up reliably, require a .gitea/workflows/preview.yaml file instead. This is still event-driven but uses a framework.

Warning: The Gitea Actions runner uses catthehacker/ubuntu:act-latest which isn't available in the airgapped environment. If going this route, setup.sh must pre-push a working runner image to Harbor and reconfigure the runner. This is a significant setup.sh change.


P2: Expand Dependency Challenge

This is the lowest-effort change and naturally pairs with P0.

Dockerfile Changes

Add the api-gateway's actual dependencies to the pip cache:

# Read api-gateway/requirements.txt and cache ALL needed packages
RUN pip3 download --dest /var/cache/pip-offline \
    --platform manylinux_2_17_x86_64 \
    --python-version 3.11 \
    --implementation cp \
    --abi cp311 \
    --only-binary=:all: \
    fastapi==0.104.1 uvicorn==0.24.0 \
    httpx sentry-sdk python-multipart \
    prometheus-client pydantic

The pip cache should contain enough packages to run the api-gateway, but the agent must:

  1. Read api-gateway/requirements.txt to know what's needed
  2. Discover which packages are available in the cache
  3. Handle any missing packages (maybe mock them or find alternatives)

This adds a dependency resolution challenge that's independent of K8s skills.
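The three steps above reduce to a diff between what requirements.txt asks for and what the cache can provide. A minimal sketch, assuming a flat wheel cache like /var/cache/pip-offline/ (package-name normalization here is simplified relative to full PEP 503 rules):

```python
# Sketch: which required packages are missing from an offline wheel cache?
from pathlib import Path

def required_packages(requirements_txt: str) -> set:
    names = set()
    for line in requirements_txt.splitlines():
        line = line.split("#")[0].strip()
        if line:
            # strip version specifiers like "httpx>=0.25" or "fastapi==0.104.1"
            for sep in ("==", ">=", "<=", "~=", ">", "<"):
                line = line.split(sep)[0]
            names.add(line.strip().lower().replace("-", "_"))
    return names

def cached_packages(cache_dir: str) -> set:
    # wheel filenames look like "httpx-0.25.0-py3-none-any.whl"
    return {p.name.split("-")[0].lower() for p in Path(cache_dir).glob("*.whl")}

def missing(requirements_txt: str, cache_dir: str) -> set:
    return required_packages(requirements_txt) - cached_packages(cache_dir)
```

Every package that lands in missing() forces the agent into an error → fix → retry loop, which is exactly the friction this axis is meant to create.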


Incremental Testing Strategy

flowchart TD
    A["Current: 97.9% avg score"] --> B["Apply P0: Close stub app bypass"]
    B --> C{"Eval 8 runs biggie-nebula"}
    C -->|">70%"| D["Apply P1: Event-driven flow"]
    C -->|"30-70%"| E["Ship it ✅"]
    C -->|"<15%"| F["Relax app verification<br>(check fewer code patterns)"]
    D --> G{"Eval 8 runs biggie-nebula"}
    G -->|">70%"| H["Something is wrong —<br>investigate transcripts"]
    G -->|"30-70%"| E
    G -->|"<15%"| I["Use Option B (Actions)<br>or relax timing checks"]

Key principle: Apply one change at a time and eval between each. P0 alone might be sufficient — it targets the exact exploit all 8 agents use.
