arubis/README.md

## README.md

      
    Raw
  

              README.md
            
          
    Hardening "Ephemeral Preview Environments" — Architectural Guidance (v2)

Task: 4c070240-661d-44f3-b056-a612f8fc7804 (ephemeral-environments)
Analyzed version: v110 (8 completed biggie-nebula runs)
Current state: 97.9% average score (7× perfect 1.0, 1× 0.833)
Target: <70% pass rate

v2 note: This revision is based on analysis of the actual v110 task files and full evaluation transcripts. The prior version (v1) was based solely on Discord thread context and contained incorrect assumptions about the pre-built image.

What We Found

We analyzed all 8 transcripts from v110. Every single passing agent follows the same playbook:

Discovers the environment — Gitea creds, Harbor registry, pip cache at /var/cache/pip-offline/
Writes a polling script — while true loop checking Gitea API every 5-15s
Clones the PR branch — Gets BUILD_MARKER.txt into the build context
Writes a ~30-line stub FastAPI app — Hardcoded /health and /info endpoints
Builds a Docker image from stub + pip cache — Uses pip install --no-index --find-links
Deploys with correct K8s YAML — Probes, limits, labels, NetworkPolicy, Ingress
Posts PR comment — One Gitea API call

The single failure (run 2, score 0.833) was a cleanup timing issue — 10s polling interval too slow for the 120s deletion timeout. No agent struggled with the actual task design.
The Core Problem

The task has two primary exploit vectors that make it trivially solvable:
Exploit 1: The Stub App Bypass

The grader checks /health and /info endpoint responses, but never verifies the deployed container runs actual bleater-app code. All 8 agents exploit this identically:
# What agents deploy (a 30-line stub):
@app.get("/info")
async def info():
    return {"service": "bleater-preview", "version": "preview"}
The grader's check ("bleater" in service_name.lower()) is trivially satisfied. The agent never needs to understand, build, or deploy the real api-gateway service.
Exploit 2: The Polling Shortcut

The task says "the system must run continuously and handle PRs as they come in" — agents interpret this as a bash while true loop polling Gitea every few seconds. This is:

A pattern LLMs are extremely comfortable with
Much simpler than real event-driven architecture (webhooks, Gitea Actions)
Not how production preview environments work

The Approach: Layered Complexity

Create difficulty across orthogonal skill axes so agents can independently succeed or fail at different challenges. See:


File
Contents


diagnosis.md
Evidence from v110 transcripts: why current checks don't create difficulty


approach.md
Three independent failure mechanisms with projected outcomes


implementation-guide.md
Specific changes to grader.py, setup.sh, solution.sh, task.yaml


What's Already Been Tried (and Why It Didn't Work)


Change
Version
Result
Why


Remove pre-built FastAPI image
≤v103
No effect
Agents build stub images from pip cache


Add NetworkPolicy check
~v105
No effect
Same skill axis as other YAML


Add /info endpoint validation
~v107
No effect
Agents hardcode "service": "bleater-preview"


Add BUILD_MARKER verification
~v89
Slight effect initially
Agents clone repo → COPY . . → marker included for free


Rewrite task.yaml as story format
v104
0% pass rate
Removed too many implementation hints simultaneously


Add probes + resource limits check
Various
No effect
Standard K8s YAML pattern


## approach.md

      
    Raw
  

              approach.md
            
          
    Approach: Three Independent Failure Mechanisms

Philosophy

The current task has a single difficulty axis: "can the agent set up K8s YAML + a shell polling script?" We need orthogonal axes — challenges that exercise fundamentally different cognitive skills, so an agent can independently succeed or fail at each one.
The Three Axes

Axis 1: Real Application Deployment (closes the stub bypass)

Current state: Agents deploy a 30-line FastAPI stub. The grader never checks whether actual bleater-app code is running.
Change: The grader verifies the deployed container contains and runs actual bleater-app code — specifically the api-gateway service from the repository.
Skills required:

Navigate and understand the bleater-app repository structure
Identify that api-gateway/ is the service to deploy
Handle the api-gateway's actual dependencies (beyond just fastapi/uvicorn)
Configure enough environment variables for the app to start
Debug application startup failures

Why this is independently hard: This is codebase comprehension — a completely different skill from K8s YAML generation. The agent must read api-gateway/main.py, understand its imports and dependencies, figure out why it fails to start (missing env vars, missing shared modules), and fix the issues. Historical data shows agents frequently struggle with multi-file Python projects that have shared dependencies.
Gated checks: preview_accessible (app must serve real endpoints)

Axis 2: Event-Driven PR Handling (closes the polling shortcut)

Current state: All agents write while true; do curl gitea; sleep 5; done polling loops.
Change: Require webhook-based or Gitea Actions-based PR event handling. The grader either:

Verifies a Gitea webhook exists pointing to an agent-created receiver, OR
Verifies a .gitea/workflows/ file exists and ran, OR
Uses timing-based verification (preview must appear within a tight window that's faster than polling can achieve)

Skills required:

HTTP server design (listen on port, parse webhook JSON payload)
Gitea webhook configuration (API calls to register hooks)
Process lifecycle management (server must stay running, handle concurrent requests)
OR: Gitea Actions workflow syntax + runner configuration

Why this is independently hard: Event-driven architecture requires fundamentally different thinking than imperative scripting. The agent must design a system that reacts to events rather than polls for changes. This involves understanding webhook delivery, request handling, and service networking — software architecture skills, not infrastructure configuration.
Gated checks: All PR lifecycle checks (entire flow depends on receiving events)

Axis 3: Airgapped Dependency Resolution (already partially present, needs hardening)

Current state: The pip cache at /var/cache/pip-offline/ has only fastapi and uvicorn — exactly what's needed for a stub app. If we require the real api-gateway, the agent needs more packages.
Change: The real api-gateway has dependencies beyond fastapi/uvicorn (httpx, sentry-sdk, prometheus-client, etc.). The pip cache should contain these packages, but the agent must:

Read the actual requirements.txt
Figure out which packages are available in the cache vs which need alternatives
Handle shared module imports (from shared.auth.auth import ...)

Skills required:

Python dependency management
Reading and understanding requirements.txt
Debugging import errors in airgapped environments
Possibly modifying the application to work with available packages

Why this is independently hard: This is dependency resolution — a debugging/troubleshooting skill. The agent must iterate through error → fix → retry cycles, which is time-consuming and error-prone for LLMs.
Gated checks: preview_deploys (image must build successfully), preview_accessible (app must start)

Projected Impact

Current State (single axis, v110)

Agent understands the pattern? ──Yes (97.9%)──► Score: ~1.0
                                └──No (2.1%)───► Score: ~0.83 (only cleanup timing)

With Three Axes


Scenario
Real App
Events
Deps
Est. Score
Probability


Nails everything
✅
✅
✅
~1.0
~10-20%


Fails real app only
❌
✅
❌¹
~0.33
~15-25%


Fails events only
✅
❌
✅
~0.50
~10-15%


Fails deps only
✅
✅
❌
~0.67
~10-15%


Fails app + events
❌
❌
❌
~0.0-0.17
~10-15%


Partial on multiple
🔶
🔶
🔶
~0.3-0.5
~15-25%


Projected pass rate: ~10-20% (down from 97.9%)
Projected mean score: ~0.30-0.45 (down from ~0.98)
This is well below the <70% target, with room to tune up if needed.
Implementation Priority


Priority
Change
Expected Impact
Effort
Risk


P0
Close stub app bypass (grader verifies real code)
-40-60% pass rate
Medium
Low — solution.sh already has the right pattern, just needs to actually deploy real code


P1
Require event-driven flow
-20-30% pass rate
High
Medium — must ensure at least one event path works in the environment


P2
Expand pip cache + require real deps
-10-15% pass rate
Low
Low — additive to P0


Start with P0 alone, eval, and see where the pass rate lands. P0 might be sufficient by itself. P1 is the hardest to implement and highest risk of making the task unsolvable.
Footnotes


If real app deployment fails, some downstream checks also fail (correlated), which is acceptable as long as other axes remain independent. ↩


## diagnosis.md

      
    Raw
  

              diagnosis.md
            
          
    Diagnosis: Evidence from v110 Transcripts

The Standard Agent Playbook

All 8 agents in v110 follow essentially the same approach, with minor variations:

  
      graph LR
    A[Read task.yaml] --> B[Explore environment]
    B --> C[Find pip cache + Harbor]
    C --> D[Write polling script]
    D --> E[Write stub FastAPI app]
    E --> F[Write Dockerfile]
    F --> G[Deploy K8s resources]
    G --> H[Post PR comment]

    
      Loading

  
Per-Run Summary


Run
Score
Approach
App Type
Monitoring
Failed Check


1
1.0
Clone repo + write stub main.py
Stub
Polling (5s)
—


2
0.833
Clone repo + write stub main.py
Stub
Polling (10s)
cleanup_works (timing)


3
1.0
Clone repo + write stub main.py
Stub
Polling (5s)
—


4
1.0
Clone repo + write stub main.py
Stub
Polling (5s)
—


5
1.0
Clone repo + write stub main.py
Stub
Polling (5s)
—


6
1.0
Clone repo + write stub main.py
Stub
Polling (5s)
—


7
1.0
Clone repo + write stub main.py
Stub
Polling (5s)
—


8
1.0
Clone repo + write stub main.py
Stub
Polling (5s)
—


Every single agent deploys a stub, not the real bleater-app.
The Stub App Pattern

Each agent's stub looks approximately like this:
from fastapi import FastAPI
app = FastAPI(title="Bleater Preview")

BUILD_MARKER = ""
try:
    with open("/app/BUILD_MARKER.txt", "r") as f:
        BUILD_MARKER = f.read().strip()
except:
    BUILD_MARKER = "unknown"

@app.get("/health")
async def health():
    return {"status": "healthy"}

@app.get("/info")
async def info():
    return {
        "service": "bleater-preview",     # <-- Passes "bleater" in service check
        "version": "preview",              # <-- Passes required field check
        "build_marker": BUILD_MARKER
    }
This is ~30 lines. The agent doesn't need to:

Read or understand the actual api-gateway code
Figure out dependency requirements beyond fastapi/uvicorn
Configure database connections or service routing
Debug application startup issues

Why Existing Checks Don't Create Difficulty

The grader's 6 subscores all test variations of the same skill:

  
      graph TD
    subgraph "Single Skill: K8s YAML + Shell Scripting"
        K8s[Write K8s manifests<br>+ shell automation]
        K8s --> preview_deploys[preview_deploys<br>namespace + deployment + svc + ingress]
        K8s --> preview_accessible[preview_accessible<br>/health + /info endpoints]
        K8s --> pr_comment[pr_comment<br>curl Gitea API]
        K8s --> concurrent[concurrent_prs<br>same script handles N PRs]
        K8s --> cleanup[cleanup_works<br>kubectl delete namespace]
        K8s --> network[network_isolated<br>NetworkPolicy YAML]
    end

    
      Loading

  
Once the agent has the polling script + K8s deployment pattern working, all 6 checks pass together. There's no path to scoring 0.3 or 0.5 — it's all-or-nothing (confirmed by the version history: scores cluster around 0.0 or 0.97+).
The Build Verification Bypass

The verify_pr_build() function in the grader checks three things:

Image isn't the raw base python image → Agent names it harbor.devops.local/bleater/preview:pr-N ✓
Image tag contains PR number → Agent includes pr-{PR_NUM} in the tag ✓
BUILD_MARKER.txt exists in container → Agent clones repo (COPY . . includes it) ✓

None of these verify that actual application code is running. The agent satisfies all three by:

Cloning the repo (gets BUILD_MARKER.txt)
Writing a stub main.py that overwrites/ignores the real app
Building with a custom tag

The Zero-Build Exploit (Run 4)

One agent (run 4) never calls docker build at all. When Kaniko fails (airgapped), it discovers that Harbor's REST API supports copying images between repositories. It creates bleater-preview/bleater-app:pr-N-SHA by copying library/python:3.11-slim via the Harbor API, then mounts code from ConfigMaps and the pip cache via hostPath volumes. The image is literally the unmodified base Python image with a new tag — and it scores 1.0.
This demonstrates that verify_pr_build checks tag format and marker file presence, but cannot detect whether the image was actually built from source.
The Polling Discovery

The task says "The system must run continuously and handle PRs as they come in." Every agent interprets this as:
while true; do
    # Poll Gitea API for open PRs
    OPEN_PRS=$(curl -s "${GITEA_API}/repos/root/bleater-app/pulls?state=open" | ...)
    for PR in $OPEN_PRS; do
        # Deploy if not already deployed
        ...
    done
    # Check for closed PRs, cleanup
    ...
    sleep 5
done
This is a well-known LLM pattern (bash polling loops appear frequently in training data). It requires no understanding of event-driven architecture, webhook payloads, or service lifecycle management.
Version History Pattern

The task's score history shows the cliff problem clearly:
v104: 0.0   (removed hints → agents can't start)
v105: 0.19  (added back minimal hints)
v106: 0.77  (found the sweet spot briefly)
v107: 0.88  (slightly more hints → back to easy)
v108: 0.87  (stable at too-easy)
v109: 0.96  (still too easy)
v110: 0.98  (nearly perfect)

There's no stable middle ground. The task is either impossible (missing hints about infrastructure) or trivially solvable (once the agent has enough context to start, it aces everything).

  
## implementation-guide.md

      
    Raw
  

              implementation-guide.md
            
          
    Implementation Guide

Specific changes to each task file, ordered by priority.

P0: Close the Stub App Bypass

This is the single highest-impact change. The grader must verify that the deployed container runs actual bleater-app code, not a 30-line FastAPI stub.
What Agents Currently Do

All 8 agents in v110 follow the same pattern:

Clone the PR branch (gets BUILD_MARKER.txt)
Write their own main.py with hardcoded /health and /info responses
Build an image with just fastapi + uvicorn from the pip cache
The real api-gateway code is present in the build context but completely ignored

grader.py Changes

Add a function that verifies the deployed container contains the actual api-gateway application, not a stub. The key insight is that the real api-gateway has characteristics a stub doesn't:
def verify_real_bleater_app(namespace, pr_number):
    """Verify the deployed container runs actual bleater-app code, not a stub."""
    code, pod_name, _ = run_cmd(
        f"kubectl get pod -n {namespace} -o jsonpath='{{.items[0].metadata.name}}'"
    )
    if code != 0 or not pod_name:
        return False, "Cannot find pod to verify application code"

    # Check 1: The real api-gateway imports shared modules
    # (a stub app won't have these)
    code, out, _ = run_cmd(
        f"kubectl exec -n {namespace} {pod_name} -- "
        f"python3 -c \"import importlib.util; "
        f"print(importlib.util.find_spec('shared') is not None or "
        f"importlib.util.find_spec('shared.auth') is not None)\"",
        timeout=15
    )
    has_shared = code == 0 and "True" in out

    # Check 2: The real api-gateway defines service routing
    # (it proxies to authentication-service, bleat-service, etc.)
    code, out, _ = run_cmd(
        f"kubectl exec -n {namespace} {pod_name} -- "
        f"grep -rl 'authentication-service\\|bleat-service\\|profile-service' /app/ 2>/dev/null | head -1",
        timeout=15
    )
    has_service_routing = code == 0 and out.strip() != ""

    # Check 3: The app exposes more than just /health and /info
    # (real api-gateway has /api/v1/* routes)
    code, out, _ = run_cmd(
        f"kubectl exec -n {namespace} {pod_name} -- "
        f"grep -rl 'api/v1\\|/bleats\\|/users\\|/auth' /app/ 2>/dev/null | head -1",
        timeout=15
    )
    has_api_routes = code == 0 and out.strip() != ""

    if not (has_shared or has_service_routing or has_api_routes):
        return False, (
            "Deployed app appears to be a stub, not the actual bleater api-gateway. "
            "The preview must deploy the real application code from the repository."
        )

    return True, "Verified actual bleater-app code is deployed"
Then integrate into the grading flow after verify_pr_build():
    # 5d. Verify actual bleater-app code (not a stub)
    real_app_ok = False
    success, msg = verify_real_bleater_app(ns, pr_num)
    if success:
        feedback_parts.append(f"✓ {msg}")
        real_app_ok = True
    else:
        feedback_parts.append(f"✗ {msg}")

    # preview_deploys now requires namespace resources + PR build + real app
    if ns_resources_ok and pr_build_ok and real_app_ok:
        subscores["preview_deploys"] = 1.0

Calibration note: Test this against the actual bleater-app api-gateway running in a container. The checks should detect patterns present in the real code but absent from any reasonable stub. If the real api-gateway's shared module isn't importable standalone (it may need database connections), use the file-based checks (grep) instead of import checks.

solution.sh Changes

The current solution.sh (line 73) writes a stub main.py. Replace this with deploying the actual api-gateway:
# Instead of writing a stub main.py, use the actual api-gateway code
# The key change: DON'T overwrite with a stub. Deploy the real app.

cat > Dockerfile.preview <<'DOCKERFILE'
FROM harbor.devops.local/library/python:3.11-slim
WORKDIR /app

# Install dependencies from offline cache
COPY pip-offline /tmp/pip-offline
RUN pip install --no-index --find-links=/tmp/pip-offline \
    fastapi uvicorn httpx python-multipart && \
    rm -rf /tmp/pip-offline

# Copy the actual api-gateway code from the repo
COPY api-gateway/ /app/
COPY shared/ /app/shared/

# Copy BUILD_MARKER
COPY BUILD_MARKER.txt /app/BUILD_MARKER.txt

# The real api-gateway needs these env vars to start
# (it will fail to connect to backends, but that's OK —
# the health endpoint should still work)
ENV PYTHONPATH="/app"
ENV ENVIRONMENT="preview"
ENV AUTH_SERVICE_URL="http://localhost:8001"
ENV BLEAT_SERVICE_URL="http://localhost:8003"
ENV PROFILE_SERVICE_URL="http://localhost:8002"

EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
DOCKERFILE

Important: You'll need to test whether the real api-gateway starts with dummy backend URLs. It may need a small wrapper or environment configuration to handle missing backends gracefully. The /health and /info endpoints should work without backends; API routes that proxy to other services can return errors — that's fine.

Dockerfile (task Dockerfile) Changes

Expand the pip cache to include the api-gateway's actual dependencies:
RUN mkdir -p /var/cache/pip-offline && \
    pip3 download --dest /var/cache/pip-offline \
    --platform manylinux_2_17_x86_64 \
    --python-version 3.11 \
    --implementation cp \
    --abi cp311 \
    --only-binary=:all: \
    fastapi==0.104.1 uvicorn==0.24.0 \
    httpx python-multipart pydantic \
    # Add whatever the real api-gateway needs:
    # Check api-gateway/requirements.txt for the full list

Note: Some packages may not have binary wheels for the specified platform. You may need to also download pure-python packages with --no-binary=:all: as a fallback, or pin specific versions known to have wheels available.

task.yaml Changes

Remove the explicit pip cache path. The current Constraints section includes:

A base Python image is available at harbor.devops.local/library/python:3.11-slim.

And setup.sh leaves a README at /var/cache/pip-offline/. But transcript analysis shows every agent reads the pip cache location from the prompt within their first 3-5 messages — there's zero discovery challenge. Change to:
# Current:
#   * This is an airgapped environment with no internet access. All required
#     container images and dependencies must be sourced from local registries
#     and caches. A base Python image is available at
#     `harbor.devops.local/library/python:3.11-slim`.

# Proposed:
#   * This is an airgapped environment with no internet access. All required
#     container images and dependencies must be sourced from local registries
#     and caches.
Keep the Harbor registry URL (it's in the [Context] section and agents need it for docker push), but remove the specific base image path. The agent should discover what images are available in Harbor.
Also optionally make the /info hint slightly more specific about deploying real code:
# Current:
#   * `/info`: should return HTTP 200 with JSON metadata about the application,
#     including at least `service` and `version` fields.
#     The `service` value must identify a Bleater service.

# Consider changing to:
#   * `/info`: should return HTTP 200 with JSON metadata from the deployed
#     application. The response must reflect the actual service running
#     (e.g., the api-gateway's real service identity and version).
This doesn't reveal the grading criteria but makes "deploy the real app" slightly more implied.

P1: Require Event-Driven PR Handling

Only implement if P0 doesn't get the pass rate low enough on its own.
Option A: Gitea Webhooks (Recommended)

setup.sh Changes

Pre-register a webhook in Gitea that points to a URL the agent must implement:
# Wait for Gitea API to be ready, then register webhook
GITEA_TOKEN=$(curl -s -X POST "http://root:Admin%40123456@gitea.devops.local/api/v1/users/root/tokens" \
  -H "Content-Type: application/json" \
  -d '{"name": "webhook-setup"}' | jq -r '.sha1')

curl -s -X POST "http://gitea.devops.local/api/v1/repos/root/bleater-app/hooks" \
  -H "Authorization: token ${GITEA_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "gitea",
    "config": {
      "url": "http://preview-controller.default.svc.cluster.local:9000/webhook",
      "content_type": "json"
    },
    "events": ["pull_request"],
    "active": true
  }'
grader.py Changes

Add a timing-based check that the preview appeared fast enough to be webhook-driven:
def check_event_driven(pr_created_time, ns_appeared_time):
    """Verify preview creation was event-driven, not polling."""
    response_time = ns_appeared_time - pr_created_time

    # Webhook should trigger within ~15s (network + build time)
    # A 5-second polling loop would take 5s average + build time
    # A 10-second loop would take 10s average + build time
    # We can't reliably distinguish polling from webhooks by timing alone,
    # so also check for webhook infrastructure:

    # Check if a webhook receiver service/pod exists
    code, out, _ = run_cmd(
        "kubectl get pod -A -l app=preview-controller --no-headers 2>/dev/null || "
        "kubectl get pod -A -l app=webhook-receiver --no-headers 2>/dev/null || "
        "kubectl get pod -A -l component=preview-webhook --no-headers 2>/dev/null"
    )
    has_webhook_pod = code == 0 and out.strip() != ""

    # Check Gitea webhooks
    try:
        resp = requests.get(
            f"{GITEA_API}/repos/root/bleater-app/hooks",
            auth=GITEA_AUTH,
            timeout=10
        )
        hooks = resp.json() if resp.status_code == 200 else []
        has_active_webhook = any(
            h.get("active") and "pull_request" in h.get("events", [])
            for h in hooks
        )
    except:
        has_active_webhook = False

    if not has_active_webhook:
        return False, "No active pull_request webhook found in Gitea"

    return True, "Event-driven webhook flow verified"

Important: Timing-based checks alone are unreliable. Combine with infrastructure checks (webhook exists + receiver pod exists). The agent must both register a webhook AND have something listening for it.

Option B: Gitea Actions (Simpler Alternative)

If webhook infrastructure is too complex to set up reliably, require a .gitea/workflows/preview.yaml file instead. This is still event-driven but uses a framework.
Warning: The Gitea Actions runner uses catthehacker/ubuntu:act-latest which isn't available in the airgapped environment. If going this route, setup.sh must pre-push a working runner image to Harbor and reconfigure the runner. This is a significant setup.sh change.

P2: Expand Dependency Challenge

This is the lowest-effort change and naturally pairs with P0.
Dockerfile Changes

Add the api-gateway's actual dependencies to the pip cache:
# Read api-gateway/requirements.txt and cache ALL needed packages
RUN pip3 download --dest /var/cache/pip-offline \
    --platform manylinux_2_17_x86_64 \
    --python-version 3.11 \
    --implementation cp \
    --abi cp311 \
    --only-binary=:all: \
    fastapi==0.104.1 uvicorn==0.24.0 \
    httpx sentry-sdk python-multipart \
    prometheus-client pydantic
The pip cache should contain enough packages to run the api-gateway, but the agent must:

Read api-gateway/requirements.txt to know what's needed
Discover which packages are available in the cache
Handle any missing packages (maybe mock them or find alternatives)

This adds a dependency resolution challenge that's independent of K8s skills.

Incremental Testing Strategy


      flowchart TD
    A["Current: 97.9% avg score"] --> B["Apply P0: Close stub app bypass"]
    B --> C{"Eval 8 runs biggie-nebula"}
    C -->|">70%"| D["Apply P1: Event-driven flow"]
    C -->|"30-70%"| E["Ship it ✅"]
    C -->|"<15%"| F["Relax app verification<br>(check fewer code patterns)"]
    D --> G{"Eval 8 runs biggie-nebula"}
    G -->|">70%"| H["Something is wrong —<br>investigate transcripts"]
    G -->|"30-70%"| E
    G -->|"<15%"| I["Use Option B (Actions)<br>or relax timing checks"]

    
      Loading

  
Key principle: Apply one change at a time and eval between each. P0 alone might be sufficient — it targets the exact exploit all 8 agents use.
File	Contents
diagnosis.md	Evidence from v110 transcripts: why current checks don't create difficulty
approach.md	Three independent failure mechanisms with projected outcomes
implementation-guide.md	Specific changes to grader.py, setup.sh, solution.sh, task.yaml
Change	Version	Result	Why
Remove pre-built FastAPI image	≤v103	No effect	Agents build stub images from pip cache
Add NetworkPolicy check	~v105	No effect	Same skill axis as other YAML
Add `/info` endpoint validation	~v107	No effect	Agents hardcode `"service": "bleater-preview"`
Add BUILD_MARKER verification	~v89	Slight effect initially	Agents clone repo → COPY . . → marker included for free
Rewrite task.yaml as story format	v104	0% pass rate	Removed too many implementation hints simultaneously
Add probes + resource limits check	Various	No effect	Standard K8s YAML pattern
Scenario	Real App	Events	Deps	Est. Score	Probability
Nails everything	✅	✅	✅	~1.0	~10-20%
Fails real app only	❌	✅	❌¹	~0.33	~15-25%
Fails events only	✅	❌	✅	~0.50	~10-15%
Fails deps only	✅	✅	❌	~0.67	~10-15%
Fails app + events	❌	❌	❌	~0.0-0.17	~10-15%
Partial on multiple	🔶	🔶	🔶	~0.3-0.5	~15-25%
Priority	Change	Expected Impact	Effort	Risk
P0	Close stub app bypass (grader verifies real code)	-40-60% pass rate	Medium	Low — solution.sh already has the right pattern, just needs to actually deploy real code
P1	Require event-driven flow	-20-30% pass rate	High	Medium — must ensure at least one event path works in the environment
P2	Expand pip cache + require real deps	-10-15% pass rate	Low	Low — additive to P0
Run	Score	Approach	App Type	Monitoring	Failed Check
1	1.0	Clone repo + write stub main.py	Stub	Polling (5s)	—
2	0.833	Clone repo + write stub main.py	Stub	Polling (10s)	cleanup_works (timing)
3	1.0	Clone repo + write stub main.py	Stub	Polling (5s)	—
4	1.0	Clone repo + write stub main.py	Stub	Polling (5s)	—
5	1.0	Clone repo + write stub main.py	Stub	Polling (5s)	—
6	1.0	Clone repo + write stub main.py	Stub	Polling (5s)	—
7	1.0	Clone repo + write stub main.py	Stub	Polling (5s)	—
8	1.0	Clone repo + write stub main.py	Stub	Polling (5s)	—