You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
v2 note: This revision is based on analysis of the actual v110 task files and full evaluation transcripts. The prior version (v1) was based solely on Discord thread context and contained incorrect assumptions about the pre-built image.
What We Found
We analyzed all 8 transcripts from v110. Every single passing agent follows the same playbook:
Discovers the environment — Gitea creds, Harbor registry, pip cache at /var/cache/pip-offline/
Writes a polling script — while true loop checking Gitea API every 5-15s
Clones the PR branch — Gets BUILD_MARKER.txt into the build context
Writes a ~30-line stub FastAPI app — Hardcoded /health and /info endpoints
Builds a Docker image from stub + pip cache — Uses pip install --no-index --find-links
The single failure (run 2, score 0.833) was a cleanup timing issue — 10s polling interval too slow for the 120s deletion timeout. No agent struggled with the actual task design.
The Core Problem
The task has two primary exploit vectors that make it trivially solvable:
Exploit 1: The Stub App Bypass
The grader checks /health and /info endpoint responses, but never verifies the deployed container runs actual bleater-app code. All 8 agents exploit this identically:
# What agents deploy (a 30-line stub):@app.get("/info")asyncdefinfo():
return {"service": "bleater-preview", "version": "preview"}
The grader's check ("bleater" in service_name.lower()) is trivially satisfied. The agent never needs to understand, build, or deploy the real api-gateway service.
Exploit 2: The Polling Shortcut
The task says "the system must run continuously and handle PRs as they come in" — agents interpret this as a bash while true loop polling Gitea every few seconds. This is:
A pattern LLMs are extremely comfortable with
Much simpler than real event-driven architecture (webhooks, Gitea Actions)
Not how production preview environments work
The Approach: Layered Complexity
Create difficulty across orthogonal skill axes so agents can independently succeed or fail at different challenges. See:
The current task has a single difficulty axis: "can the agent set up K8s YAML + a shell polling script?" We need orthogonal axes — challenges that exercise fundamentally different cognitive skills, so an agent can independently succeed or fail at each one.
The Three Axes
Axis 1: Real Application Deployment (closes the stub bypass)
Current state: Agents deploy a 30-line FastAPI stub. The grader never checks whether actual bleater-app code is running.
Change: The grader verifies the deployed container contains and runs actual bleater-app code — specifically the api-gateway service from the repository.
Skills required:
Navigate and understand the bleater-app repository structure
Identify that api-gateway/ is the service to deploy
Handle the api-gateway's actual dependencies (beyond just fastapi/uvicorn)
Configure enough environment variables for the app to start
Debug application startup failures
Why this is independently hard: This is codebase comprehension — a completely different skill from K8s YAML generation. The agent must read api-gateway/main.py, understand its imports and dependencies, figure out why it fails to start (missing env vars, missing shared modules), and fix the issues. Historical data shows agents frequently struggle with multi-file Python projects that have shared dependencies.
Gated checks: preview_accessible (app must serve real endpoints)
Axis 2: Event-Driven PR Handling (closes the polling shortcut)
Current state: All agents write while true; do curl gitea; sleep 5; done polling loops.
Change: Require webhook-based or Gitea Actions-based PR event handling. The grader either:
Verifies a Gitea webhook exists pointing to an agent-created receiver, OR
Verifies a .gitea/workflows/ file exists and ran, OR
Uses timing-based verification (preview must appear within a tight window that's faster than polling can achieve)
Skills required:
HTTP server design (listen on port, parse webhook JSON payload)
Gitea webhook configuration (API calls to register hooks)
Process lifecycle management (server must stay running, handle concurrent requests)
Why this is independently hard: Event-driven architecture requires fundamentally different thinking than imperative scripting. The agent must design a system that reacts to events rather than polls for changes. This involves understanding webhook delivery, request handling, and service networking — software architecture skills, not infrastructure configuration.
Gated checks: All PR lifecycle checks (entire flow depends on receiving events)
Current state: The pip cache at /var/cache/pip-offline/ has only fastapi and uvicorn — exactly what's needed for a stub app. If we require the real api-gateway, the agent needs more packages.
Change: The real api-gateway has dependencies beyond fastapi/uvicorn (httpx, sentry-sdk, prometheus-client, etc.). The pip cache should contain these packages, but the agent must:
Read the actual requirements.txt
Figure out which packages are available in the cache vs which need alternatives
Possibly modifying the application to work with available packages
Why this is independently hard: This is dependency resolution — a debugging/troubleshooting skill. The agent must iterate through error → fix → retry cycles, which is time-consuming and error-prone for LLMs.
Gated checks: preview_deploys (image must build successfully), preview_accessible (app must start)
Projected pass rate: ~10-20% (down from 97.9%)
Projected mean score: ~0.30-0.45 (down from ~0.98)
This is well below the <70% target, with room to tune up if needed.
Implementation Priority
Priority
Change
Expected Impact
Effort
Risk
P0
Close stub app bypass (grader verifies real code)
-40-60% pass rate
Medium
Low — solution.sh already has the right pattern, just needs to actually deploy real code
P1
Require event-driven flow
-20-30% pass rate
High
Medium — must ensure at least one event path works in the environment
P2
Expand pip cache + require real deps
-10-15% pass rate
Low
Low — additive to P0
Start with P0 alone, eval, and see where the pass rate lands. P0 might be sufficient by itself. P1 is the hardest to implement and highest risk of making the task unsolvable.
Footnotes
If real app deployment fails, some downstream checks also fail (correlated), which is acceptable as long as other axes remain independent. ↩
Once the agent has the polling script + K8s deployment pattern working, all 6 checks pass together. There's no path to scoring 0.3 or 0.5 — it's all-or-nothing (confirmed by the version history: scores cluster around 0.0 or 0.97+).
The Build Verification Bypass
The verify_pr_build() function in the grader checks three things:
Image isn't the raw base python image → Agent names it harbor.devops.local/bleater/preview:pr-N ✓
Image tag contains PR number → Agent includes pr-{PR_NUM} in the tag ✓
BUILD_MARKER.txt exists in container → Agent clones repo (COPY . . includes it) ✓
None of these verify that actual application code is running. The agent satisfies all three by:
Cloning the repo (gets BUILD_MARKER.txt)
Writing a stub main.py that overwrites/ignores the real app
Building with a custom tag
The Zero-Build Exploit (Run 4)
One agent (run 4) never calls docker build at all. When Kaniko fails (airgapped), it discovers that Harbor's REST API supports copying images between repositories. It creates bleater-preview/bleater-app:pr-N-SHA by copying library/python:3.11-slim via the Harbor API, then mounts code from ConfigMaps and the pip cache via hostPath volumes. The image is literally the unmodified base Python image with a new tag — and it scores 1.0.
This demonstrates that verify_pr_build checks tag format and marker file presence, but cannot detect whether the image was actually built from source.
The Polling Discovery
The task says "The system must run continuously and handle PRs as they come in." Every agent interprets this as:
whiletrue;do# Poll Gitea API for open PRs
OPEN_PRS=$(curl -s "${GITEA_API}/repos/root/bleater-app/pulls?state=open"| ...)forPRin$OPEN_PRS;do# Deploy if not already deployed
...
done# Check for closed PRs, cleanup
...
sleep 5
done
This is a well-known LLM pattern (bash polling loops appear frequently in training data). It requires no understanding of event-driven architecture, webhook payloads, or service lifecycle management.
Version History Pattern
The task's score history shows the cliff problem clearly:
v104: 0.0 (removed hints → agents can't start)
v105: 0.19 (added back minimal hints)
v106: 0.77 (found the sweet spot briefly)
v107: 0.88 (slightly more hints → back to easy)
v108: 0.87 (stable at too-easy)
v109: 0.96 (still too easy)
v110: 0.98 (nearly perfect)
There's no stable middle ground. The task is either impossible (missing hints about infrastructure) or trivially solvable (once the agent has enough context to start, it aces everything).
Specific changes to each task file, ordered by priority.
P0: Close the Stub App Bypass
This is the single highest-impact change. The grader must verify that the deployed container runs actual bleater-app code, not a 30-line FastAPI stub.
What Agents Currently Do
All 8 agents in v110 follow the same pattern:
Clone the PR branch (gets BUILD_MARKER.txt)
Write their own main.py with hardcoded /health and /info responses
Build an image with just fastapi + uvicorn from the pip cache
The real api-gateway code is present in the build context but completely ignored
grader.py Changes
Add a function that verifies the deployed container contains the actual api-gateway application, not a stub. The key insight is that the real api-gateway has characteristics a stub doesn't:
defverify_real_bleater_app(namespace, pr_number):
"""Verify the deployed container runs actual bleater-app code, not a stub."""code, pod_name, _=run_cmd(
f"kubectl get pod -n {namespace} -o jsonpath='{{.items[0].metadata.name}}'"
)
ifcode!=0ornotpod_name:
returnFalse, "Cannot find pod to verify application code"# Check 1: The real api-gateway imports shared modules# (a stub app won't have these)code, out, _=run_cmd(
f"kubectl exec -n {namespace}{pod_name} -- "f"python3 -c \"import importlib.util; "f"print(importlib.util.find_spec('shared') is not None or "f"importlib.util.find_spec('shared.auth') is not None)\"",
timeout=15
)
has_shared=code==0and"True"inout# Check 2: The real api-gateway defines service routing# (it proxies to authentication-service, bleat-service, etc.)code, out, _=run_cmd(
f"kubectl exec -n {namespace}{pod_name} -- "f"grep -rl 'authentication-service\\|bleat-service\\|profile-service' /app/ 2>/dev/null | head -1",
timeout=15
)
has_service_routing=code==0andout.strip() !=""# Check 3: The app exposes more than just /health and /info# (real api-gateway has /api/v1/* routes)code, out, _=run_cmd(
f"kubectl exec -n {namespace}{pod_name} -- "f"grep -rl 'api/v1\\|/bleats\\|/users\\|/auth' /app/ 2>/dev/null | head -1",
timeout=15
)
has_api_routes=code==0andout.strip() !=""ifnot (has_sharedorhas_service_routingorhas_api_routes):
returnFalse, (
"Deployed app appears to be a stub, not the actual bleater api-gateway. ""The preview must deploy the real application code from the repository."
)
returnTrue, "Verified actual bleater-app code is deployed"
Then integrate into the grading flow after verify_pr_build():
# 5d. Verify actual bleater-app code (not a stub)real_app_ok=Falsesuccess, msg=verify_real_bleater_app(ns, pr_num)
ifsuccess:
feedback_parts.append(f"✓ {msg}")
real_app_ok=Trueelse:
feedback_parts.append(f"✗ {msg}")
# preview_deploys now requires namespace resources + PR build + real appifns_resources_okandpr_build_okandreal_app_ok:
subscores["preview_deploys"] =1.0
Calibration note: Test this against the actual bleater-app api-gateway running in a container. The checks should detect patterns present in the real code but absent from any reasonable stub. If the real api-gateway's shared module isn't importable standalone (it may need database connections), use the file-based checks (grep) instead of import checks.
solution.sh Changes
The current solution.sh (line 73) writes a stub main.py. Replace this with deploying the actual api-gateway:
# Instead of writing a stub main.py, use the actual api-gateway code# The key change: DON'T overwrite with a stub. Deploy the real app.
cat > Dockerfile.preview <<'DOCKERFILE'FROM harbor.devops.local/library/python:3.11-slimWORKDIR /app# Install dependencies from offline cacheCOPY pip-offline /tmp/pip-offlineRUN pip install --no-index --find-links=/tmp/pip-offline \ fastapi uvicorn httpx python-multipart && \ rm -rf /tmp/pip-offline# Copy the actual api-gateway code from the repoCOPY api-gateway/ /app/COPY shared/ /app/shared/# Copy BUILD_MARKERCOPY BUILD_MARKER.txt /app/BUILD_MARKER.txt# The real api-gateway needs these env vars to start# (it will fail to connect to backends, but that's OK —# the health endpoint should still work)ENV PYTHONPATH="/app"ENV ENVIRONMENT="preview"ENV AUTH_SERVICE_URL="http://localhost:8001"ENV BLEAT_SERVICE_URL="http://localhost:8003"ENV PROFILE_SERVICE_URL="http://localhost:8002"EXPOSE 8080CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]DOCKERFILE
Important: You'll need to test whether the real api-gateway starts with dummy backend URLs. It may need a small wrapper or environment configuration to handle missing backends gracefully. The /health and /info endpoints should work without backends; API routes that proxy to other services can return errors — that's fine.
Dockerfile (task Dockerfile) Changes
Expand the pip cache to include the api-gateway's actual dependencies:
RUN mkdir -p /var/cache/pip-offline && \
pip3 download --dest /var/cache/pip-offline \
--platform manylinux_2_17_x86_64 \
--python-version 3.11 \
--implementation cp \
--abi cp311 \
--only-binary=:all: \
fastapi==0.104.1 uvicorn==0.24.0 \
httpx python-multipart pydantic \
# Add whatever the real api-gateway needs:# Check api-gateway/requirements.txt for the full list
Note: Some packages may not have binary wheels for the specified platform. You may need to also download pure-python packages with --no-binary=:all: as a fallback, or pin specific versions known to have wheels available.
task.yaml Changes
Remove the explicit pip cache path. The current Constraints section includes:
A base Python image is available at harbor.devops.local/library/python:3.11-slim.
And setup.sh leaves a README at /var/cache/pip-offline/. But transcript analysis shows every agent reads the pip cache location from the prompt within their first 3-5 messages — there's zero discovery challenge. Change to:
# Current:# * This is an airgapped environment with no internet access. All required# container images and dependencies must be sourced from local registries# and caches. A base Python image is available at# `harbor.devops.local/library/python:3.11-slim`.# Proposed:# * This is an airgapped environment with no internet access. All required# container images and dependencies must be sourced from local registries# and caches.
Keep the Harbor registry URL (it's in the [Context] section and agents need it for docker push), but remove the specific base image path. The agent should discover what images are available in Harbor.
Also optionally make the /info hint slightly more specific about deploying real code:
# Current:# * `/info`: should return HTTP 200 with JSON metadata about the application,# including at least `service` and `version` fields.# The `service` value must identify a Bleater service.# Consider changing to:# * `/info`: should return HTTP 200 with JSON metadata from the deployed# application. The response must reflect the actual service running# (e.g., the api-gateway's real service identity and version).
This doesn't reveal the grading criteria but makes "deploy the real app" slightly more implied.
P1: Require Event-Driven PR Handling
Only implement if P0 doesn't get the pass rate low enough on its own.
Option A: Gitea Webhooks (Recommended)
setup.sh Changes
Pre-register a webhook in Gitea that points to a URL the agent must implement:
# Wait for Gitea API to be ready, then register webhook
GITEA_TOKEN=$(curl -s -X POST "http://root:Admin%40123456@gitea.devops.local/api/v1/users/root/tokens" \ -H "Content-Type: application/json" \ -d '{"name": "webhook-setup"}'| jq -r '.sha1')
curl -s -X POST "http://gitea.devops.local/api/v1/repos/root/bleater-app/hooks" \
-H "Authorization: token ${GITEA_TOKEN}" \
-H "Content-Type: application/json" \
-d '{ "type": "gitea", "config": { "url": "http://preview-controller.default.svc.cluster.local:9000/webhook", "content_type": "json" }, "events": ["pull_request"], "active": true }'
grader.py Changes
Add a timing-based check that the preview appeared fast enough to be webhook-driven:
defcheck_event_driven(pr_created_time, ns_appeared_time):
"""Verify preview creation was event-driven, not polling."""response_time=ns_appeared_time-pr_created_time# Webhook should trigger within ~15s (network + build time)# A 5-second polling loop would take 5s average + build time# A 10-second loop would take 10s average + build time# We can't reliably distinguish polling from webhooks by timing alone,# so also check for webhook infrastructure:# Check if a webhook receiver service/pod existscode, out, _=run_cmd(
"kubectl get pod -A -l app=preview-controller --no-headers 2>/dev/null || ""kubectl get pod -A -l app=webhook-receiver --no-headers 2>/dev/null || ""kubectl get pod -A -l component=preview-webhook --no-headers 2>/dev/null"
)
has_webhook_pod=code==0andout.strip() !=""# Check Gitea webhookstry:
resp=requests.get(
f"{GITEA_API}/repos/root/bleater-app/hooks",
auth=GITEA_AUTH,
timeout=10
)
hooks=resp.json() ifresp.status_code==200else []
has_active_webhook=any(
h.get("active") and"pull_request"inh.get("events", [])
forhinhooks
)
except:
has_active_webhook=Falseifnothas_active_webhook:
returnFalse, "No active pull_request webhook found in Gitea"returnTrue, "Event-driven webhook flow verified"
Important: Timing-based checks alone are unreliable. Combine with infrastructure checks (webhook exists + receiver pod exists). The agent must both register a webhook AND have something listening for it.
Option B: Gitea Actions (Simpler Alternative)
If webhook infrastructure is too complex to set up reliably, require a .gitea/workflows/preview.yaml file instead. This is still event-driven but uses a framework.
Warning: The Gitea Actions runner uses catthehacker/ubuntu:act-latest which isn't available in the airgapped environment. If going this route, setup.sh must pre-push a working runner image to Harbor and reconfigure the runner. This is a significant setup.sh change.
P2: Expand Dependency Challenge
This is the lowest-effort change and naturally pairs with P0.
Dockerfile Changes
Add the api-gateway's actual dependencies to the pip cache:
The pip cache should contain enough packages to run the api-gateway, but the agent must:
Read api-gateway/requirements.txt to know what's needed
Discover which packages are available in the cache
Handle any missing packages (maybe mock them or find alternatives)
This adds a dependency resolution challenge that's independent of K8s skills.
Incremental Testing Strategy
flowchart TD
A["Current: 97.9% avg score"] --> B["Apply P0: Close stub app bypass"]
B --> C{"Eval 8 runs biggie-nebula"}
C -->|">70%"| D["Apply P1: Event-driven flow"]
C -->|"30-70%"| E["Ship it ✅"]
C -->|"<15%"| F["Relax app verification<br>(check fewer code patterns)"]
D --> G{"Eval 8 runs biggie-nebula"}
G -->|">70%"| H["Something is wrong —<br>investigate transcripts"]
G -->|"30-70%"| E
G -->|"<15%"| I["Use Option B (Actions)<br>or relax timing checks"]
Loading
Key principle: Apply one change at a time and eval between each. P0 alone might be sufficient — it targets the exact exploit all 8 agents use.