Task: Ephemeral Debug Containers (a9b57469-d16d-4430-9d32-dcb2caea6be4)
Reverting task.yaml to v11 style will help (agreed), but the Gitea issue itself is also a factor. Right now it's a complete specification in a single document -- exact image destination, exact tool list, exact role name, exact ServiceAccount, exact permissions, exact LimitRange values, and exact guidance on how to handle legacy RBAC. Once the agent reads it, the task becomes a checklist with nothing left to discover or infer.
All 8 v13 eval runs follow an identical arc with zero strategic divergence: read task.yaml -> find Gitea issue -> ls /opt/apk-cache -> find Kaniko in Harbor -> build -> RBAC -> done.
Distribute requirements across 2-3 Gitea issues (and optionally a wiki page) that the agent must discover and synthesize. This is more realistic -- in real organizations, requirements come from multiple teams and the engineer has to piece them together.
Voice: Team lead / engineering manager. Problem statement, not a solution spec.
Contains:
- The backstory: dev team can't debug production pods, images are stripped down, current exec workaround is a security problem
- The high-level goal: adopt Kubernetes ephemeral debug containers so devs can troubleshoot without exec access
- A mention that the debug image needs to go into Harbor (and the specific names -- library/debug-tools for the image, developer-debugger for the role -- since the grader requires these exact names)
- A note that this is air-gapped, everything must use what's already in the cluster
- Cross-references: "Security filed their requirements in #2" and "The dev team's tool wishlist is in #3"
Does NOT contain: exact tool list, exact LimitRange values, APK cache location, or step-by-step RBAC instructions.
Voice: Security engineer. Audit finding.
Contains:
- Flags that developer-test ServiceAccount currently has exec access through legacy RBAC
- States the security requirements: no exec, no pod deletion, no deployment modifications
- Mentions that existing bindings can't just be deleted because other teams depend on them -- the permissions need to be tightened instead
- May reference issue #1 as the remediation path
Does NOT contain: the role name, exact RBAC manifests, or anything about the debug image.
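For whoever maintains the task, it may help to pin down what the agent is expected to derive from this issue. The sketch below is for the reviewer and would not go in the issue text. It builds one plausible developer-debugger Role as a plain dict and checks it against the three stated prohibitions. The rule set (patch on pods/ephemeralcontainers, create on pods/attach, read-only pods and deployments), the `namespace` parameter, and the helper names are assumptions for illustration, not the contents of solution.sh or the logic in grader.py.

```python
# Hypothetical sketch: the rules below are an assumed shape for the
# developer-debugger Role, not the task's reference solution.

WRITE_VERBS = {"create", "update", "patch", "delete", "deletecollection"}


def build_debugger_role(namespace: str) -> dict:
    """Return a Role manifest (as a dict) that permits ephemeral debugging only."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "developer-debugger", "namespace": namespace},
        "rules": [
            # Read-only access so developers can find the pod to debug.
            {"apiGroups": [""], "resources": ["pods"],
             "verbs": ["get", "list", "watch"]},
            # Adding an ephemeral container is a patch on this subresource.
            {"apiGroups": [""], "resources": ["pods/ephemeralcontainers"],
             "verbs": ["patch"]},
            # Attaching to the debug container once it is running.
            {"apiGroups": [""], "resources": ["pods/attach"],
             "verbs": ["create"]},
            # Deployments stay read-only.
            {"apiGroups": ["apps"], "resources": ["deployments"],
             "verbs": ["get", "list"]},
        ],
    }


def violations(role: dict) -> list[str]:
    """List which security requirements (no exec, no pod deletion,
    no deployment modifications) a Role's rules would break."""
    found = []
    for rule in role["rules"]:
        resources, verbs = set(rule["resources"]), set(rule["verbs"])
        any_res, any_verb = "*" in resources, "*" in verbs
        if any_res or "pods/exec" in resources:
            found.append("exec")
        if (any_res or "pods" in resources) and (
            any_verb or verbs & {"delete", "deletecollection"}
        ):
            found.append("pod deletion")
        if (any_res or "deployments" in resources) and (
            any_verb or verbs & WRITE_VERBS
        ):
            found.append("deployment modification")
    return found


if __name__ == "__main__":
    # "bleater" is a placeholder namespace.
    print(violations(build_debugger_role("bleater")))
```

A check like this is deliberately conservative: any rule that names pods/exec is flagged regardless of verb.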
Voice: Developer. Feature request.
Contains:
- The list of tools: curl, wget, netcat, dig, jq, strace, tcpdump, vim, procps, bash
- Context about why they need these (network debugging, process inspection, etc.)
- A note that they tried installing tools at runtime but it doesn't work because the environment is air-gapped
Does NOT contain: where to put the image, how to build it, RBAC details, or resource limits.
Contains:
- The platform team's standard resource defaults for debug/ephemeral workloads: CPU request 100m / limit 200m, memory request 128Mi / limit 256Mi
- Could live on the Bleater wiki, as a comment on issue #1 from a platform engineer, or as a separate "Platform standards" issue
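As a reference for the task author, the defaults above map onto a LimitRange roughly as sketched below. The object name, the `namespace` parameter, and the helper name are hypothetical; only the four resource values come from the text. In a LimitRange, `defaultRequest` carries the default requests and `default` carries the default limits.

```python
import json


def build_debug_limitrange(namespace: str,
                           name: str = "debug-container-defaults") -> dict:
    """Return a LimitRange manifest (as a dict) with the platform defaults.

    `name` and `namespace` are placeholders; the CPU and memory values
    are the ones quoted in the platform standards.
    """
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "limits": [
                {
                    "type": "Container",
                    # Applied when a container declares no requests.
                    "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
                    # Applied when a container declares no limits.
                    "default": {"cpu": "200m", "memory": "256Mi"},
                }
            ]
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so this output could be piped to
    # `kubectl apply -f -`. "bleater" is a placeholder namespace.
    print(json.dumps(build_debug_limitrange("bleater"), indent=2))
```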
The idea is that each source gives the agent a piece of the puzzle, but no single source gives the complete checklist. The agent has to find them all and synthesize.
grader.py, solution.sh, and Dockerfile shouldn't need changes. After updating, run 8 biggie-nebula evals to confirm the pass rate is below 70%.
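Assuming the 70% figure refers to the fraction of runs that pass, the check over 8 runs reduces to simple arithmetic: 5 passes is 62.5% and 6 passes is 75%, so the target is met only with 5 or fewer passing runs. A minimal sketch, with hypothetical function names and a made-up result list:

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of runs that passed."""
    return sum(results) / len(results)


def max_passes_below(n_runs: int, threshold: float) -> int:
    """Largest number of passing runs that keeps the rate strictly below threshold."""
    for passes in range(n_runs, -1, -1):
        if passes / n_runs < threshold:
            return passes
    raise ValueError("no pass count is below the threshold")


if __name__ == "__main__":
    # Illustrative outcomes only, not real eval results.
    example = [True, True, False, True, False, True, True, False]
    print(f"pass rate: {pass_rate(example):.1%}")
    print(f"max passes allowed out of 8: {max_passes_below(8, 0.70)}")
```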