@arubis
Last active March 12, 2026 00:49
Addressing "undisclosed spec" issue from nebula-reviewer

Are the "undisclosed spec" findings real? Yes — here's the evidence

Review feedback for Redis Cluster Slot Migration Deadlock (f925de8b, v70). The author asserted that the reviewer bot's findings about undisclosed requirements were false and had no impact on the solution. We re-examined the grader, the environment, and all 10 eval transcripts.

A full task review with per-check breakdown and score analysis is also available.


"The bot's recommendations about undisclosed specs are false"

The bot was right. `check_fanout_no_redis_errors` doesn't just verify the app works with Redis Cluster; it requires specific `[fanout] ADD` and `[fanout] DEL` log strings in Loki (grader.py line 204):

```python
success_query = '''count_over_time({app="fanout-service"} |= `[fanout] ADD` or `[fanout] DEL` [5m])'''
```

These strings don't exist anywhere the agent can find them:

| Source | Contains `[fanout] ADD`/`DEL`? |
| --- | --- |
| Original fanout-service code (`app/fanout-service/app/main.py`) | No |
| `task.yaml` | No |
| Gitea issue (created by `create_gitea_issue()` in `setup.sh`) | No |
| Any wiki, configmap, or annotation in the environment | No |
| `solution.sh`'s embedded `main.py` | Yes, but the agent never sees this |

The solution passes because solution.sh writes a main.py that includes those print statements. But an agent approaching this task has no information telling it to add those specific log lines. This isn't a matter of the requirement being "discoverable through expertise" — it's a literal string match against an undocumented format.
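For concreteness, here is a sketch of the kind of print statements the grader matches on. Only the `[fanout] ADD`/`[fanout] DEL` prefixes come from the grader's query; the function names and message tails are illustrative:

```python
# Hypothetical sketch: only log lines carrying these exact prefixes can
# satisfy the grader's success_query.
def fanout_add(channel: str, member: str) -> None:
    # The literal "[fanout] ADD" prefix is what the grader string-matches.
    print(f"[fanout] ADD {channel} -> {member}")

def fanout_del(channel: str, member: str) -> None:
    print(f"[fanout] DEL {channel} -> {member}")

# An agent's equally correct rewrite might log the same operation as:
def fanout_add_typical(channel: str, member: str) -> None:
    print(f"added {member} to {channel}")  # fine in production, fails the check
```

Nothing in the environment distinguishes the first form from the second; both are reasonable logging choices for the same operation.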

Evidence from 10 transcripts: Every agent rewrites the fanout code to use RedisCluster. Every agent patches the configmap to REDIS_USE_CLUSTER=true. 9/10 agents successfully push a new image to Harbor. They're doing exactly what the task asks. But 0/10 include [fanout] ADD/DEL prints, because nothing tells them to.
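The client change the agents make is itself straightforward, which underlines that these failures are not about capability. A minimal sketch of the switch, assuming redis-py and the `REDIS_USE_CLUSTER` flag from the configmap (the helper and host names are placeholders, not from the repo):

```python
import os

def make_client():
    """Return a Redis client matching the deployed topology.

    Hypothetical helper: the real service (app/fanout-service/app/main.py)
    wires this differently; host names here are placeholders.
    """
    if os.environ.get("REDIS_USE_CLUSTER") == "true":
        # Cluster-aware client: follows MOVED redirects across nodes.
        from redis.cluster import RedisCluster  # requires redis-py >= 4.1
        return RedisCluster(host="redis-cluster", port=6379)
    # Single-node client: raises MOVED/CLUSTERDOWN errors against a cluster.
    import redis
    return redis.Redis(host="redis", port=6379)
```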


On the stated purpose of the two checks

The author described the intent as:

  1. Verify that the application is updated to support the new multi-node topology.
  2. Verify that a new image is built and served properly.

Both are valid goals. The problem is that FanoutNoRedisErrors doesn't test goal #1 the way it appears to. It tests two things:

```mermaid
flowchart TD
    CHECK["check_fanout_no_redis_errors()"]
    CHECK --> Q1["1. Error query:<br/>No MOVED / CLUSTERDOWN<br/>in Loki?"]
    CHECK --> Q2["2. Success query:<br/>[fanout] ADD or DEL<br/>in Loki?"]
    Q1 -->|"Tests cluster-mode<br/>correctness"| V1["✅ Validates goal #1"]
    Q2 -->|"Tests a specific<br/>undocumented log format"| V2["❌ Not discoverable"]
    style Q1 fill:#d4edda,stroke:#28a745
    style Q2 fill:#f8d7da,stroke:#dc3545
    style V1 fill:#d4edda,stroke:#28a745
    style V2 fill:#f8d7da,stroke:#dc3545
```

An agent that updates the code to RedisCluster, builds a v2 image, pushes to Harbor, deploys it, and has it processing messages cleanly still fails — because its code logs operations differently (or doesn't log them at all, matching the original code's behavior).

The error-absence check (part 1) is sufficient to verify the stated goal. The success-log check (part 2) is the undisclosed part.
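Side by side, the two parts look like this. The `success_query` literal is quoted from grader.py line 204; the `error_query` shape is an assumption reconstructed from the MOVED/CLUSTERDOWN description, not copied from the grader:

```python
# Part 1: discoverable. Any correct cluster-mode rewrite makes this pass,
# because a properly migrated client stops emitting these errors.
error_query = 'count_over_time({app="fanout-service"} |~ `MOVED|CLUSTERDOWN` [5m])'

# Part 2: undisclosed. Passes only if the agent guessed an exact log format.
success_query = (
    'count_over_time({app="fanout-service"}'
    ' |= `[fanout] ADD` or `[fanout] DEL` [5m])'
)
```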

For goal #2, FanoutImageIntegrity is gated on FanoutNoRedisErrors, so it's never evaluated against real agent state. If the gate were removed, it would pass in 9/10 runs — providing real signal. As-is, it's dead weight.
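The gate can be sketched as follows. The check names are from the grader; the runner function and its control flow are a reconstruction, not copied code:

```python
def run_checks(no_redis_errors_check, image_integrity_check):
    """Hypothetical runner illustrating the gate described above."""
    results = {"FanoutNoRedisErrors": no_redis_errors_check()}
    # Current behavior: FanoutImageIntegrity only runs if the log check
    # passed, so the 9/10 successful Harbor pushes never register as signal.
    if results["FanoutNoRedisErrors"]:
        results["FanoutImageIntegrity"] = image_integrity_check()
    return results
```

Removing the gate means calling `image_integrity_check()` unconditionally, so the Harbor-push outcome is recorded regardless of the log-pattern result.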


On the solution passing proving the task is correct

The solution passes because it was written to match the grader. That's tautological — it doesn't prove the grader's requirements are discoverable.

The question isn't "can code be written that passes?" but "can an agent, given only the information in the environment, determine what the grader expects?"

The agent CAN determine it needs to:

  • Fix Redis slots
  • Update the fanout client to cluster mode
  • Build a new image and push to Harbor
  • Add Grafana panels

All of that is clear from the task context, and all 10 agents attempt all of it.

What the agent CANNOT determine is that the grader also wants [fanout] ADD and [fanout] DEL in the log output. That requirement exists only in the grader and in the solution.


On "application code changes are discouraged"

The reviewer noted: "Generally application code layer changes are discouraged since we are strictly keeping things devops."

This is a separate but related concern. The task requires rewriting Python application code (replacing redis.Redis() with RedisCluster()), building Docker images, and pushing to a registry. Whether that crosses the line from "DevOps" into "application development" is a judgment call — but it's worth noting that the task's difficulty is split between infrastructure work (Redis cluster operations, monitoring) and application work (code rewrite + image build). The infrastructure half works well and has genuine variance. The application half is where the grader issues live.


What needs to change

The author's intent for these checks is sound. The implementation just has a gap between what the checks verify and what they claim to verify.

| Change | Why |
| --- | --- |
| Remove grader.py lines 204–241 (the `success_query` for `[fanout] ADD`/`DEL`) | The error-absence check on lines 161–202 already proves the fanout service works with Redis Cluster. The log-format check is undisclosed. |
| Remove the gate on `FanoutImageIntegrity`, or test it independently | 9/10 agents push to Harbor successfully. This check would provide real signal if it weren't blocked by the log-pattern failure. |

After these two changes, the checks would do exactly what was described: verify the app works with the cluster, and verify the image is built and served from Harbor. No new requirements — just aligning the grader with the stated intent.


