| Task UUID | f925de8b-6df4-4867-b4da-6ff4e1012a8a |
| Version | v70 (2026-03-11) |
| Eval model | biggie-nebula, 10 runs |
| Solution test | PASSES (1.0) |
Mean: 0.54 across 10 runs.
All runs land at exactly 0.4 or 0.6 — a bimodal distribution with no other values:
Run 1:  ████░░░░░░ 0.4
Run 2:  ██████░░░░ 0.6
Run 3:  ██████░░░░ 0.6
Run 4:  ████░░░░░░ 0.4
Run 5:  ██████░░░░ 0.6
Run 6:  ██████░░░░ 0.6
Run 7:  ████░░░░░░ 0.4
Run 8:  ██████░░░░ 0.6
Run 9:  ██████░░░░ 0.6
Run 10: ██████░░░░ 0.6
While 0.54 is below the 0.70 threshold, 0.4 of the total score is unachievable by any agent due to a grader issue. The effective score ceiling is 0.6.
| Check | Pass Rate | Variance | Verdict |
|---|---|---|---|
| AllSlotsCovered | 10/10 | None | Always passes — no difficulty signal |
| GrafanaRedisDashboard | 10/10 | None | Always passes — no difficulty signal |
| NoStuckSlots | 7/10 | Yes | Genuine difficulty (see below) |
| FanoutNoRedisErrors | 0/10 | None | Grader bug — undisclosed requirement |
| FanoutImageIntegrity | 0/10 | None | Gated on FanoutNoRedisErrors — never evaluated |
Only 1 of 5 subscores shows any variance across 10 runs.
```mermaid
---
config:
  theme: default
---
flowchart LR
    subgraph pass ["✅ Always Pass"]
        A["AllSlotsCovered<br/>10/10"]
        B["GrafanaRedisDashboard<br/>10/10"]
    end
    subgraph vary ["⚡ Genuine Variance"]
        C["NoStuckSlots<br/>7/10"]
    end
    subgraph fail ["❌ Always Fail"]
        D["FanoutNoRedisErrors<br/>0/10"]
        E["FanoutImageIntegrity<br/>0/10"]
    end
    style pass fill:#d4edda,stroke:#28a745
    style vary fill:#fff3cd,stroke:#ffc107
    style fail fill:#f8d7da,stroke:#dc3545
```
`check_fanout_no_redis_errors` has two sub-conditions:
- No Redis cluster errors in Loki (no MOVED / CLUSTERDOWN / ResponseError) — this is reasonable
- Presence of `[fanout] ADD` or `[fanout] DEL` in Loki logs — this is the problem
Those log patterns exist only in solution.sh's embedded main.py. They are not present in:
- The original fanout-service source code (`app/fanout-service/app/main.py` — uses a silent `redis_client.lpush()` with no logging)
- The task.yaml prompt
- The Gitea issue created by setup.sh
Across all 10 transcripts, every agent rewrites the fanout code to use RedisCluster and every agent attempts to build and push the image. But zero agents add [fanout] ADD/DEL print statements, because nothing in the environment tells them to.
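To make the failure mode concrete, here is a minimal sketch of the mismatch. The regex is a plausible reconstruction of the success-condition pattern described above, not the grader's actual code, and the sample log lines are invented for illustration:

```python
import re

# Hypothetical reconstruction of the success_query condition: it only
# passes if some log line carries a "[fanout] ADD" or "[fanout] DEL" prefix.
SUCCESS_PATTERN = re.compile(r"\[fanout\] (ADD|DEL)")

def success_query_matches(log_lines):
    """Return True if any line carries the required [fanout] prefix."""
    return any(SUCCESS_PATTERN.search(line) for line in log_lines)

# Logs a typical agent-rewritten service emits: the original code uses a
# silent redis_client.lpush(), so nothing ever matches the pattern.
agent_logs = [
    "connecting to redis cluster",
    "fanout worker started",
]

# Logs only solution.sh's embedded main.py would emit.
solution_logs = ["[fanout] ADD user:42", "[fanout] DEL user:17"]

assert not success_query_matches(agent_logs)   # every agent run fails
assert success_query_matches(solution_logs)    # only the reference solution passes
```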
```mermaid
flowchart LR
    A["FanoutNoRedisErrors<br/>(0.2 weight)"] -->|"gates"| B["FanoutImageIntegrity<br/>(0.2 weight)"]
    A -->|"always fails<br/>(undisclosed log pattern)"| X["❌ 0/10"]
    B -->|"auto-skipped"| Y["❌ 0/10"]
    style A fill:#fee,stroke:#c00
    style B fill:#fee,stroke:#c00
    style X fill:#fcc,stroke:#c00
    style Y fill:#fcc,stroke:#c00
```
Because FanoutImageIntegrity is gated on FanoutNoRedisErrors, agents lose 0.4 points (two checks) for what is effectively one undisclosed requirement. The maximum achievable score for any agent is 0.6.
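The arithmetic behind the 0.6 ceiling can be sketched as follows. The five-check, 0.2-weight structure comes from the tables above; the exact gating logic is an assumption about how the grader composes subscores:

```python
# Sketch of the implied scoring: five checks at 0.2 weight each, with
# FanoutImageIntegrity gated on FanoutNoRedisErrors (assumed behavior).
WEIGHT = 0.2

def score(results):
    """results: dict of check name -> bool. Returns the weighted total."""
    gated = dict(results)
    # The gated check is skipped (scored 0) unless its gate passes.
    if not results.get("FanoutNoRedisErrors", False):
        gated["FanoutImageIntegrity"] = False
    return round(sum(WEIGHT for passed in gated.values() if passed), 2)

# Best possible agent run: everything passes except the undisclosed
# log-pattern check, which drags its gated partner down with it.
best_run = {
    "AllSlotsCovered": True,
    "GrafanaRedisDashboard": True,
    "NoStuckSlots": True,
    "FanoutNoRedisErrors": False,  # undisclosed log format
    "FanoutImageIntegrity": True,  # never evaluated
}
assert score(best_run) == 0.6  # the effective ceiling
```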
Remove the `success_query` (grader.py lines 204–241). The error-absence check alone already proves the fanout service is operating correctly against the Redis Cluster:

```python
# CURRENT (lines 161–243): two Loki queries
# 1. error_query   — checks for MOVED/CLUSTERDOWN  ✅ keep this
# 2. success_query — checks for [fanout] ADD/DEL   ❌ remove this

# AFTER FIX: only the error-absence check remains
# If no Redis errors are detected in Loki over the polling window,
# the fanout service is working correctly against the cluster.
```

Alternative: If you want to keep the log-based verification, add the required format to the Gitea issue body in `create_gitea_issue()`:
"The updated fanout service must log processed operations with
[fanout] ADDand[fanout] DELprefixes to stderr for observability."
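Under the first (recommended) option, the surviving check reduces to an error-absence test. A minimal sketch, with the Loki-fetch step stubbed out since the grader's query code isn't reproduced here, and the error strings taken from the list above:

```python
import re

# Error-absence check only: the fix recommended above. The pattern covers
# the cluster errors named in the review (MOVED / CLUSTERDOWN / ResponseError).
ERROR_PATTERN = re.compile(r"MOVED|CLUSTERDOWN|ResponseError")

def check_fanout_no_redis_errors(log_lines):
    """Pass iff no Redis cluster errors appear over the polling window."""
    return not any(ERROR_PATTERN.search(line) for line in log_lines)

# Invented sample logs for illustration:
healthy = ["fanout worker started", "pushed 120 ops"]
broken = ["redis.exceptions.ResponseError: MOVED 6000 10.0.0.3:6379"]

assert check_fanout_no_redis_errors(healthy)
assert not check_fanout_no_redis_errors(broken)
```

With no second sub-condition, agents who fix the cluster errors pass regardless of how (or whether) their rewritten service logs.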
The two-batch chaos injection is the best part of this task:
- Batch A (node-0 → node-1): slots 100, 500, 1000 — MIGRATING + DELSLOTS creates coverage gaps
- Batch B (node-1 → node-2): slots 6000, 8000 — MIGRATING + IMPORTING markers only
After fixing Batch A, cluster_slots_ok returns 16384 — a misleading "all clear" signal. Three agents (runs 1, 4, 7) trusted this and moved on, never checking node-1 where Batch B's markers live. The seven passing agents queried all three nodes early and caught both batches.
This is a real SRE skill: verifying cluster state comprehensively rather than trusting a single node's perspective. The 70/30 split is a healthy difficulty gradient — keep this as-is.
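The check-every-node discipline can be sketched concretely. Stuck MIGRATING/IMPORTING markers appear only in the owning node's own `CLUSTER NODES` view, as `[slot->-<id>]` / `[slot-<-<id>]` suffixes; the node lines below are simplified, invented examples:

```python
import re

# Migration markers in CLUSTER NODES output: "[<slot>->-<id>]" (migrating)
# or "[<slot>-<-<id>]" (importing).
MARKER = re.compile(r"\[(\d+)-([<>])-")

def stuck_slots(cluster_nodes_output):
    """Return slot numbers with migration markers in one node's view."""
    return [int(m.group(1)) for m in MARKER.finditer(cluster_nodes_output)]

# Simplified per-node views after Batch A is fixed (node IDs elided):
node0_view = "abc 10.0.0.1:6379 myself,master - 0 0 1 connected 0-5460"
node1_view = ("def 10.0.0.2:6379 myself,master - 0 0 2 connected "
              "5461-10922 [6000->-ghi] [8000->-ghi]")

# Node 0 looks clean; only querying node 1 reveals Batch B's markers.
assert stuck_slots(node0_view) == []
assert stuck_slots(node1_view) == [6000, 8000]
```

This mirrors what separated the seven passing runs from the three failing ones: iterating the same check over every node rather than stopping at the first clean answer.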
Every agent finds the 12 required panel titles from the Gitea issue and correctly patches the configmap. Every agent achieves full slot coverage. These checks validate that the baseline task is clear and achievable, even though they don't contribute to difficulty differentiation.
Once you fix the log-pattern issue, you'll need to decide what to do with the gate. Be aware that even with the gate removed, roughly half the runs would still fail FanoutImageIntegrity because agents import images to containerd but don't push to Harbor (Docker daemon can't resolve harbor.devops.local — agents must discover ctr images push --plain-http). That's arguably genuine difficulty, but worth verifying after the fix.
The 10 API transcripts show consistent chaos injection — the setup appears to work. The "passed/failed simultaneously" issue you spent 65+ iterations on may be a race condition specific to hosted validation timing rather than a bug in the function itself. If it persists after the grader fix, consider adding a brief sleep + state verification after the SETSLOT commands to ensure markers have propagated before the script exits.
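If you add that safeguard, poll-until-visible is more robust than a fixed sleep. A generic sketch, where `check_markers_visible` is a placeholder for whatever state query the setup script would use (e.g. parsing CLUSTER NODES on the target node):

```python
import time

def wait_for(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one final check at the deadline

# Stand-in condition that becomes true on the third poll, simulating
# markers that take a moment to propagate after SETSLOT:
calls = {"n": 0}
def check_markers_visible():
    calls["n"] += 1
    return calls["n"] >= 3

assert wait_for(check_markers_visible, timeout=5.0, interval=0.01)
```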
With AllSlotsCovered and GrafanaRedisDashboard both at 100% pass rate, 0.4 of the score is effectively free. After fixing the fanout checks, you may find the mean shifts significantly. Re-evaluate whether the check weights need adjustment once you have fresh eval data.
| # | Action | Priority |
|---|---|---|
| 1 | Remove success_query from FanoutNoRedisErrors (or document the log format in the Gitea issue) | Blocking |
| 2 | Decide whether to keep / remove the FanoutImageIntegrity gate | After fix |
| 3 | Re-run evals (8+ biggie-nebula runs) and verify subscore variance | After fix |
| 4 | Investigate inject_stuck_slots timing if hosted validation anomaly persists | Low |
The core task is solid. The Redis cluster chaos injection, the multi-node investigation requirement, the image-build-and-push-to-Harbor workflow — these test real SRE and DevOps skills. The grader just needs to be aligned so the fanout check tests functional correctness (no Redis errors) rather than a specific undocumented log format.