@arubis
Last active March 12, 2026 00:28
Review: Redis Cluster Slot Migration Deadlock (v70) — f925de8b

Review: Redis Cluster Slot Migration Deadlock (v70)

Task UUID f925de8b-6df4-4867-b4da-6ff4e1012a8a
Version v70 (2026-03-11)
Eval model biggie-nebula, 10 runs
Solution test PASSES (1.0)

Scores

Mean: 0.54 across 10 runs.

All runs land at exactly 0.4 or 0.6 — a bimodal distribution with no other values:

Run  1: ████████░░ 0.4
Run  2: ████████████░░░ 0.6
Run  3: ████████████░░░ 0.6
Run  4: ████████░░ 0.4
Run  5: ████████████░░░ 0.6
Run  6: ████████████░░░ 0.6
Run  7: ████████░░ 0.4
Run  8: ████████████░░░ 0.6
Run  9: ████████████░░░ 0.6
Run 10: ████████████░░░ 0.6

While 0.54 is below the 0.70 threshold, 0.4 of the total score is unachievable by any agent due to a grader issue. The effective score ceiling is 0.6.
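
The mean and the bimodal split can be checked directly from the ten run scores listed above. This is a minimal sketch; the variable names are illustrative and not taken from the eval tooling.

```python
from collections import Counter

# Per-run scores as reported in the list above (runs 1 through 10).
run_scores = [0.4, 0.6, 0.6, 0.4, 0.6, 0.6, 0.4, 0.6, 0.6, 0.6]

# Round to two decimals to avoid floating-point noise in the sum.
mean_score = round(sum(run_scores) / len(run_scores), 2)

# Count how many runs landed on each distinct score.
distribution = Counter(run_scores)

print(mean_score)          # 0.54
print(dict(distribution))  # {0.4: 3, 0.6: 7}
```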


Per-Check Breakdown

| Check | Pass Rate | Variance | Verdict |
| --- | --- | --- | --- |
| AllSlotsCovered | 10/10 | None | Always passes; no difficulty signal |
| GrafanaRedisDashboard | 10/10 | None | Always passes; no difficulty signal |
| NoStuckSlots | 7/10 | Yes | Genuine difficulty (see below) |
| FanoutNoRedisErrors | 0/10 | None | Grader bug: undisclosed requirement |
| FanoutImageIntegrity | 0/10 | None | Gated on FanoutNoRedisErrors; never evaluated |

Only 1 of 5 subscores shows any variance across 10 runs.

---
config:
  theme: default
---
flowchart LR
    subgraph pass ["✅ Always Pass"]
        A["AllSlotsCovered<br/>10/10"]
        B["GrafanaRedisDashboard<br/>10/10"]
    end
    subgraph vary ["⚡ Genuine Variance"]
        C["NoStuckSlots<br/>7/10"]
    end
    subgraph fail ["❌ Always Fail"]
        D["FanoutNoRedisErrors<br/>0/10"]
        E["FanoutImageIntegrity<br/>0/10"]
    end
    style pass fill:#d4edda,stroke:#28a745
    style vary fill:#fff3cd,stroke:#ffc107
    style fail fill:#f8d7da,stroke:#dc3545

Blocking Issue: [fanout] ADD/DEL Log Patterns

What the grader does

check_fanout_no_redis_errors has two sub-conditions:

  1. No Redis cluster errors in Loki (no MOVED / CLUSTERDOWN / ResponseError) — this is reasonable
  2. Presence of [fanout] ADD or [fanout] DEL in Loki logs — this is the problem
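
The two sub-conditions can be sketched as a pure function over log lines. This is an approximation for illustration only: the real check queries Loki, and the function name `evaluate_fanout_logs`, the regexes, and the sample log lines below are assumptions rather than code from grader.py.

```python
import re

# Patterns taken from the description above. The real grader runs Loki
# queries; this sketch works on an in-memory list of log lines instead.
ERROR_PATTERN = re.compile(r"MOVED|CLUSTERDOWN|ResponseError")
SUCCESS_PATTERN = re.compile(r"\[fanout\] (ADD|DEL)")


def evaluate_fanout_logs(log_lines):
    """Return (no_errors, has_marker, passed) for a batch of log lines."""
    no_errors = not any(ERROR_PATTERN.search(line) for line in log_lines)
    has_marker = any(SUCCESS_PATTERN.search(line) for line in log_lines)
    # Both sub-conditions must hold for the check to pass.
    return no_errors, has_marker, no_errors and has_marker


# Hypothetical logs from a healthy service that writes to Redis silently.
quiet_but_healthy = ["INFO: Application startup complete.", "GET /healthz 200"]
print(evaluate_fanout_logs(quiet_but_healthy))  # (True, False, False)
```

A service that produces no Redis errors still fails, because the second sub-condition requires a log line it was never asked to emit.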

Why it's a problem

Those log patterns exist only in solution.sh's embedded main.py. They are not present in:

  • The original fanout-service source code (app/fanout-service/app/main.py — uses silent redis_client.lpush() with no logging)
  • The task.yaml prompt
  • The Gitea issue created by setup.sh

Across all 10 transcripts, every agent rewrites the fanout code to use RedisCluster and every agent attempts to build and push the image. But zero agents add [fanout] ADD/DEL print statements, because nothing in the environment tells them to.

The cascade effect

flowchart LR
    A["FanoutNoRedisErrors<br/>(0.2 weight)"] -->|"gates"| B["FanoutImageIntegrity<br/>(0.2 weight)"]
    A -->|"always fails<br/>(undisclosed log pattern)"| X["❌ 0/10"]
    B -->|"auto-skipped"| Y["❌ 0/10"]
    style A fill:#fee,stroke:#c00
    style B fill:#fee,stroke:#c00
    style X fill:#fcc,stroke:#c00
    style Y fill:#fcc,stroke:#c00

Because FanoutImageIntegrity is gated on FanoutNoRedisErrors, agents lose 0.4 points (two checks) for what is effectively one undisclosed requirement. The maximum achievable score for any agent is 0.6.
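
The ceiling follows from the gate. In the sketch below, only the two fanout weights (0.2 each) are stated in this review; the equal 0.2 weights for the other three checks are an assumption inferred from the observed 0.4 and 0.6 scores, and `total_score` is a hypothetical helper, not the grader's own function.

```python
# Assumed weights: the fanout weights come from the diagram above, the
# remaining three are inferred from the observed run scores.
WEIGHTS = {
    "AllSlotsCovered": 0.2,
    "GrafanaRedisDashboard": 0.2,
    "NoStuckSlots": 0.2,
    "FanoutNoRedisErrors": 0.2,
    "FanoutImageIntegrity": 0.2,
}

# A gated check is only evaluated when its gate check has passed.
GATES = {"FanoutImageIntegrity": "FanoutNoRedisErrors"}


def total_score(results):
    """Sum the weights of passed checks, skipping checks whose gate failed."""
    score = 0.0
    for check, weight in WEIGHTS.items():
        gate = GATES.get(check)
        if gate is not None and not results.get(gate, False):
            continue  # gate failed, so this check is skipped and scores zero
        if results.get(check, False):
            score += weight
    return round(score, 2)


# Best case today: everything is correct, but the undisclosed log pattern
# is missing, so FanoutNoRedisErrors fails and takes its gated check down.
best_case = {name: True for name in WEIGHTS}
best_case["FanoutNoRedisErrors"] = False
print(total_score(best_case))  # 0.6
```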

Recommended fix

Remove the success_query (grader.py lines 204–241). The error-absence check alone already proves the fanout service is operating correctly against the Redis Cluster:

# CURRENT (lines 161–243): two Loki queries
#   1. error_query  — checks for MOVED/CLUSTERDOWN       ✅ keep this
#   2. success_query — checks for [fanout] ADD/DEL        ❌ remove this

# AFTER FIX: only the error-absence check remains
# If no Redis errors are detected in Loki over the polling window,
# the fanout service is working correctly against the cluster.

Alternative: If you want to keep the log-based verification, add the required format to the Gitea issue body in create_gitea_issue():

"The updated fanout service must log processed operations with [fanout] ADD and [fanout] DEL prefixes to stderr for observability."

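If the log-based route is chosen, the disclosed requirement could be satisfied by a small helper like the one below. This is a sketch under assumptions: `log_fanout` is a hypothetical name, the text after the prefix (here, the key) is not specified anywhere in this review, and which Redis calls count as ADD versus DEL is left to the service author.

```python
import sys


def log_fanout(operation, key):
    """Write one stderr line carrying the [fanout] ADD or [fanout] DEL prefix."""
    if operation not in ("ADD", "DEL"):
        raise ValueError(f"unsupported fanout operation: {operation}")
    # Only the prefix is required by the proposed issue text; appending the
    # key is an illustrative choice.
    line = f"[fanout] {operation} {key}"
    print(line, file=sys.stderr, flush=True)
    return line


# Example: call the helper right after the corresponding Redis write.
log_fanout("ADD", "timeline:42")
log_fanout("DEL", "timeline:42")
```
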

What's Working Well

NoStuckSlots — genuinely good design

The two-batch chaos injection is the best part of this task:

  • Batch A (node-0 → node-1): slots 100, 500, 1000 — MIGRATING + DELSLOTS creates coverage gaps
  • Batch B (node-1 → node-2): slots 6000, 8000 — MIGRATING + IMPORTING markers only

After fixing Batch A, cluster_slots_ok returns 16384 — a misleading "all clear" signal. Three agents (runs 1, 4, 7) trusted this and moved on, never checking node-1 where Batch B's markers live. The seven passing agents queried all three nodes early and caught both batches.

This is a real SRE skill: verifying cluster state comprehensively rather than trusting a single node's perspective. The 70/30 split is a healthy difficulty gradient — keep this as-is.
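
The reason a single node's view is not enough can be shown with the output of `CLUSTER NODES`, which lists open slots only on the line flagged `myself`, using `[slot->-node_id]` for MIGRATING and `[slot-<-node_id]` for IMPORTING. The sketch below parses that text for each node; the helper names, the shortened node IDs, and the addresses in the sample data are made up for illustration.

```python
import re

# Open-slot markers as they appear in CLUSTER NODES output:
#   [slot->-node_id]  the slot is MIGRATING to node_id
#   [slot-<-node_id]  the slot is IMPORTING from node_id
OPEN_SLOT = re.compile(r"\[(\d+)-([<>])-([0-9a-f]+)\]")


def open_slots(cluster_nodes_output):
    """Return (slot, state, peer_id) tuples found on the 'myself' line."""
    found = []
    for line in cluster_nodes_output.splitlines():
        fields = line.split()
        # The third field holds comma-separated flags such as "myself,master".
        if len(fields) < 3 or "myself" not in fields[2].split(","):
            continue
        for slot, direction, peer in OPEN_SLOT.findall(line):
            state = "MIGRATING" if direction == ">" else "IMPORTING"
            found.append((int(slot), state, peer))
    return found


def scan_cluster(outputs_by_node):
    """Map each node name to its open slots, omitting nodes that have none."""
    report = {}
    for node, text in outputs_by_node.items():
        slots = open_slots(text)
        if slots:
            report[node] = slots
    return report


# Hypothetical outputs after Batch A is repaired but Batch B is untouched.
# Real node IDs are 40 hex characters; these are shortened for readability.
sample = {
    "node-0": "aaa0 10.0.0.10:6379@16379 myself,master - 0 0 1 connected 0-5460",
    "node-1": "bbb1 10.0.0.11:6379@16379 myself,master - 0 0 2 connected 5461-10922 [6000->-ccc2] [8000->-ccc2]",
    "node-2": "ccc2 10.0.0.12:6379@16379 myself,master - 0 0 3 connected 10923-16383 [6000-<-bbb1] [8000-<-bbb1]",
}
print(scan_cluster(sample))
```

In this sample, node-0 reports nothing unusual, while node-1 and node-2 still carry the markers for slots 6000 and 8000, which is the situation the three failing runs missed.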

Grafana panels and slot coverage

Every agent finds the 12 required panel titles from the Gitea issue and correctly patches the configmap. Every agent achieves full slot coverage. These checks validate that the baseline task is clear and achievable, even though they don't contribute to difficulty differentiation.


Smaller Issues

FanoutImageIntegrity gate

Once the log-pattern issue is fixed, you'll need to decide what to do with the gate. Even with the gate removed, roughly half the runs would still fail FanoutImageIntegrity: those agents import the image into containerd but never push it to Harbor. The Docker daemon can't resolve harbor.devops.local, so agents must discover ctr images push --plain-http on their own. That's arguably genuine difficulty, but it's worth verifying after the fix.

inject_stuck_slots hosted validation anomaly

The 10 API transcripts show consistent chaos injection — the setup appears to work. The "passed/failed simultaneously" issue you spent 65+ iterations on may be a race condition specific to hosted validation timing rather than a bug in the function itself. If it persists after the grader fix, consider adding a brief sleep + state verification after the SETSLOT commands to ensure markers have propagated before the script exits.
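
The "sleep + state verification" idea amounts to polling until the markers are visible or a deadline passes. Below is a generic sketch in Python for consistency with the other examples; setup.sh would need a shell equivalent. `wait_for` and its parameters are hypothetical, and the probe that actually inspects cluster state is left to the caller.

```python
import time


def wait_for(probe, timeout=10.0, interval=0.5,
             sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or the timeout elapses.

    Returns True if the probe succeeded and False on timeout. The sleep and
    clock arguments are injectable so the helper can be tested without
    real delays.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example with a stand-in probe that succeeds on its third call. In the
# setup script, the probe would check that the injected markers are visible.
attempts = {"count": 0}


def markers_visible():
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_for(markers_visible, timeout=5.0, interval=0.0))  # True
```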

Subscore rebalancing (optional)

With AllSlotsCovered and GrafanaRedisDashboard both at 100% pass rate, 0.4 of the score is effectively free. After fixing the fanout checks, you may find the mean shifts significantly. Re-evaluate whether the check weights need adjustment once you have fresh eval data.


Action Items

| # | Action | Priority |
| --- | --- | --- |
| 1 | Remove success_query from FanoutNoRedisErrors (or document the log format in the Gitea issue) | Blocking |
| 2 | Decide whether to keep / remove the FanoutImageIntegrity gate | After fix |
| 3 | Re-run evals (8+ biggie-nebula runs) and verify subscore variance | After fix |
| 4 | Investigate inject_stuck_slots timing if hosted validation anomaly persists | Low |

Bottom Line

The core task is solid. The Redis cluster chaos injection, the multi-node investigation requirement, the image-build-and-push-to-Harbor workflow — these test real SRE and DevOps skills. The grader just needs to be aligned so the fanout check tests functional correctness (no Redis errors) rather than a specific undocumented log format.
