Advisory for task authors: a guide to finding ideas that will pass overlap review on the first try.
| Closed | Mostly Saturated | Partially Explored | Wide Open |
|---|---|---|---|
| CI/CD Pipeline Flow | Prometheus + Grafana | KEDA Autoscaling | GlitchTip |
| ArgoCD Sync + Drift | Loki + Fluent Bit | Istio Service Mesh | Maddy Mail Server |
| PostgreSQL | Gitea + Actions | MongoDB | Statping-ng |
| | Harbor Registry | Jaeger Tracing | CronJobs / DaemonSets / StatefulSets |
| | Keycloak / Auth Services | ConfigMap + Secret Propagation | |
| | ResourceQuotas + LimitRanges | MinIO Lifecycle | |
| | Node Pressure + Eviction | Grafana OnCall | |
Closed = your idea will almost certainly fail overlap review, regardless of how you frame it. 10+ existing tasks cover every major angle. Don't submit new tasks here.
Mostly saturated = heavy existing coverage (5-14 tasks), but narrow openings remain. You'll need a clearly distinct angle — see Remaining Angles for what's left, and the Framing Guide for strategies to survive overlap review in these areas.
Partially explored = room exists, but check existing ideas first to make sure your specific angle is distinct.
Wide open = strong opportunities with zero or near-zero existing coverage. Start here.
Before proposing an idea, run through these questions:
- Is it a PostgreSQL task?
  - Yes → Will not pass. 16-20 existing tasks cover every angle (WAL, HA, split-brain, DR, migrations, pooling, credential rotation, operator management). See Appendix A.
  - No → Continue.
- Does the primary challenge involve CI/CD pipeline flow? (Gitea Actions → Harbor push/pull → ArgoCD sync → image promotion)
  - Yes → Will not pass. 13 tasks cover every pipeline stage. See Appendix A.
  - No → Continue.
- Does it involve ArgoCD sync, drift, or reconciliation?
  - Yes → Will not pass. 12 tasks cover sync loops, persistent drift, wave deadlocks, AppProject RBAC, and image updater.
  - No → Continue.
- Does it involve resource quotas, limits, or eviction? (ResourceQuota, LimitRange, node pressure, PriorityClasses)
  - Yes → Mostly saturated. See Appendix B and check Remaining Angles.
  - No → Continue.
- Does it involve Keycloak, SSO, OIDC, or authentication services?
  - Yes → Mostly saturated. 6+ tasks cover IAM deployment, SSO integration across dev tools, auth chain drift, key rotation, and API gateway auth. Check Remaining Angles for what's left.
  - No → Continue.
- Does it target Prometheus/Grafana, Loki/Fluent Bit, or another "Mostly Saturated" component?
  - Yes → Narrow openings exist, but you need a clearly distinct angle. Check Remaining Angles before proposing.
  - No → Continue.
- Does it target a "Wide Open" component from the table above?
  - Yes → Strong opportunity. Propose it.
  - No → Check the forum for existing ideas in that area before proposing.
**Tip:** If you landed on "will not pass" but still want to use those components, read The Escape Pattern — it's possible to write viable tasks that touch closed components as long as the primary challenge operates at a different layer.
These components have heavy existing coverage but specific narrow openings remain. If you want to write a task here, it must target one of these gaps — generic ideas in these areas will fail overlap review.
| Component | Tasks | What's Left |
|---|---|---|
| Prometheus + Grafana | ~14 | SLO/SLI burn-rate methodology, Alertmanager routing trees + inhibition rules, remote write / federation, Grafana-as-Code provisioning |
| Loki + Fluent Bit | ~8 | LogQL-based alerting rules (Loki ruler), Fluent Bit parser/filter chain debugging (not throughput/backpressure) |
| Gitea + Actions | ~10 | Repository governance (branch protection, merge policies), workflow YAML authoring/debugging, runner resource management. Not the pipeline flow. |
| Harbor Registry | ~8 | Robot account management, per-project storage quotas, retention policies, replication configuration. Not push/pull/GC/auth in a pipeline context. |
| ResourceQuotas + LimitRanges | ~5 | LimitRange default injection failures (implicit limits causing non-obvious OOMKills). Very narrow. |
| Node Pressure + Eviction | ~6 | Disk pressure eviction specifically (ephemeral storage, imagefs vs nodefs). Memory/CPU/PID paths are covered. |
| Keycloak / Auth Services | ~6 | OIDC federation failures across multiple realms, Keycloak upgrade/migration scenarios, auth audit/compliance reporting. Core SSO integration (single-realm OIDC clients, role mapping, key rotation, token validation) is thoroughly covered. |
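The LimitRange row above turns on implicit default injection. As an illustrative sketch (name and namespace are made up), a namespace-level default like this silently attaches limits to any container that declares none:

```yaml
# Hypothetical LimitRange: any container created in this namespace
# without explicit limits silently receives these defaults.
apiVersion: v1
kind: LimitRange
metadata:
  name: implicit-defaults        # illustrative name
  namespace: team-apps           # illustrative namespace
spec:
  limits:
    - type: Container
      default:                   # injected as limits when none are set
        memory: 128Mi
      defaultRequest:            # injected as requests when none are set
        memory: 64Mi
```

A workload authored without a memory limit is then OOMKilled at 128Mi even though nothing in its own manifest says so, which is exactly the non-obvious failure mode the table calls out.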
**Warning:** Even for these remaining angles, check the forum first. Ideas here are harder to get right, and reviewers will scrutinize overlap carefully. See the Framing Guide for how to frame ideas that survive review in saturated areas.
Ranked by how much clean surface area exists. Each includes concrete task concepts ready to propose.
1. KEDA (Event-Driven Autoscaling)
KEDA is deployed in the cluster but has limited task coverage targeting it directly. The debugging surface is distinct from HPA: misconfigured triggers that silently don't fire, TriggerAuthentication failures against event sources (RabbitMQ, Prometheus), conflicts when both HPA and KEDA target the same deployment, and scaling-to-zero edge cases.
Best fit: Cloud Ops or Platform Engineering
Starter concepts:
- KEDA Trigger Authentication Failure Blocks Event-Driven Autoscaling
- HPA/KEDA Scaling Conflict Causes Pod Count Oscillation
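For orientation, the trigger wiring looks roughly like this (all names are illustrative). A broken secretTargetRef leaves the ScaledObject idle rather than erroring loudly:

```yaml
# Hypothetical KEDA setup: if the referenced Secret is wrong, the trigger
# never fires and the workload quietly stays at minReplicaCount.
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: rabbitmq-auth            # illustrative
spec:
  secretTargetRef:
    - parameter: host
      name: rabbitmq-conn        # Secret assumed to hold the AMQP URI
      key: host
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler            # illustrative
spec:
  scaleTargetRef:
    name: worker                 # must not also be targeted by a plain HPA
  minReplicaCount: 0             # scale-to-zero edge cases live here
  triggers:
    - type: rabbitmq
      metadata:
        queueName: jobs
        mode: QueueLength
        value: "20"
      authenticationRef:
        name: rabbitmq-auth
```

Note the `scaleTargetRef` comment: KEDA creates its own HPA under the hood, so pointing a second, hand-written HPA at the same Deployment produces the oscillation the second starter concept describes.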
2. GlitchTip (Error Tracking)
GlitchTip is Nebula's Sentry-compatible error tracking platform. It's a genuinely distinct observability dimension from metrics (Prometheus), logs (Loki), and traces (Jaeger). Tasks could target DSN misconfiguration causing silent event loss, ingestion pipeline failures, or alert routing that masks critical exceptions. Some adjacent coverage exists, so frame ideas around GlitchTip-specific failure modes rather than generic observability.
Best fit: SRE or Cloud Ops
Starter concepts:
- GlitchTip Error Ingestion Pipeline Failure — Services Silently Dropping Exceptions
- GlitchTip Alert Routing Misconfiguration Masks Production Errors
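For context, wiring a service to GlitchTip is typically just a Sentry-compatible DSN in the environment, which is exactly why misconfiguration is silent. A hypothetical Deployment fragment (the DSN host and project ID are placeholders):

```yaml
# Hypothetical app Deployment env fragment: a wrong DSN doesn't crash the
# service; the SDK simply fails to deliver events, so errors vanish silently.
env:
  - name: SENTRY_DSN             # read by Sentry-compatible SDKs
    value: "https://<key>@glitchtip.nebula.local/1"   # placeholder DSN
```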
3. Statping-ng (Status Page)
Zero approved ideas. Statping-ng is a standalone status page with its own health checks and user-facing availability dashboard. Distinct from Blackbox Exporter synthetic monitoring (which feeds into Prometheus/Grafana).
Best fit: SRE
Starter concepts:
- Status Page Reports All-Green While Services Are Down
- Statping-ng Flapping Monitors Flood Notification Channels
4. Maddy (Mail Server)
Zero approved ideas. Maddy handles SMTP relay for the platform (Grafana notifications, OnCall alerts, etc.). Runs as a StatefulSet with three mailboxes (devops@, operator@, opsmanager@nebula.local). SMTP misconfiguration, TLS negotiation failures, and the downstream impact of alert emails never arriving are all clean territory.
Best fit: SRE or DevOps
Starter concepts:
- Maddy SMTP Relay Failure Silently Drops Alert Notification Emails
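To make the downstream-impact angle concrete: Grafana's email notifications can point at the relay through its standard GF_SMTP_* environment variables. A hypothetical Deployment fragment (the service address is assumed, not taken from the platform inventory):

```yaml
# Hypothetical Grafana env fragment: alert email delivery depends entirely
# on this relay. If Maddy rejects or drops the message, Grafana only logs
# a warning and the notification never reaches anyone.
env:
  - name: GF_SMTP_ENABLED
    value: "true"
  - name: GF_SMTP_HOST
    value: "maddy.mail.svc.cluster.local:587"   # illustrative address
  - name: GF_SMTP_FROM_ADDRESS
    value: "alerts@nebula.local"
```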
5. CronJobs / DaemonSets / StatefulSets
Zero approved ideas targeting these workload types specifically. CronJob failure chains (missed schedules, concurrency policy deadlocks), DaemonSet rolling updates creating gaps, and StatefulSet ordered scaling with PVC lifecycle issues are all untouched.
Best fit: Cloud Ops
Starter concepts:
- CronJob Concurrency Policy Deadlock Causes Backup Job Backlog
- StatefulSet Scale-Down Orphans Persistent Volumes
- DaemonSet Rolling Update Creates Logging Gap
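The concurrency-policy concept can be sketched with a hypothetical backup CronJob. The interplay of concurrencyPolicy, startingDeadlineSeconds, and activeDeadlineSeconds is where the deadlock lives:

```yaml
# Hypothetical backup CronJob: with concurrencyPolicy: Forbid, one hung
# job blocks every subsequent run; once startingDeadlineSeconds passes,
# runs are counted as missed and the backlog grows silently.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup           # illustrative
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid      # never run two backups at once
  startingDeadlineSeconds: 300   # give up on a run 5 min after its slot
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600   # kill hung jobs so the chain unblocks
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: backup-tool:latest   # illustrative image
```

Omitting `activeDeadlineSeconds` is one plausible seed for the deadlock: a single stuck job then blocks the schedule indefinitely.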
6. ConfigMap + Secret Propagation
No approved ideas target the K8s propagation problem. Distinct from credential rotation tasks (which are about the values) — this is about the delivery mechanism: ConfigMap updated but pods serve stale config, immutable ConfigMap blocks emergency fixes, Secret rotation leaves pods split across old/new values.
Best fit: Cloud Ops or Platform Engineering
Starter concepts:
- ConfigMap Update Propagation Failure — Pods Serve Stale Configuration
- Immutable ConfigMap Trap Blocks Emergency Configuration Fix
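The stale-config failure mode exists because pods consume ConfigMaps at creation time: env vars never refresh, and mounted files refresh eventually but the process may never re-read them. One common mitigation, assuming a Helm-templated Deployment, is the checksum-annotation pattern, where any config change forces a rollout:

```yaml
# Sketch of the checksum-annotation pattern (Helm template syntax assumed):
# changing the ConfigMap changes the hash, which changes the pod template,
# which triggers a rolling update instead of serving stale config.
spec:
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
```

A task in this area could just as easily center on the absence of this pattern, or on an immutable ConfigMap that makes the "just edit it" emergency fix impossible.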
7. Grafana OnCall (Incident Response)
One vague approved idea exists (just a title, no description). Room for clearly scoped tasks around escalation chain failures, schedule rotation bugs, or integration breakdowns between OnCall and Mattermost/Maddy.
Best fit: SRE
Starter concepts:
- Grafana OnCall Escalation Chain Broken — Incidents Route to Nobody
- On-Call Schedule Rotation Failure During Handoff Window
8. Istio Service Mesh
Some coverage exists, but tasks focused on traffic management (as opposed to resource pressure from sidecars) have room. mTLS policy failures, VirtualService routing misconfigurations, and sidecar injection issues in specific namespaces are potential angles.
Best fit: Platform Engineering or SRE
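As one illustration of the traffic-management surface, a VirtualService that routes to a subset no DestinationRule defines fails silently until traffic starts returning 503s. A minimal sketch with illustrative names:

```yaml
# Hypothetical routing pair: the VirtualService sends traffic to subset v2,
# which only resolves if a DestinationRule defines it with matching labels.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout                 # illustrative
spec:
  hosts:
    - checkout.team-apps.svc.cluster.local
  http:
    - route:
        - destination:
            host: checkout.team-apps.svc.cluster.local
            subset: v2           # 503s unless a DestinationRule defines v2
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout
spec:
  host: checkout.team-apps.svc.cluster.local
  subsets:
    - name: v2
      labels:
        version: v2              # must match pod labels, or no endpoints
```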
9. MinIO Lifecycle
Distinct from Harbor registry operations: lifecycle policies, bucket versioning, and cross-service storage access patterns are open angles.
Best fit: Cloud Ops
The platform also includes RabbitMQ, Redis, CoreDNS, Mattermost, and Chaos Mesh, among others. If your idea targets a component not in any column, check the forum for existing coverage — absence from the table means the component simply hasn't been categorized yet, not that it's open or closed.
If your idea touches saturated components but the primary challenge operates at a different Kubernetes layer, it can still work.
The stack has distinct operational layers. Existing tasks saturate the middle layers; the edges are less covered:
┌─────────────────────────────────────────────────────┐
│ API Admission (webhooks, CRD validation) │ ← less covered
├─────────────────────────────────────────────────────┤
│ Scheduling + Resources (quotas, eviction, priority)│ ← SATURATED
├─────────────────────────────────────────────────────┤
│ Workload Orchestration (GitOps, deploys, rollouts) │ ← SATURATED
├─────────────────────────────────────────────────────┤
│ Application Runtime (pods, services, networking)│ ← partially covered
├─────────────────────────────────────────────────────┤
│ Data + Storage (databases, queues, object) │ ← less covered
└─────────────────────────────────────────────────────┘
Examples of the escape pattern working:
- Admission Webhook Cascade Failure — touches KEDA, Istio, and ArgoCD (all heavily covered components) but the actual challenge is API admission control, CRD versioning, and webhook lifecycle. Same components, different layer. Approved.
- The Operator Takeover — originally framed as "deploy CloudNativePG" (overlaps with PostgreSQL HA). Reframed to "live-migrate production databases under traffic" — a distinct operation category (migration execution vs. greenfield build). Approved after reframing.
**Note:** The key question: is the primary challenge about the same operation as an existing task (build, troubleshoot, configure), or about a fundamentally different operation (migrate, audit, enforce, orchestrate) that happens to involve the same components?
For more reframing strategies with real before/after examples, see the Framing Guide.
| Category | Spec ID |
|---|---|
| DevOps | b407a435-9dc1-4cc3-950c-3194a8f08fde |
| SRE | 46394e31-2a74-47c1-8359-51e1b678146d |
| Platform Engineering | 9e4d158e-96ff-4435-ab39-4d1e389f4b47 |
| Cloud Ops | 450f2e9c-ba04-429c-bf80-e22be0065313 |
Everything above is actionable guidance. Everything below is the proof.
13 approved tasks and ideas cover the full pipeline from code push to deployment:
Gitea Actions ──→ Docker Build ──→ Harbor Push ──→ ArgoCD Sync ──→ K8s Deploy
│ │ │ │
▼ ▼ ▼ ▼
Cascading CI/CD Harbor GC Deadlock Sync Wave Deployment
Breaking CI/CD GitOps Image Update Deadlock Rollout
Webhook Amplif. Broken Promotion GitOps Drift Failures
The Broken Delivery Sync Loop
Canary Rollouts
Every stage of the pipeline has at least two tasks covering its failure modes. The full inventory:
| # | Task/Idea | Component Focus | Status |
|---|---|---|---|
| 1 | Bleater GitOps Pipeline Repair | Gitea Actions, ArgoCD Image Updater, Harbor | Implemented |
| 2 | Harbor Registry GC Deadlock | Harbor storage, GC jobs | Implemented |
| 3 | ArgoCD Sync Wave Deadlock | ArgoCD sync waves, PreSync hooks | Implemented |
| 4 | Cascading CI/CD Pipeline Failures | Gitea Runner, Harbor creds, ArgoCD, disk space | Implemented |
| 5 | Deployment Rollout Failures | Deployments, security contexts, quotas | Implemented |
| 6 | Breaking CI/CD Pipeline | Gitea Actions tagging, Harbor permissions, ArgoCD updater | Approved |
| 7 | GitOps Image Update + Harbor Auth | Image Updater, Harbor tokens, Helm values | Approved |
| 8 | Broken GitOps Image Promotion | Harbor webhooks, Image Updater auth | Yellow |
| 9 | ArgoCD GitOps Sync Loop | Mutating webhook, KEDA conflict, Helm values | Approved |
| 10 | GitOps Drift That Survives Every Sync | Admission controllers, image automation | Approved |
| 11 | Gitea Webhook Amplification | Gitea webhooks, ArgoCD, Harbor jobservice | Yellow |
| 12 | The Broken Delivery | regcred secret, Helm registry override, CI error masking | Pending |
| 13 | GitOps Canary Rollouts Migration | ArgoCD ApplicationSets, Argo Rollouts, Istio | Pending |
8 approved tasks and ideas cover Kubernetes resource management:
| # | Task/Idea | Component Focus | Status |
|---|---|---|---|
| 1 | Single-Node Chaos Hardening | Node memory pressure, eviction, scheduling | Implemented |
| 2 | Chaos Engineering Resilience | Chaos Mesh, pod-kill, network latency, CPU stress | Implemented |
| 3 | Resource Quota Deadlocks | ResourceQuotas, LimitRanges, PVC quotas | Approved |
| 4 | Deployment Rollout Failures | Resource quotas, security contexts | Implemented |
| 5 | Zombie Process PID Exhaustion | PID limits, init system, process reaping | Yellow |
| 6 | Node Operations — Eviction Mirage | Node drain, PDB, readiness timing | Yellow/rejected |
| 7 | Admission Webhook Cascade | Webhooks, CRD versioning, KEDA finalizers | Approved |
| 8 | Autoscaler Quota Spiral | KEDA, HPA, ResourceQuotas | Implemented |
**Warning:** A coherent causal chain connecting CI/CD to resource exhaustion (e.g., "CI storm causes node pressure, which evicts critical services") still fails overlap review, because each link in the chain is individually claimed by an existing task. Reviewers evaluate overlap at the component × failure-mode level, not at the narrative level.
Six specific constructions were tested:
| # | Proposed Chain | Why It Fails |
|---|---|---|
| 1 | CI storm → node pressure → critical service eviction | CI storm = Cascading CI/CD (#4). Node pressure + eviction = Single-Node Chaos (#1). |
| 2 | Harbor GC → storage exhaustion → CI blockage | Harbor GC = Harbor GC Deadlock (#2). CI blockage from registry = GitOps Pipeline Repair (#1). |
| 3 | Webhook amplification → ArgoCD CPU spike → reconciliation failure | Webhooks = Gitea Webhook Amplification (#11). ArgoCD failure = ArgoCD Sync Loop (#9). |
| 4 | ResourceQuota too tight → deploys fail → CI hangs | Quotas = Resource Quota Deadlocks (#3 resource) + Deployment Rollout Failures (#5 CI/CD). |
| 5 | KEDA autoscaling → quota ceiling → cascade | KEDA + quota = Autoscaler Quota Spiral (#8 resource). |
| 6 | Image pull failures → pod churn → memory pressure → eviction | Image pulls = GitOps Pipeline Repair (#1 CI/CD). Eviction = Single-Node Chaos (#1 resource). |
Components that appear unclaimed (Docker daemon, containerd, etcd) require root access that agents don't have. Components that are unclaimed but narrow (Trivy scanning, Harbor replication, inode exhaustion) can't sustain a 4-hour horizon.
These illustrate the overlap problem in practice:
- "The Repository Knot" — A well-constructed four-layer Gitea failure scenario (default branch switch, connection pool exhaustion, mirror sync overwrites, webhook deadlock). The nebula-reviewer bot flagged 86-88% overlap across three different framings. Rejected.
- "Harbor CI/CD Pipeline Resource Cascade Failure" — A seven-issue cascade across CI/CD and resource exhaustion. Despite multiple attempts to narrow scope and create a "coherent causal chain," every construction overlapped with 2-5 existing tasks. The author was redirected to explore alternative domains.
- "The Operator Takeover" — Originally "deploy CloudNativePG + PgBouncer + WAL archiving." Overlapped with PostgreSQL HA + PgBouncer (Patroni). Successfully reframed to focus on live migration execution — a distinct operation category. Approved after reframing.
This advisory is based on a comprehensive analysis of all approved tasks, implemented tasks, and pending ideas across both #task-idea-feedback and #task-feedback channels, cross-referenced against the full Nebula infrastructure inventory.
Detailed supporting analysis:
- Closed surface area analysis (CI/CD + Resource Exhaustion) — full evidence tables and causal chain testing
- Open opportunity areas (ranked) — component-level gap analysis
- Framing Guide — How to frame ideas that survive overlap review, with real before/after case studies
- Overlap Review Calibration Guide — For reviewers: how to evaluate ideas consistently, interpret bot output, and give constructive feedback
Last updated: 2026-02-18. If you're reading this more than a few weeks after this date, check with reviewers — new tasks may have filled some of these gaps.