MCP Dev 2026 NYC Presentation Proposal

Presentation Proposal: MCP Dev Summit North America 2026

Title: Diagnosis with Agent Evals: Building an MCP Server for Network Troubleshooting

Session Type: 25-minute Session

Track: Security & Operations / MCP Best Practices

Abstract

Troubleshooting Kubernetes network ingress and DNS issues is notoriously complex, requiring deep domain knowledge and context-hopping between multiple layers (HAProxy, CoreDNS, generic K8s resources). While LLMs promise to democratize this knowledge, giving them raw kubectl access is risky and often ineffective. In this session, we present our journey building and rigorously validating a production-grade NetEdge MCP server. We detail how we evolved from a "Phase 0" prototype using gen-mcp to a robust Go implementation, but more importantly, we ask the hard question: do these specialized MCP tools actually help agents solve networking problems better?

To answer this, we used gevals, our agentic evaluation framework, to codify six real-world infrastructure failure scenarios (e.g., misconfigured backendRefs, stale DNS caches, selector mismatches). We ran extensive comparative trials to measure success rates and "Time to Diagnosis" across three distinct agent configurations:

  1. Baseline: Agent with raw kubectl / CLI access (no MCP tooling).
  2. Generalist: Agent with generic Kubernetes MCP tools (Create/Delete Pods, etc.).
  3. Specialized: Agent with our domain-specific NetEdge MCP toolset.

Our results show that specialized tools don't just improve safety: they fundamentally change the agent's problem-solving trajectory. We will share these findings along with the dual-mode architecture (sketched below) that makes such rigorous evaluation possible.
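
To make the dual-mode idea concrete, here is a minimal Go sketch of the kind of interface involved. Only the NidsStore name comes from this proposal; the method names, types, and snapshot layout are illustrative assumptions, not the actual NetEdge API.

```go
package netedge

import (
	"context"
	"os"
	"path/filepath"
)

// NidsStore abstracts where diagnostic data comes from, so the same
// tool logic can run against a live cluster or a static snapshot.
// Method names here are hypothetical, not the real NetEdge API.
type NidsStore interface {
	// GetRoute returns the raw manifest of a route by namespace/name.
	GetRoute(ctx context.Context, namespace, name string) ([]byte, error)
	// GetCoreDNSConfig returns the active CoreDNS Corefile.
	GetCoreDNSConfig(ctx context.Context) (string, error)
}

// SnapshotStore replays a directory of captured diagnostics, which is
// what lets agent benchmarks run offline and reproducibly. A live store
// wrapping a real Kubernetes client would satisfy the same interface.
type SnapshotStore struct {
	Dir string // root of a static diagnostic snapshot (layout assumed)
}

func (s *SnapshotStore) GetRoute(ctx context.Context, namespace, name string) ([]byte, error) {
	return os.ReadFile(filepath.Join(s.Dir, "routes", namespace, name+".yaml"))
}

func (s *SnapshotStore) GetCoreDNSConfig(ctx context.Context) (string, error) {
	b, err := os.ReadFile(filepath.Join(s.Dir, "coredns", "Corefile"))
	return string(b), err
}
```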

Key Takeaways

  • Evaluation First: How to use gevals to engineer valid benchmarks for agent performance against real-world broken scenarios.
  • The Case for Specialization: Comparative data showing why domain-specific MCP tools outperform generalist K8s tools for complex troubleshooting.
  • Architecting for Testability: Designing MCP servers that work offline to enable reproducible agent benchmarks.

Session Outline

  1. The Problem Space (3 mins)
    • Why K8s networking is hard (layers: Ingress -> Service -> Pod -> DNS).
    • The hypothesis: Specialized tools > Generic access.
  2. Architecting for Diagnosis (7 mins)
    • Evolution from CLI wrappers to native Go tools (inspect_route, get_coredns_config); see the tool-handler sketch after this outline.
    • The "NidsStore" Interface: Decoupling logic from data to enable offline replay.
    • The "Offline" Innovation: Why running agents against static diagnostic snapshots is crucial for reproducible benchmarks.
  3. Proving It: Rigorous Evaluation with gevals (10 mins)
    • The Methodology: Defining six "Broken Cluster" scenarios based on real support tickets (see the scenario sketch after this outline).
    • The Showdown:
      • Scenario A: Agent vs. kubectl (The "Hallucination Hazard").
      • Scenario B: Agent vs. Generic K8s MCP (The "Context Overload").
      • Scenario C: Agent vs. NetEdge MCP (The "Guided Path").
    • Results: Metrics on success rate, step count, and safety violations.
  4. Future of Agentic Ops (3 mins)
    • Using these benchmarks to drive future tool design.
    • Call to action: Stop building tools; start building evaluations.
  5. Q&A (2 mins)
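
As a rough illustration of what a native Go tool can do beyond wrapping kubectl, here is a hedged sketch of an inspect_route handler. The tool name comes from the outline above; the ToolResult type, handler signature, and placeholder diagnostic logic are hypothetical stand-ins, not any specific MCP SDK's API.

```go
package netedge

import (
	"context"
	"fmt"
	"strings"
)

// ToolResult is a hypothetical stand-in for an MCP tool's response.
type ToolResult struct {
	Text    string
	IsError bool
}

// InspectRoute returns a focused diagnosis instead of raw YAML,
// keeping the agent's context window small and on-topic.
func InspectRoute(ctx context.Context, store NidsStore, namespace, name string) (*ToolResult, error) {
	raw, err := store.GetRoute(ctx, namespace, name)
	if err != nil {
		return &ToolResult{
			Text:    fmt.Sprintf("route %s/%s not found: %v", namespace, name, err),
			IsError: true,
		}, nil
	}
	// The real tool would parse backendRefs and resolve the referenced
	// Services and Endpoints; this placeholder only flags a missing
	// backendRefs stanza as an example of a guided finding.
	if !strings.Contains(string(raw), "backendRefs") {
		return &ToolResult{Text: "no backendRefs found: route has no backends configured"}, nil
	}
	return &ToolResult{
		Text: fmt.Sprintf("route %s/%s: backendRefs present (%d bytes inspected)", namespace, name, len(raw)),
	}, nil
}
```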
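
And here is one way the six broken-cluster scenarios could be expressed as data for comparative trials. This is purely illustrative: it is not gevals's actual scenario schema, and the field names, step budget, and the sample fault shown are assumptions.

```go
package netedge

// Scenario pairs an injected fault with the diagnosis an agent should
// reach and the budget it gets to reach it. This shape is illustrative;
// gevals's real scenario format may differ.
type Scenario struct {
	Name          string   // short identifier, e.g. "selector-mismatch"
	SnapshotDir   string   // static diagnostics captured from the broken cluster
	InjectedFault string   // what was broken, for the grader's reference
	ExpectedDiag  []string // substrings a correct diagnosis should contain
	MaxSteps      int      // step budget before the trial counts as a failure
}

// One sample entry; the values are hypothetical, not measured data.
var scenarios = []Scenario{
	{
		Name:          "misconfigured-backendref",
		SnapshotDir:   "snapshots/backendref",
		InjectedFault: "HTTPRoute backendRef points at a nonexistent Service",
		ExpectedDiag:  []string{"backendRef", "Service not found"},
		MaxSteps:      15,
	},
}
```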

Intended Audience

DevOps Engineers, Platform Builders, and MCP Tool Developers looking to build robust, secure, and testable server implementations for complex infrastructure.

@candita commented Jan 21, 2026

Overall, a nice proposal, but it covers a lot, maybe too much. It's almost enough for two talks. Also, concluding with a call to action for evals but not mentioning evals in the title is dissonant. It might be nice to shorten the title and add "evals" and "troubleshooting" to it, e.g. "Diagnosis with agent evals: Building an MCP Server for Network Troubleshooting". You may want to drop the mention of the OpenShift tool (must-gather), which afaik isn't used in Kubernetes.

@bentito (Author) commented Jan 21, 2026

Thanks @candita, I've updated based on your feedback. Give it another read if you have a chance.
