zheng022/discussion.md

## discussion.md

      
🚀 Improving the GHES Manage Developer Experience: What We Did and What's Changed

You've told us this was hard!

Building new APIs in GHES Manage has too often felt like a challenging expedition: confusing paths, hidden "gotchas," and the occasional moment where you're just… banging your head against the desk wondering why something that should be straightforward is taking so long.
GHES Manage was created three years ago as a modern alternative to the legacy Enterprise Management Console, but adoption hasn't met our expectations. We've heard consistent feedback that implementing new API endpoints is difficult and unintuitive.
So we've been focused on improving the GHES Manage API.

Here's a breakdown of what we shipped across the main workstreams:

🏗️ 1. Structural Refactoring — Making the Code Make Sense

One of the most consistent pieces of feedback was that the internal architecture was confusing — business data invisibly stuffed into Go contexts, providers that existed only to return constants, dead code scattered throughout, and a catch-all enum package that made it hard to find anything. If you've ever wondered "where does this data actually come from?", you weren't alone.
We tackled this head-on. Configuration and state data (GitHub config, cluster config, node state) is no longer smuggled through context.Context — it's now passed explicitly as typed structs, loaded close to where it's actually used. We cleaned out dead code and unused structs, and reorganized enums into proper domain packages with safe constructors.
The result: when you read the code now, the data flow is visible, the package structure reflects the domain, and there's a lot less "why does this exist?" noise to wade through.

📊 2. Observability — You Can't Fix What You Can't See

Before this work, GHES Manage was largely a black box in production. Gateway logs were noisy with health pings but lacked response times, and the agents emitted no service-level metrics at all. If something was slow or broken, you were mostly guessing.
We introduced OpenTelemetry metrics across both the gateway and agents. The agents now emit granular per-service metrics (call counts, durations, and status codes for every Twirp method) with distinct prefixes (ghes_manage_agent_* vs ghes_manage_gateway_*) so you can immediately tell which component you're looking at. We also added meaningful tags to gateway metrics (user agent, status codes, handler names) to give better visibility into API usage patterns.
To tie it all together, we built a new Grafana dashboard with deeper telemetry into agent and gateway behavior across all nodes. This is a big step toward making GHES Manage supportable and debuggable in production — not just something you deploy and hope for the best.


🧪 3. Testing & Code Quality — Raising the Bar

Testing in GHES Manage had some deep-rooted problems. Environment helpers like IsTesting() and IsProduction() leaked test-specific code paths into the production binary. Some code bypassed the internal shell abstraction entirely, and the lack of filesystem abstraction made some paths effectively untestable. The linting configuration also diverged from GitHub's standards, leading to inconsistent review feedback.
We overhauled the testing foundation: test data is now accessed through afero instead of relying on filesystem state, and all shell execution goes through a single mockable interface. We also aligned the golangci-lint configuration with GitHub's official go-linter standards, so the same rules run consistently across local development, IDE, and CI — no more "works on my machine" surprises during code review.
These aren't flashy changes, but they directly reduce the friction of writing and reviewing new code. Tests are easier to write, easier to trust, and cover broader paths — which is essential for development without a real running GHES instance.

🐛 4. Housekeeping & Developer Workflow

We fixed several bugs that had been lurking undetected. In the process, we also prompted the retirement of Cluster HA.
We streamlined the developer workflow by standardizing on ./script/rerun-bpdev as the single recommended development loop.
We finalized a step-by-step "How To" guide and checklist for building new endpoints.

👥 Team

A huge thank you to the team who made this happen:

@Jcambass — Led the structural refactoring and prompted the retirement of Cluster HA. North star of the project with an unwavering high standard.
@dob9601 — Led the testing framework overhaul
@BenedictNg1024 — Led the observability work
@zheng022 — Bug fixes, linting, data structure refactoring, housekeeping
@KaczDev — Early feedback that helped shape the epic
@manue1 — Key decision maker on the refactoring and the go-to consultant whenever questions came up


If you've worked with GHES Manage before and found it painful, we'd love to know if these changes help. And if there are other pain points we haven't addressed, please let us know — our goal is for GHES Manage to become the one-and-only API control plane for GHES.