@dims
Created January 15, 2026 12:43
CNS (Cloud Native Stack) CLI UX Analysis - v4.0

  • Document Version: 4.3
  • Generated: 2026-01-14
  • Last Updated: 2026-01-15
  • Codebase Branch: main
  • Upstream Commit: a68ee61
  • Analyzer: Claude Opus 4.5


Executive Summary

This document provides a comprehensive UX analysis of the CNS CLI tool (cnsctl), covering CLI design patterns, agent deployment security, recipe system coverage, bundler functionality, collector subsystems, and developer experience. This v4.0 is a complete fresh analysis with deep context for each issue.

Key Findings Summary

| Priority | Open | Fixed | Wontfix | Total |
|----------|------|-------|---------|-------|
| Critical | 1    | 3     | 1       | 5     |
| High     | 0    | 12    | 7       | 19    |
| Medium   | 8    | 4     | 4       | 16    |
| Low      | 18   | 0     | 0       | 18    |
| Total    | 27   | 19    | 12      | 58    |

Legend: Open = no action taken, Fixed = merged PR, Wontfix = deliberately not fixing

Open PRs (as of 2026-01-15)

| PR | Description | Status | Our Work? |
|----|-------------|--------|-----------|
| #5 | Add OCI Build and Push functionality | OPEN | No |

Merged PRs from This Analysis (Our Work)

| PR | Issue | Description | Status |
|----|-------|-------------|--------|
| #35 | M27 | Fix image registry in example Job manifest | ✅ MERGED |
| #34 | M3 | Add examples to recipe and bundle command help | ✅ MERGED |
| #33 | (E2E) | Log when CLI flags override snapshot-detected criteria | ✅ MERGED |
| #32 | H25 | Use SSA for atomic ConfigMap updates | ✅ MERGED |
| #31 | H8 | Warn when using base-only config | ✅ MERGED |
| #30 | H13 | Default --fail-on-error to true | ✅ MERGED |
| #29 | H23 | Enable kubeconfig support for bundle command | ✅ MERGED |
| #27 | C1 | Add --privileged flag for PSS compliance | ✅ MERGED |
| #24 | H16 | Return error instead of silent fallback | ✅ MERGED |

Other Merged PRs (Reference)

| PR | Description | Status |
|----|-------------|--------|
| #28 | Dependency upgrades | ✅ MERGED |
| #26 | Add Flox Env for Dev Tooling | ✅ MERGED |
| #23 | Add --image-pull-secret flag | ✅ MERGED |
| #22 | Add info logging to collectors | ✅ MERGED |
| #21 | Add make image target | ✅ MERGED |
| #20 | Stream agent Job logs during wait | ✅ MERGED |
| #19 | Graceful degradation when D-Bus unavailable | ✅ MERGED |
| #18 | Graceful degradation when nvidia-smi missing | ✅ MERGED |
| #17 | Case-insensitive bundle type with typo suggestions | ✅ MERGED |
| #16 | Improve resource cleanup error handling | ✅ MERGED |
| #15 | Fix agent-deployment.md documentation | ✅ MERGED |
| #14 | Add validation for recipe criteria | ✅ MERGED |
| #12 | Standardize CLI flag aliases | ✅ MERGED |

What's New in v4.0

  • Complete fresh analysis with updated codebase (commit 5620b0d)
  • Deep context added for every issue explaining why it matters
  • Identified 4 new issues (H22-H25) from deep analysis
  • Added documentation/Makefile/YAML analysis findings
  • Created 6 PRs: #24, #27, #29, #30, #31, #32
  • Corrected issue counts: 55 total issues (was incorrectly stated as 54)
  • More comprehensive call graphs and architecture diagrams

Table of Contents

  1. Command Architecture
  2. CLI Flag Analysis
  3. Call Graphs
  4. Agent Deployment System
  5. Recipe System
  6. Bundler System
  7. Collector System
  8. Serializer System
  9. Documentation Analysis
  10. Build System Analysis
  11. Issue Catalog
  12. UX Improvement Roadmap
  13. Appendices

1. Command Architecture

1.1 Command Hierarchy

cnsctl (root)
├── snapshot    - Capture system configuration snapshot
├── recipe      - Generate configuration recipe from criteria
├── bundle      - Generate artifact bundle from recipe
├── validate    - Validate cluster against recipe constraints
├── completion  - Shell completion scripts (visible since PR #8)
└── version     - Display version information

1.2 Global Flags

| Flag | Type | Default | Env Var | Description |
|------|------|---------|---------|-------------|
| --debug | bool | false | CNS_DEBUG | Enable debug logging |
| --log-json | bool | false | CNS_LOG_JSON | Enable structured JSON logging |

1.3 Shared Flags

| Flag | Alias | Type | Default | Used By |
|------|-------|------|---------|---------|
| --output | -o | string | stdout | snapshot, recipe, validate, bundle |
| --format | -f | string | yaml | snapshot, recipe, validate |
| --kubeconfig | -k | string | (auto) | snapshot, recipe, validate |

1.4 Command Flow Overview

User Request
    │
    ├─► snapshot ─► Collectors (GPU/K8s/OS/SystemD) ─► Serializer ─► Output
    │       │
    │       └─► [--deploy-agent] ─► K8s Job ─► ConfigMap
    │
    ├─► recipe ─► Criteria ─► Overlay Matcher ─► Merger ─► RecipeResult
    │       │
    │       └─► [--snapshot] ─► Extract criteria from snapshot
    │
    ├─► validate ─► Load Recipe + Snapshot ─► Constraint Evaluator ─► Result
    │
    └─► bundle ─► Registry ─► Parallel Bundlers ─► Deployer ─► Files

2. CLI Flag Analysis

2.1 Snapshot Command Flags

File: pkg/cli/snapshot.go:19-174

| Flag | Alias | Type | Default | Required | Description |
|------|-------|------|---------|----------|-------------|
| --deploy-agent | - | bool | false | No | Deploy K8s Job for snapshot |
| --namespace | - | string | gpu-operator | No | Agent namespace (env: CNS_NAMESPACE) |
| --image | - | string | ghcr.io/nvidia/cns:latest | No | Agent image (env: CNS_IMAGE) |
| --image-pull-secret | - | []string | [] | No | Image pull secrets for private registries |
| --job-name | - | string | cns | No | K8s Job name |
| --service-account-name | - | string | cns | No | ServiceAccount name |
| --node-selector | - | []string | [] | No | Node selectors (key=value) |
| --toleration | - | []string | [] | No | Tolerations (key=value:effect). Default: all taints tolerated |
| --timeout | - | duration | 5m | No | Job completion timeout |
| --cleanup | - | bool | true | No | Remove resources after completion |
| --output | -o | string | stdout | No | Output destination |
| --format | -f | string | yaml | No | Output format (yaml, json, table) |
| --kubeconfig | -k | string | (auto) | No | Path to kubeconfig file |

Key Observations:

  • The --cleanup flag now defaults to true (changed by an earlier fix), so agent resources are removed unless explicitly retained
  • --toleration when empty uses universal toleration (operator: Exists)
  • --kubeconfig flag is present but not used in local snapshot mode (only agent mode)

2.2 Recipe Command Flags

File: pkg/cli/recipe.go:21-145

| Flag | Alias | Type | Default | Required | Description |
|------|-------|------|---------|----------|-------------|
| --service | - | string | - | No | K8s service type (eks, gke, aks, oke) |
| --accelerator | --gpu | string | - | No | GPU type (h100, gb200, a100, l40) |
| --intent | - | string | - | No | Workload intent (training, inference) |
| --os | - | string | - | No | OS type (ubuntu, rhel, cos, amazonlinux) |
| --nodes | - | int | 0 | No | Number of GPU nodes |
| --snapshot | -s | string | - | No | Path/URI to snapshot |
| --output | -o | string | stdout | No | Output destination |
| --format | -f | string | yaml | No | Output format |
| --kubeconfig | -k | string | (auto) | No | Kubeconfig for ConfigMap access |

Key Observations:

  • Either criteria flags OR --snapshot should be provided
  • If --snapshot provided, criteria are extracted from it
  • CLI criteria flags override snapshot-extracted values
  • Validation added in PR #14: at least one criteria required

2.3 Bundle Command Flags

File: pkg/cli/bundle.go:25-202

| Flag | Alias | Type | Default | Required | Description |
|------|-------|------|---------|----------|-------------|
| --recipe | -r | string | - | Yes | Path/URI to recipe |
| --bundlers | -b | []string | [] | No | Bundler types to execute |
| --output | -o | string | . | No | Output directory |
| --set | - | []string | [] | No | Value overrides (bundler:path=value) |
| --system-node-selector | - | []string | [] | No | System component node selectors |
| --system-node-toleration | - | []string | [] | No | System component tolerations |
| --accelerated-node-selector | - | []string | [] | No | GPU node selectors |
| --accelerated-node-toleration | - | []string | [] | No | GPU node tolerations |
| --deployer | - | string | script | No | Deployment method (script, argocd, flux) |

Key Observations:

  • --output is a directory here vs file for other commands (H19)
  • Has --kubeconfig flag defined but never used in code (see H23)
  • --bundlers is case-insensitive since PR #17
  • When --bundlers empty, all registered bundlers execute

2.4 Validate Command Flags

File: pkg/cli/validate.go:20-167

| Flag | Alias | Type | Default | Required | Description |
|------|-------|------|---------|----------|-------------|
| --recipe | -r | string | - | Yes | Path/URI to recipe |
| --snapshot | -s | string | - | Yes | Path/URI to snapshot |
| --fail-on-error | - | bool | false | No | Exit non-zero on validation failure |
| --output | -o | string | stdout | No | Output destination |
| --format | -f | string | yaml | No | Output format |
| --kubeconfig | -k | string | (auto) | No | Kubeconfig for ConfigMap access |

Key Observations:

  • Both --recipe and --snapshot are required
  • Without --fail-on-error, validation failures return exit code 0 (H13)
  • Supports ConfigMap URIs for both inputs

2.5 Flag Consistency Analysis

| Aspect | Commands | Status | Issue |
|--------|----------|--------|-------|
| --format alias -f | All with format | ✅ Consistent | Fixed in PR #12 |
| --output alias -o | All | ✅ Consistent | - |
| --output meaning | bundle=dir, others=file | ❌ Inconsistent | H19 (WONTFIX) |
| --kubeconfig alias -k | snapshot, recipe, validate | ✅ Consistent | - |
| --kubeconfig on bundle | Defined but unused | ❌ Dead code | H23 (NEW) |
| Short alias for --deploy-agent | snapshot | ❌ Missing | H2 |
| Short alias for --fail-on-error | validate | ❌ Missing | - |

3. Call Graphs

3.1 Snapshot Command Call Graph

snapshotCmd() [pkg/cli/snapshot.go:19]
│
├─► Parse CLI flags [snapshot.go:119-168]
│   ├─ serializer.Format(cmd.String("format"))
│   │   └─ Returns JSON, YAML, or Table format
│   ├─ collector.NewDefaultFactory(collector.WithVersion(version))
│   │   └─ Creates factory for GPU, K8s, OS, SystemD collectors
│   └─ snapshotter.NodeSnapshotter{} initialization
│       ├─ Version, Factory, Serializer configured
│       └─ AgentConfig set if --deploy-agent
│
└─► ns.Measure(ctx) [pkg/snapshotter/snapshot.go:42]
    │
    ├─ IF AgentConfig.Enabled:
    │  └─► n.measureWithAgent(ctx) [pkg/snapshotter/agent.go:126-223]
    │      ├─► k8sclient.GetKubeClient(kubeconfig)
    │      │   └─ Returns cached clientset, restconfig
    │      ├─► agent.NewDeployer(clientset, config, opts...)
    │      │   └─ Configures namespace, image, nodeSelector, etc.
    │      │
    │      ├─► deployer.Deploy(ctx) [pkg/k8s/agent/deployer.go:13-47]
    │      │   ├─► d.CheckPermissions(ctx) [permissions.go:11-76]
    │      │   │   └─ SelfSubjectAccessReview for each required permission
    │      │   ├─► d.ensureServiceAccount(ctx) [rbac.go:16-25]
    │      │   ├─► d.ensureRole(ctx) [rbac.go:27-49]
    │      │   ├─► d.ensureRoleBinding(ctx) [rbac.go:51-77]
    │      │   ├─► d.ensureClusterRole(ctx) [rbac.go:79-110]
    │      │   │   └─ **HARDCODED name: "cns-node-reader"** (H24)
    │      │   ├─► d.ensureClusterRoleBinding(ctx) [rbac.go:112-143]
    │      │   │   └─ **HARDCODED name: "cns-node-reader"** (H24)
    │      │   └─► d.ensureJob(ctx) [job.go:12-31]
    │      │       └─► d.buildJob(ctx) [job.go:33-138]
    │      │           ├─ Builds privileged pod spec
    │      │           ├─ Sets nodeSelector, tolerations
    │      │           └─ Adds volume mounts for /run/systemd
    │      │
    │      ├─► deployer.WaitForJobCompletion(ctx, timeout) [wait.go:13-93]
    │      │   ├─ Watch Job status
    │      │   ├─► WaitForPodReady(ctx) [wait.go:96-147]
    │      │   │   └─ Detect pod errors: CrashLoopBackOff, ImagePullBackOff, etc.
    │      │   └─► StreamLogs(ctx) [wait.go:150-195]
    │      │       └─ Stream pod logs with [agent] prefix
    │      │
    │      ├─► deployer.GetSnapshot(ctx) [deployer.go:119-166]
    │      │   └─ Read from ConfigMap, parse YAML
    │      │
    │      └─ defer: deployer.Cleanup(ctx, opts) [deployer.go:72-117]
    │          └─ Delete Job, SA, Role, RoleBinding, ClusterRole, ClusterRoleBinding
    │
    └─ ELSE (local mode):
       └─► n.measure(ctx) [pkg/snapshotter/snapshot.go:53-193]
           ├─ errgroup.WithContext(ctx)
           │
           ├─ g.Go: metadata collection
           │   └─ Hostname, timestamp, version
           │
           ├─ g.Go: k8sCollector.Collect(gctx)
           │   └─ [pkg/collector/k8s/k8s.go]
           │       ├─ Server version from /version
           │       ├─ Pod images from all namespaces
           │       ├─ ClusterPolicy from nvidia.com
           │       └─ Node info (first node)
           │
           ├─ g.Go: systemdCollector.Collect(gctx)
           │   └─ [pkg/collector/systemd/systemd.go]
           │       └─ D-Bus queries for containerd, docker, kubelet
           │       └─ **Graceful degradation** if D-Bus unavailable (PR #19)
           │
           ├─ g.Go: osCollector.Collect(gctx)
           │   └─ [pkg/collector/os/os.go]
           │       ├─ /proc/cmdline (grub params)
           │       ├─ /proc/modules (kmod)
           │       ├─ /proc/sys/* (sysctl)
           │       └─ /etc/os-release
           │
           ├─ g.Go: gpuCollector.Collect(gctx)
           │   └─ [pkg/collector/gpu/gpu.go]
           │       └─ nvidia-smi -q -x
           │       └─ **Graceful degradation** if nvidia-smi missing (PR #18)
           │
           ├─ g.Wait()
           │   └─ Fail-fast on first error
           │
           └─► n.Serializer.Serialize(ctx, snap)
               └─ Output to file, ConfigMap, or stdout

3.2 Recipe Command Call Graph

recipeCmd() [pkg/cli/recipe.go:21]
│
├─► Parse CLI flags
│   └─ serializer.Format(cmd.String("format"))
│
├─► recipe.NewBuilder(recipe.WithVersion(version))
│
├─ IF --snapshot provided:
│  │
│  ├─► serializer.FromFileWithKubeconfig[Snapshot](path, kubeconfig)
│  │   └─ Supports: file path, HTTP/HTTPS URL, cm://namespace/name
│  │
│  ├─► extractCriteriaFromSnapshot(snap) [recipe.go:170-268]
│  │   │
│  │   ├─ TypeK8s → Service detection
│  │   │   ├─ Check K8s.server.version for "-eks-", "-gke", "-aks"
│  │   │   └─ Map to CriteriaServiceEKS, etc.
│  │   │
│  │   ├─ TypeGPU → Accelerator detection
│  │   │   └─ Check gpu.model for "h100", "gb200", "a100", "l40"
│  │   │
│  │   └─ TypeOS → OS detection
│  │       └─ Check OS.release.ID
│  │
│  └─► applyCriteriaOverrides(cmd, criteria) [recipe.go:270-304]
│      └─ CLI flags override snapshot-extracted values
│
├─ ELSE:
│  └─► buildCriteriaFromCmd(cmd) [recipe.go:148-168]
│      └─► recipe.BuildCriteria(opts...)
│          └─ Validation: at least one criteria required (PR #14)
│
└─► builder.BuildFromCriteria(ctx, criteria) [pkg/recipe/builder.go:42-95]
    │
    ├─► loadMetadataStore(ctx) [pkg/recipe/metadata_store.go:39-135]
    │   ├─ fs.WalkDir(metadataFS, "data")
    │   │   └─ Embedded files: base.yaml + overlay/*.yaml
    │   ├─ Parse base.yaml
    │   │   └─ Default components, constraints
    │   └─ Parse overlay/*.yaml files
    │       └─ Environment-specific configurations
    │
    └─► store.BuildRecipeResult(ctx, criteria) [metadata_store.go:169-225]
        │
        ├─► store.FindMatchingOverlays(criteria)
        │   └─ For each overlay:
        │       └─ overlay.Spec.Criteria.Matches(criteria)
        │           └─ Specificity scoring (0-5 points)
        │
        ├─ Merge base with overlays (specificity order)
        │   └─ Lower specificity first, then higher
        │
        ├─ mergedSpec.ValidateDependencies()
        │   └─ Check all dependencyRefs resolve
        │
        ├─ mergedSpec.TopologicalSort()
        │   └─ Order by deploymentOrder
        │
        └─ Return RecipeResult with:
            ├─ Criteria (input + detected)
            ├─ ComponentRefs (with values, overrides)
            ├─ Constraints (validation rules)
            └─ Metadata (appliedOverlays, version)

3.3 Bundle Command Call Graph

bundleCmd() [pkg/cli/bundle.go:25]
│
├─► Parse CLI flags
│   ├─► config.ParseValueOverrides(--set flags)
│   │   └─ Parse "bundler:path.to.field=value" format
│   ├─► snapshotter.ParseNodeSelectors()
│   ├─► snapshotter.ParseTolerations()
│   └─ Validate deployer type: script, argocd, flux
│
├─► serializer.FromFile[RecipeResult](recipePath)
│   └─ **NOTE: Does NOT use kubeconfig flag** (H23)
│
├─► registry.NewFromGlobal(config) [pkg/bundler/registry/registry.go]
│   └─ Auto-registered bundlers via init():
│       ├─ certmanager [pkg/component/certmanager/]
│       ├─ gpuoperator [pkg/component/gpuoperator/]
│       ├─ networkoperator [pkg/component/networkoperator/]
│       ├─ nvsentinel [pkg/component/nvsentinel/]
│       └─ skyhook [pkg/component/skyhook/]
│
├─► bundler.New(opts...) [pkg/bundler/bundler.go:136-173]
│   └─ Apply overrides, node selectors, tolerations
│
└─► b.Make(ctx, recipe, outputDir) [bundler.go:180-244]
    │
    ├─ Validate input (non-nil recipe)
    │
    ├─ Create output directory
    │
    ├─► b.selectBundlers(input, types) [bundler.go:389-425]
    │   └─ If types empty, select all registered bundlers
    │
    ├─► b.makeParallel(ctx, input, dir, bundlers) [bundler.go:248-334]
    │   └─ errgroup.WithContext(ctx)
    │       └─ For each bundler (concurrent):
    │           └─► b.executeBundler(ctx, type, bundler, input, dir)
    │               ├─► bundler.Validate(ctx, input)
    │               │   └─ Check component exists in recipe
    │               └─► bundler.Make(ctx, input, dir)
    │                   ├─ GetComponentRef(name)
    │                   ├─ GetValuesForComponent(name)
    │                   │   └─ Merge: base → valuesFile → overrides → CLI --set
    │                   ├─ CreateBundleDir(subdir)
    │                   ├─ GenerateFileFromTemplate(values.yaml)
    │                   ├─ GenerateFileFromTemplate(install.sh)
    │                   ├─ GenerateFileFromTemplate(uninstall.sh)
    │                   ├─ GenerateFileFromTemplate(README.md)
    │                   └─ GenerateResult() with checksums
    │
    └─► b.createRootArtifacts(ctx, input, dir) [bundler.go:430-461]
        ├─► b.writeRecipeFile(recipe, dir)
        │   └─ Copy recipe.yaml to output
        └─► deployer.Generate(ctx, recipe, dir)
            ├─ ArgoCD: app-of-apps.yaml + Application CRs per component
            ├─ Flux: kustomization.yaml + HelmRelease CRs with dependsOn
            └─ Script: README.md with helm install commands

3.4 Validate Command Call Graph

validateCmd() [pkg/cli/validate.go:20]
│
├─► Parse CLI flags
│   └─ serializer.Format(cmd.String("format"))
│
├─► serializer.FromFileWithKubeconfig[RecipeResult](recipePath, kubeconfig)
│   └─ Supports: file, HTTP/HTTPS URL, cm://namespace/name
│
├─► serializer.FromFileWithKubeconfig[Snapshot](snapshotPath, kubeconfig)
│   └─ Supports: file, HTTP/HTTPS URL, cm://namespace/name
│
├─► validator.New(validator.WithVersion(version))
│
└─► v.Validate(ctx, recipe, snapshot) [pkg/validator/validator.go:49-108]
    │
    ├─► NewValidationResult()
    │
    ├─ For each recipe.Constraints:
    │   └─► v.evaluateConstraint(constraint, snap) [validator.go:111-185]
    │       │
    │       ├─► ParseConstraintPath(constraint.Name)
    │       │   └─ Split "{Type}.{Subtype}.{Key}"
    │       │
    │       ├─► path.ExtractValue(snap)
    │       │   └─ Find matching measurement.subtype.data[key]
    │       │
    │       ├─► ParseConstraintExpression(constraint.Value)
    │       │   └─ Parse operators: >=, <=, ==, !=, >, <
    │       │
    │       └─► parsed.Evaluate(actual)
    │           └─ Version comparison or string match
    │
    ├─ Calculate summary:
    │   ├─ Passed count
    │   ├─ Failed count
    │   ├─ Skipped count (missing data)
    │   └─ Overall status: pass/fail/partial
    │
    └─ Return ValidationResult

4. Agent Deployment System

4.1 Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    User Workstation                              │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │  cnsctl snapshot --deploy-agent                              ││
│  │    │                                                        ││
│  │    ├─► CheckPermissions() ─ SelfSubjectAccessReview         ││
│  │    │   └─ Verifies: create configmaps, get nodes, etc.      ││
│  │    │                                                        ││
│  │    ├─► Deploy() ─ Create RBAC + Job                         ││
│  │    │   ├─ ServiceAccount, Role, RoleBinding (namespaced)    ││
│  │    │   ├─ ClusterRole, ClusterRoleBinding (cluster-scoped)  ││
│  │    │   │   └─ **HARDCODED names** (H24)                     ││
│  │    │   └─ Job with privileged pod                           ││
│  │    │                                                        ││
│  │    ├─► WaitForJobCompletion() ─ Watch Job status            ││
│  │    │   ├─ WaitForPodReady() with error detection            ││
│  │    │   │   └─ CrashLoopBackOff, ImagePullBackOff, etc.      ││
│  │    │   └─ StreamLogs() with [agent] prefix                  ││
│  │    │                                                        ││
│  │    ├─► GetSnapshot() ─ Read from ConfigMap                  ││
│  │    │   └─ Parse YAML from data.snapshot.yaml                ││
│  │    │                                                        ││
│  │    └─► Cleanup() ─ Delete resources (if --cleanup=true)     ││
│  │        └─ Attempts all deletions, reports errors            ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                   Kubernetes Cluster                             │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Namespace: gpu-operator (default)                         │  │
│  │                                                            │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐       │  │
│  │  │ServiceAcct  │  │    Role     │  │ RoleBinding  │       │  │
│  │  │   "cns"     │  │   "cns"     │  │    "cns"     │       │  │
│  │  └─────────────┘  └─────────────┘  └──────────────┘       │  │
│  │                                                            │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │                   Job "cns"                          │  │  │
│  │  │  ┌───────────────────────────────────────────────┐  │  │  │
│  │  │  │  Pod (privileged, hostPID/Net/IPC, root)     │  │  │  │
│  │  │  │    ├─ GPU Collector (nvidia-smi)             │  │  │  │
│  │  │  │    ├─ K8s Collector (API client)             │  │  │  │
│  │  │  │    ├─ OS Collector (/proc, /etc)             │  │  │  │
│  │  │  │    └─ SystemD Collector (D-Bus)              │  │  │  │
│  │  │  └───────────────────────────────────────────────┘  │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  │                         │                                  │  │
│  │                         ▼                                  │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │         ConfigMap "cns-snapshot"                     │  │  │
│  │  │  labels:                                             │  │  │
│  │  │    app.kubernetes.io/name: cns                       │  │  │
│  │  │    app.kubernetes.io/component: snapshot             │  │  │
│  │  │    app.kubernetes.io/version: <version>              │  │  │
│  │  │  data:                                               │  │  │
│  │  │    snapshot.yaml: "<YAML content>"                   │  │  │
│  │  │    format: yaml                                      │  │  │
│  │  │    timestamp: "2026-01-14T10:30:00Z"                 │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Cluster-Scoped Resources                                  │  │
│  │  ┌───────────────────┐  ┌─────────────────────────────┐   │  │
│  │  │    ClusterRole    │  │    ClusterRoleBinding       │   │  │
│  │  │ "cns-node-reader" │  │   "cns-node-reader"         │   │  │
│  │  │   (HARDCODED!)    │  │     (HARDCODED!)            │   │  │
│  │  └───────────────────┘  └─────────────────────────────┘   │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

4.2 RBAC Permissions

Namespace-Scoped Role (pkg/k8s/agent/rbac.go:27-49)

| Resource | Verbs | Purpose |
|----------|-------|---------|
| configmaps | create, get, update, patch | Store snapshot data |
| pods | get, list | Monitor Job pod status |
| pods/log | get | Stream pod logs |

Cluster-Scoped ClusterRole (pkg/k8s/agent/rbac.go:79-110)

| Resource | API Group | Verbs | Purpose |
|----------|-----------|-------|---------|
| nodes | "" | get, list | Query node info |
| pods | "" | get, list | List all pods |
| services | "" | get, list | List all services |
| clusterpolicies | nvidia.com | get, list | NVIDIA GPU policies |

Issue H24: ClusterRole/ClusterRoleBinding names are hardcoded to "cns-node-reader"

  • Cannot customize via --job-name or --service-account-name
  • Multiple concurrent deployments in different namespaces share same cluster resources
  • Cleanup in one namespace may affect another

4.3 Security Context (pkg/k8s/agent/job.go:80-138)

| Setting | Value | Security Implication |
|---------|-------|----------------------|
| runAsUser | 0 (root) | Full system access |
| privileged | true | Bypass container isolation |
| hostPID | true | See all host processes |
| hostNetwork | true | Access host network |
| hostIPC | true | Access host IPC |
| capabilities | SYS_ADMIN, SYS_CHROOT | System-level operations |

Why Privileged is Required:

  • nvidia-smi: Requires access to GPU devices
  • D-Bus: Requires access to system D-Bus socket
  • /proc files: Requires host PID namespace
  • SystemD properties: Requires host IPC namespace
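For reference, the settings in the table above correspond to a pod spec fragment along these lines. This is an illustrative reconstruction, not the literal output of buildJob() in pkg/k8s/agent/job.go:

```yaml
# Illustrative agent pod security settings (see section 4.3).
spec:
  hostPID: true
  hostNetwork: true
  hostIPC: true
  containers:
    - name: agent
      securityContext:
        runAsUser: 0
        privileged: true
        capabilities:
          add: ["SYS_ADMIN", "SYS_CHROOT"]
```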

Our PR #27 adds --privileged flag to allow unprivileged mode for PSS-restricted clusters.

4.4 Resource Requirements (pkg/k8s/agent/job.go:97-107)

| Resource | Request | Limit |
|----------|---------|-------|
| CPU | 1 | 2 |
| Memory | 4Gi | 8Gi |
| Ephemeral Storage | 2Gi | 4Gi |

Issue M22: These values are hardcoded, no flags to customize.

4.5 Pod Error Detection (pkg/k8s/agent/wait.go:96-147)

The wait logic now detects these pod failure conditions:

  • ImagePullBackOff
  • ErrImagePull
  • InvalidImageName
  • CrashLoopBackOff
  • CreateContainerError
  • CreateContainerConfigError
  • RunContainerError

Each returns a clear error message with the reason.
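The essence of this check is mapping a container's waiting reason to a hard error instead of letting the wait loop run until the timeout. A sketch with illustrative names, not those used in pkg/k8s/agent/wait.go:

```go
package main

import (
	"errors"
	"fmt"
)

// fatalWaitingReasons mirrors the failure states listed above.
var fatalWaitingReasons = map[string]bool{
	"ImagePullBackOff":           true,
	"ErrImagePull":               true,
	"InvalidImageName":           true,
	"CrashLoopBackOff":           true,
	"CreateContainerError":       true,
	"CreateContainerConfigError": true,
	"RunContainerError":          true,
}

// checkWaitingReason returns an error for terminal pod states and nil for
// transient ones (e.g. ContainerCreating), so the caller can fail fast.
func checkWaitingReason(reason string) error {
	if fatalWaitingReasons[reason] {
		return errors.New("agent pod failed: " + reason)
	}
	return nil
}

func main() {
	fmt.Println(checkWaitingReason("ImagePullBackOff"))
	fmt.Println(checkWaitingReason("ContainerCreating"))
}
```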

4.6 Security Considerations

Attack Surface:

  1. Privileged container - Can escape container to host
  2. Host namespace access - Can observe all system activity
  3. Root execution - Full node access
  4. RBAC persistence - Cluster-scoped resources persist if cleanup fails

Mitigations:

  1. Permission check before deployment (CheckPermissions())
  2. Automatic cleanup on completion (default: enabled)
  3. Resource limits prevent DoS
  4. Single execution (BackoffLimit: 0)
  5. Hard timeout (ActiveDeadlineSeconds: 18000 = 5 hours)
  6. Pod error detection with clear messages

5. Recipe System

5.1 Recipe Structure (RecipeResult)

kind: RecipeResult
apiVersion: cns.nvidia.com/v1alpha1
metadata:
  generatedAt: "2026-01-14T10:00:00Z"
  version: "v0.19.0"
  appliedOverlays:
    - gb200-eks-ubuntu-training
criteria:
  service: eks
  accelerator: gb200
  intent: training
  os: ubuntu
  nodes: 8
componentRefs:
  - name: cert-manager
    type: Helm
    chart: cert-manager
    version: v1.16.2
    repository: https://charts.jetstack.io
    namespace: cert-manager
    deploymentOrder: 1
    valuesFile: components/cert-manager/values.yaml
    overrides:
      installCRDs: true
  - name: gpu-operator
    type: Helm
    chart: gpu-operator
    version: v25.3.4
    repository: https://helm.ngc.nvidia.com/nvidia
    namespace: gpu-operator
    deploymentOrder: 2
    valuesFile: components/gpu-operator/eks-gb200-training.yaml
    dependencyRefs:
      - cert-manager
constraints:
  - name: K8s.server.version
    value: ">= 1.32"
  - name: OS.release.ID
    value: ubuntu
deploymentOrder:
  - cert-manager
  - gpu-operator
  - network-operator
  - nvsentinel
  - skyhook

5.2 Overlay System

File Structure:

pkg/recipe/data/
├── base.yaml                      # Default components and settings
├── gb200-eks-ubuntu-training.yaml # Overlay
├── h100-eks-ubuntu-training.yaml  # Overlay
└── h100-ubuntu-inference.yaml     # Overlay (3 overlay files currently)

Matching Algorithm:

  1. Load all overlays from pkg/recipe/data/*.yaml
  2. For each overlay, check if criteria matches request
  3. Collect all matching overlays
  4. Sort by specificity score (ascending)
  5. Merge: base → less specific → more specific

Specificity Scoring:

  • Each non-"any" field adds 1 point
  • Fields: service, accelerator, intent, os (nodes is optional)
  • Score range: 0-4 (or 0-5 with nodes)

Example:

Query: { service: eks, accelerator: gb200, os: ubuntu, intent: training }

Overlay 1: { service: eks }                          → Score 1, MATCH
Overlay 2: { service: eks, accelerator: gb200 }      → Score 2, MATCH
Overlay 3: { accelerator: h100 }                     → Score 1, NO MATCH

Merge order: base → Overlay 1 → Overlay 2
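The matching and scoring rules above can be sketched as follows; the criteria type and matches helper are illustrative, not the recipe package's own identifiers:

```go
package main

import "fmt"

// criteria holds the four scored fields; "any" (or empty) acts as a wildcard.
type criteria struct {
	Service, Accelerator, Intent, OS string
}

// matches reports whether an overlay's criteria accept the query and, if so,
// its specificity score: one point per concrete (non-wildcard) field.
func matches(overlay, query criteria) (bool, int) {
	score := 0
	for _, pair := range [][2]string{
		{overlay.Service, query.Service},
		{overlay.Accelerator, query.Accelerator},
		{overlay.Intent, query.Intent},
		{overlay.OS, query.OS},
	} {
		want, got := pair[0], pair[1]
		if want == "" || want == "any" {
			continue // wildcard: matches anything, scores nothing
		}
		if want != got {
			return false, 0
		}
		score++
	}
	return true, score
}

func main() {
	q := criteria{Service: "eks", Accelerator: "gb200", Intent: "training", OS: "ubuntu"}
	ok, s := matches(criteria{Service: "eks", Accelerator: "gb200"}, q)
	fmt.Println(ok, s) // true 2
}
```

Running this for Overlay 2 from the example ({service: eks, accelerator: gb200}) against the query yields a match with score 2, matching the table above.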

5.3 Coverage Analysis

Supported Criteria Values:

| Criteria | Values | Count |
|----------|--------|-------|
| Services | eks, gke, aks, oke | 4 |
| Accelerators | h100, gb200, a100, l40 | 4 |
| Intents | training, inference | 2 |
| OS | ubuntu, rhel, cos, amazonlinux | 4 |

Total Specific Combinations: 4 × 4 × 2 × 4 = 128

Current Overlays (3 files):

| Overlay | Service | Accelerator | OS | Intent | Specificity |
|---------|---------|-------------|----|--------|-------------|
| gb200-eks-ubuntu-training | eks | gb200 | ubuntu | training | 4/4 |
| h100-eks-ubuntu-training | eks | h100 | ubuntu | training | 4/4 |
| h100-ubuntu-inference | any | h100 | ubuntu | inference | 3/4 |

Coverage: 3/128 = 2.34% (Issue C2)

5.4 Coverage Gaps

| Gap Category | Missing Combinations | Count |
|--------------|----------------------|-------|
| A100 accelerator | All A100 combinations | 32 |
| L40 accelerator | All L40 combinations | 32 |
| GB200 non-EKS | gke/aks/oke + gb200 | 24 |
| GB200 inference | Any service + gb200 + inference | 16 |
| Non-Ubuntu OS | rhel/cos/amazonlinux + any | 96 |
| H100 training non-EKS | gke/aks/oke + h100 + training | 3 |
| GKE service | All GKE combinations | 32 |
| AKS service | All AKS combinations | 32 |
| OKE service | All OKE combinations | 32 |

Impact: Most user queries fall back to base configuration only, missing environment-specific optimizations.


6. Bundler System

6.1 Component Bundlers

| Bundler | Bundle Type | Key Outputs |
|---------|-------------|-------------|
| cert-manager | cert-manager | values.yaml, install.sh, README |
| gpu-operator | gpu-operator | values.yaml, clusterpolicy.yaml, scripts |
| network-operator | network-operator | values.yaml, scripts, README |
| skyhook | skyhook | values.yaml, customization CRs, scripts |
| nvsentinel | nvsentinel | values.yaml, scripts, README |

6.2 Bundler Registration Pattern

Each bundler self-registers via init():

// pkg/component/gpuoperator/bundler.go
func init() {
    registry.MustRegister(Name, NewBundler())
}

const Name = types.BundleType("gpu-operator")

Bundler Interface:

type Bundler interface {
    Type() BundleType
    Make(ctx context.Context, input *recipe.RecipeResult, outputDir string) (*Result, error)
}

type ValidatableBundler interface {
    Bundler
    Validate(ctx context.Context, input *recipe.RecipeResult) error
}

6.3 Value Override System (--set)

Format: --set bundler:path.to.field=value

Merge Precedence (lowest to highest):

  1. Base values (from recipe data)
  2. valuesFile content
  3. Recipe overrides field
  4. CLI --set flags
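This precedence chain behaves like repeated map overlays where later layers win. A flat-map sketch of the idea; real Helm values are nested and merged recursively, so this is an assumption-laden simplification:

```go
package main

import "fmt"

// mergeValues overlays each later map onto the accumulated result, so callers
// pass layers in precedence order: base, valuesFile, recipe overrides, --set.
func mergeValues(layers ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, layer := range layers {
		for k, v := range layer {
			out[k] = v // later layers win
		}
	}
	return out
}

func main() {
	base := map[string]string{"driver.version": "570.133.20", "cdi.enabled": "false"}
	recipeOverrides := map[string]string{"cdi.enabled": "true"}
	cliSet := map[string]string{"mig.strategy": "mixed"}
	fmt.Println(mergeValues(base, recipeOverrides, cliSet))
}
```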

Example Paths by Bundler:

GPU Operator

gpuoperator:operator.nodeSelector=key=value
gpuoperator:daemonsets.nodeSelector=key=value
gpuoperator:dcgmExporter.config.create=true
gpuoperator:gds.enabled=true
gpuoperator:driver.version=570.133.20
gpuoperator:cdi.enabled=true
gpuoperator:mig.strategy=mixed

Network Operator

networkoperator:operator.repository=myregistry.com
networkoperator:ofedDriver.version=23.04
networkoperator:ofedDriver.deploy=true
networkoperator:rdma.enabled=true
networkoperator:sriov.enabled=true

Cert-Manager

certmanager:installCRDs=true
certmanager:nodeSelector=key=value
certmanager:tolerations=...
certmanager:webhook.nodeSelector=key=value

Skyhook

skyhook:manager.resources.cpu.limit=2
skyhook:manager.resources.memory.limit=2Gi
skyhook:customization=ubuntu
skyhook:controllerManager.selectors=key=value

NVSentinel

nvsentinel:namespace=nvsentinel
nvsentinel:sentinel.enabled=true
nvsentinel:sentinel.logLevel=info
nvsentinel:global.systemNodeSelector=key=value

Limitations:

  • No array index override syntax (e.g., tolerations[0].key=value)
  • No wildcard paths
  • Type conversion is automatic (strings → bool/int where appropriate)
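Within these limits, the basic --set syntax splits cleanly on the first ':' and the first '='. An illustrative sketch; the real parser is config.ParseValueOverrides, and parseSet here is a hypothetical stand-in:

```go
package main

import (
	"fmt"
	"strings"
)

// parseSet splits a --set argument "bundler:path.to.field=value" into its
// three parts. Cutting at the first '=' keeps values like "key=value" intact.
func parseSet(arg string) (bundler, path, value string, err error) {
	bundler, rest, ok := strings.Cut(arg, ":")
	if !ok {
		return "", "", "", fmt.Errorf("missing bundler prefix in %q", arg)
	}
	path, value, ok = strings.Cut(rest, "=")
	if !ok {
		return "", "", "", fmt.Errorf("missing '=value' in %q", arg)
	}
	return bundler, path, value, nil
}

func main() {
	b, p, v, err := parseSet("gpuoperator:driver.version=570.133.20")
	if err != nil {
		panic(err)
	}
	fmt.Println(b, p, v) // gpuoperator driver.version 570.133.20
}
```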

6.4 Deployer Types

| Type | Outputs | Use Case |
|------|---------|----------|
| script | README with helm commands, install.sh | Manual deployment |
| argocd | app-of-apps.yaml, Application CRs | GitOps with ArgoCD |
| flux | kustomization.yaml, HelmRelease CRs | GitOps with Flux |

Deployment Order Handling:

  • Script: Documents order in README
  • ArgoCD: Uses argocd.argoproj.io/sync-wave annotations
  • Flux: Uses spec.dependsOn fields

7. Collector System

7.1 Factory Pattern

File: pkg/collector/factory.go

type Factory interface {
    CreateSystemDCollector() Collector
    CreateOSCollector() Collector
    CreateKubernetesCollector() Collector
    CreateGPUCollector() Collector
}

7.2 Collector Details

Collector Data Sources Key Outputs Graceful Degradation
GPU nvidia-smi -q -x driver version, CUDA, GPU model, memory, count Yes (PR #18) - returns gpu.count=0
K8s Kubernetes API server version, images, policies, node info No - requires API access
OS /proc, /etc kernel, OS release, sysctl, modules No - requires /proc access
SystemD D-Bus service status (containerd, docker, kubelet) Yes (PR #19) - empty if D-Bus unavailable

7.3 GPU Collector Details (pkg/collector/gpu/gpu.go)

Data Collection:

  1. Execute nvidia-smi -q -x
  2. Parse XML output
  3. Extract:
    • Driver version
    • CUDA version
    • GPU count
    • GPU model (per-GPU)
    • Memory info
    • MIG configuration

Graceful Degradation (since PR #18):

if errors.Is(err, exec.ErrNotFound) || os.IsNotExist(err) {
    slog.Warn("nvidia-smi not found, returning empty GPU measurements")
    return &measurement.Measurement{
        Type: measurement.TypeGPU,
        Subtypes: []measurement.Subtype{{
            Name: "smi",
            Data: map[string]measurement.Reading{
                "gpu.count": measurement.Int(0),
            },
        }},
    }, nil
}

7.4 K8s Collector Details (pkg/collector/k8s/k8s.go)

Data Collection:

  1. Get server version from /version endpoint
  2. List all pods, extract unique images
  3. Get ClusterPolicy CRDs (nvidia.com/v1)
  4. Get node info (first node only)

Limitations:

  • Collects info for only one node (per M28, this is by design: the agent reports the current node via the NODE_NAME env var)
  • Lists ALL pods across ALL namespaces (can be slow on large clusters)
  • No pagination for pod listing

7.5 OS Collector Details (pkg/collector/os/os.go)

Data Collection:

  1. /proc/cmdline → GRUB boot parameters
  2. /proc/modules → Loaded kernel modules
  3. /proc/sys/* → Sysctl parameters
  4. /etc/os-release → OS identification

Platform Assumptions:

  • Hardcoded Linux paths
  • Won't work on Windows or macOS (intentional - GPU nodes are Linux)

7.6 SystemD Collector Details (pkg/collector/systemd/systemd.go)

Data Collection:

  1. Connect to system D-Bus
  2. Query properties for:
    • containerd.service
    • docker.service
    • kubelet.service

Graceful Degradation (since PR #19):

  • Returns empty measurements if D-Bus unavailable
  • Logs warning but doesn't fail

7.7 Measurement Structure

type: GPU  # or K8s, OS, SystemD
subtypes:
  - name: smi
    data:
      driver-version: "570.133.20"
      cuda-version: "12.8"
      gpu.count: 8
      gpu.model: "NVIDIA H100"
    context:
      driver-version: "NVIDIA driver version installed on the system"

8. Serializer System

8.1 Output Formats

Format Extension Description Read Support Write Support
json .json Pretty-printed JSON Yes Yes
yaml .yaml YAML with 2-space indent Yes Yes
table - Flattened key-value table No Yes

8.2 Output Destinations

Destination Format Example Implementation
File path /tmp/snapshot.yaml writer.go:NewFileWriter
ConfigMap cm://namespace/name cm://default/cns-snapshot configmap.go
HTTP URL https://... https://example.com/snap.yaml http.go (read only)
Stdout - or empty Default writer.go:NewStdoutWriter

8.3 ConfigMap Storage (pkg/serializer/configmap.go)

Write Flow (after PR #32 - Server-Side Apply):

func (w *ConfigMapWriter) Serialize(ctx context.Context, data any) error {
    // 1. Marshal data to YAML/JSON
    content, err := serializeYAML(data)
    if err != nil {
        return err
    }

    // 2. Build ConfigMap apply configuration
    configMap := accorev1.ConfigMap(w.name, w.namespace).
        WithLabels(map[string]string{
            "app.kubernetes.io/name":      "cns",
            "app.kubernetes.io/component": "snapshot",
            "app.kubernetes.io/version":   version,
        }).
        WithData(map[string]string{
            "snapshot.yaml": string(content),
            "format":        string(w.format),
            "timestamp":     time.Now().UTC().Format(time.RFC3339),
        })

    // 3. Atomic Server-Side Apply (creates or updates)
    _, err = client.CoreV1().ConfigMaps(w.namespace).Apply(
        ctx,
        configMap,
        metav1.ApplyOptions{FieldManager: "cnsctl"},
    )
    return err
}

Key improvement: PR #32 replaced the race-prone Get-then-Create/Update pattern with atomic Server-Side Apply (SSA), eliminating the lost-update risk under concurrent writes.

Issues (Status):

  • Race condition: Get-then-Create was not atomic ✅ Fixed by PR #32 (SSA)
  • Context timeout: writes are bounded by a 30-second timeout, so they cannot block indefinitely
  • Silent fallback: invalid paths silently fell back to stdout ✅ Fixed by PR #24 (now returns an error)

8.4 URI Parsing

func ParseURI(uri string) (scheme, namespace, name string, err error) {
    // Supports:
    // - cm://namespace/name     → ConfigMap
    // - https://example.com/... → HTTP
    // - /path/to/file           → File
    // - -                       → Stdout
}

9. Documentation Analysis

9.1 Documentation Structure

docs/
├── OVERVIEW.md                    # High-level product overview
├── architecture/
│   ├── README.md                  # Architecture overview (1264 lines!)
│   ├── api-server.md              # API server architecture
│   ├── cli.md                     # CLI architecture
│   ├── component.md               # Bundler component guide
│   └── data.md                    # Recipe data architecture
├── demos/
│   ├── e2e.md                     # End-to-end demo
│   └── s3c.md                     # S3C demo
├── integration/
│   ├── api-reference.md           # API reference (695 lines)
│   ├── automation.md              # CI/CD integration
│   ├── data-flow.md               # Data flow documentation
│   ├── kubernetes-deployment.md   # K8s deployment guide
│   └── recipe-development.md      # Recipe development guide
└── user-guide/
    ├── agent-deployment.md        # Agent deployment guide (900 lines)
    ├── api-reference.md           # User-facing API reference
    ├── cli-reference.md           # CLI reference (900 lines)
    └── installation.md            # Installation guide

9.2 Documentation Quality Assessment

Document Lines Quality Issues
architecture/README.md 1264 Excellent Very comprehensive, good diagrams
user-guide/cli-reference.md 900 Excellent Complete flag documentation
user-guide/agent-deployment.md 900 Good Fixed in PR #15
integration/api-reference.md 695 Good Complete API documentation
architecture/data.md 865 Excellent Detailed overlay system explanation
integration/recipe-development.md 650 Good Helpful for contributors

9.3 Documentation Findings

Strengths:

  • Comprehensive CLI reference with all flags documented
  • Good architecture documentation with mermaid diagrams
  • Clear examples in most documents
  • Recipe data architecture well explained

Gaps:

  • No changelog (M26)
  • No quick start guide (L12)
  • No troubleshooting guide beyond basic tips
  • Some documents reference draft features

10. Build System Analysis

10.1 Makefile Targets

File: Makefile (145 lines)

Target Description Dependencies
info Print project info -
tidy Update Go modules -
upgrade Upgrade all dependencies -
lint Lint Go and YAML lint-go, lint-yaml
lint-go Run golangci-lint -
lint-yaml Run yamllint -
test Run unit tests with race detector -
e2e Run integration tests tools/e2e
scan Vulnerability scan (go vet + grype) -
qualify Full qualification test, lint, e2e, scan
server Start development server -
docs Serve Go documentation -
build Build release binaries tidy
image Build and push container image -
release Run goreleaser -
bump-major/minor/patch Version bumping tools/bump
clean Clean directories -
help Show available targets -

10.2 Build Configuration

Go Version: Uses go env GOVERSION (documented in info)

Linting:

  • golangci-lint with .golangci.yaml config
  • yamllint with .yamllint.yaml config

Release:

  • goreleaser with .goreleaser.yaml config
  • Multi-platform binaries (darwin/linux, amd64/arm64)

Container Images:

  • Built with ko
  • Registry: ghcr.io/nvidia (configurable via IMAGE_REGISTRY)
  • Tag: latest (configurable via IMAGE_TAG)

10.3 Deployment YAMLs

Location: deployments/cns-agent/

File Purpose
1-deps.yaml RBAC resources (SA, Role, RoleBinding, ClusterRole, ClusterRoleBinding)
2-job.yaml Job manifest for agent deployment

1-deps.yaml Analysis:

  • Creates namespace-scoped RBAC (cns service account, role, rolebinding)
  • Creates cluster-scoped RBAC (cns-node-reader clusterrole, clusterrolebinding)
  • Includes secret list permission (potential security concern)

2-job.yaml Analysis:

  • Uses hardcoded nodeSelector: nodeGroup: customer-gpu
  • Uses specific tolerations for dedicated=user-workload
  • Image: ghcr.io/mchmarny/cns:latest (corrected to ghcr.io/nvidia/cns:latest by PR #35)
  • Privileged security context

Issues Found:

  • Image pointed to the mchmarny fork instead of nvidia ✅ Fixed by PR #35 (M27)
  • NodeSelector is environment-specific
  • Tolerations are environment-specific

11. Issue Catalog

11.1 Critical Issues (1 open, 3 fixed, 1 wontfix)

C1: Privileged Container Required for Snapshot

Status: ✅ FIXED (PR #27) File: pkg/k8s/agent/job.go:62-70 Impact: Cannot deploy on PSS-restricted clusters without exemption

Context: The agent Job requires privileged: true security context to:

  1. Access nvidia-smi for GPU metrics
  2. Read D-Bus socket for SystemD service status
  3. Access /proc files with host PID namespace

Why It Matters: Many enterprise Kubernetes clusters enforce Pod Security Standards (PSS) at "restricted" or "baseline" level, which prohibit privileged containers. This prevents CNS agent deployment without cluster policy exceptions.

Fix (PR #27): Adds --privileged flag (default: true) allowing --privileged=false for PSS-restricted environments. In unprivileged mode, GPU and SystemD collectors return empty/degraded results.


C2: Only 2.34% Overlay Coverage

Status: Open File: pkg/recipe/data/*.yaml Impact: Most configurations use base-only settings

Context: With only 3 overlay files covering 3/128 possible criteria combinations, most user queries fall through to the base configuration without environment-specific optimizations.

Why It Matters: The value proposition of CNS is hardware-aware, environment-specific configuration generation. Without overlays for A100, L40, GKE, AKS, OKE, or non-Ubuntu OS, users get generic configurations that may not be optimal for their environment.

Missing Coverage:

  • A100 GPUs (common in existing deployments)
  • L40 GPUs (common for inference)
  • GKE, AKS, OKE platforms (major cloud providers)
  • RHEL, COS, Amazon Linux (common enterprise OSes)
  • Inference workloads on GB200

C3: RBAC Cleanup May Fail Silently

Status: ✅ FIXED (PR #16) File: pkg/k8s/agent/deployer.go:72-117 Fix: Now attempts all deletions and reports errors


C4: No Validation for Unsupported Criteria Combos

Status: ✅ FIXED (PR #14) File: pkg/cli/recipe.go:110-112 Fix: Validates at least one criteria is provided


C5: No Bundle Validation Before Write

Status: ⏸️ WONTFIX Rationale: Input validation exists; output validation adds complexity without clear benefit


11.2 High Priority Issues (0 open, 12 fixed, 7 wontfix)

H1: --format Had Unintuitive -t Alias

Status: ✅ FIXED (PR #12) Fix: Now uses -f consistently


H2: No Short Alias for --deploy-agent

Status: ⏸️ WONTFIX Rationale: Verbosity is intentional for safety. --deploy-agent has significant side effects (creates K8s Job, RBAC, runs containers). Short aliases like -a or -d make accidental deployment too easy. The flag is typically used in scripts where verbosity doesn't hurt UX.


H6: No Progress Indicator During Job Wait

Status: ✅ PARTIALLY FIXED (PR #20) Fix: Log streaming now provides real-time output with [agent] prefix


H7: Criteria Validation Happens Late

Status: ⏸️ WONTFIX Rationale: Without --snapshot, validation is already immediate. With --snapshot, the snapshot must be loaded anyway to extract criteria. Moving enum validation to flag parsing requires custom flag types in urfave/cli v3, adding significant complexity for a narrow edge case.


H8: No Warning When Using Base-Only Config

Status: ✅ FIXED (PR #31) File: pkg/recipe/metadata_store.go:199-205

Fix: Added slog.Warn() when no overlays match criteria. Warning includes criteria used and hint about potential optimization gap. Example output: no environment-specific overlays matched, using base configuration only


H11: Bundler Name Case-Sensitive

Status: ✅ FIXED (PR #17) Fix: Now case-insensitive with typo suggestions


H12: No Suggestions for Failed Constraints

Status: ⏸️ WONTFIX Rationale: Constraint failures are environment-specific; generic suggestions would be misleading


H13: Exit Code Always 0 Unless --fail-on-error

Status: ✅ FIXED (PR #30) File: pkg/cli/validate.go

Fix: Changed --fail-on-error to default to true. Users can opt-out with --fail-on-error=false for informational mode.


H14: GPU Collector Fails Silently if nvidia-smi Missing

Status: ✅ FIXED (PR #18) Fix: Graceful degradation, returns gpu.count=0


H15: SystemD Collector Requires D-Bus Access

Status: ✅ FIXED (PR #19) Fix: Graceful degradation when D-Bus unavailable


H16: ConfigMap Write Silently Falls Back to Stdout

Status: ✅ FIXED (PR #24) File: pkg/serializer/writer.go:34-67

Fix: Returns an error instead of silent fallback when ConfigMap URI is invalid or inaccessible.


H17: agent-deployment.md Had Inaccuracies

Status: ✅ FIXED (PR #15)


H19: Output Flag -o Means File for Some Commands, Directory for Others

Status: ⏸️ WONTFIX Rationale: Changing this would break existing workflows; documented behavior


H20: Format Validation Happens in Action, Not Flag

Status: ⏸️ WONTFIX File: pkg/cli/snapshot.go:121-124 (and similar) Impact: Late error discovery

Context: Format validation (yaml, json, table) happens after the command starts executing, not during flag parsing.

Rationale: Validation happens as the first operation in Action handlers, so the practical impact is minimal. No expensive operations run before format validation.


H21: Job Logs Not Streamed During Wait

Status: ✅ FIXED (PR #20) Fix: Logs now streamed with [agent] prefix


H22: Recipe Command --snapshot Doesn't Support All URI Types

Status: ⏸️ WONTFIX File: pkg/cli/recipe.go:84-90 Impact: Inconsistent URI support

Context: The recipe command's --snapshot flag already supports file paths, HTTP/HTTPS URLs, and ConfigMap URIs.

Rationale: Issue overstated - the flag documentation already clearly states "Supports: file paths, HTTP/HTTPS URLs, or ConfigMap URIs" and error messages are reasonably specific.


H23: Bundle Command Missing --kubeconfig Flag

Status: ✅ FIXED (PR #29) File: pkg/cli/bundle.go

Fix: Added kubeconfigFlag to the bundle command and uses FromFileWithKubeconfig to load recipes. Enables loading recipes from ConfigMap URIs.


H24: ClusterRole/ClusterRoleBinding Names Hardcoded

Status: ⏸️ WONTFIX Rationale: ClusterRole/ClusterRoleBinding are cluster-scoped and intentionally shared. Having a single "cns-node-reader" role is simpler and avoids role proliferation. The permissions are read-only and safe to share across namespaces.


H25: ConfigMap Race Condition

Status: ✅ FIXED (PR #32) File: pkg/serializer/configmap.go:109-132

Fix: Replaced Get-then-Create/Update with Kubernetes Server-Side Apply (SSA). Single atomic operation handles both create and update. Field ownership tracked via FieldManager: "cnsctl".


11.3 Medium Priority Issues (8 open, 4 fixed, 4 wontfix)

ID Category Issue Status
M1 CLI No command aliases (e.g., snap for snapshot) Open
M2 CLI Help text formatting inconsistent Open
M3 CLI No examples in command help ✅ FIXED (PR #34)
M4 CLI Error messages don't suggest fixes Open
M5 CLI No progress output for long operations ✅ PARTIALLY FIXED (PR #22)
M6 CLI --kubeconfig shown for all commands but not always used ⏸️ WONTFIX (inaccurate - all commands that have it use it; bundle missing it is H23)
M7 CLI Completion command hidden ✅ FIXED (PR #8)
M8 Recipe Overlay files not validated at load time Open
M9 Recipe No dry-run mode Open
M11 Bundle No component dependency visualization Open
M18 Collector OS collector assumes Linux paths ⏸️ WONTFIX (Linux-only is intentional - tool is for Linux GPU nodes)
M21 Agent Job name collisions possible ⏸️ WONTFIX
M22 Agent No resource limit customization flags Open
M26 Docs No changelog Open
M27 Build deployments/cns-agent/2-job.yaml uses fork image registry ✅ FIXED (PR #35)
M28 K8s Collector Only collects first node info ⏸️ WONTFIX (by design - collects current node via NODE_NAME env var)

11.4 Low Priority Issues (18 open, 0 fixed, 0 wontfix)

ID Category Issue Status
L1 CLI Version output format not customizable Open
L2 CLI No shell completion for flag values Open
L3 CLI Debug output very verbose Open
L4 Recipe Component versions hardcoded in overlays Open
L5 Bundle README templates not customizable Open
L6 Bundle Script templates assume bash Open
L7 Validate No constraint grouping in output Open
L8 Collector Metrics exposed but not documented Open
L9 Serializer No compression option Open
L10 Agent Labels not customizable Open
L11 Agent No annotations support Open
L12 Docs No quick start guide Open
L13 Docs No comparison with alternatives Open
L14 Docs No video tutorials Open
L15 CLI No quiet mode Open
L16 Bundle Silently overwrites existing output directory Open (E2E)
L17 CLI Local snapshot on macOS doesn't suggest --deploy-agent Open (E2E)
L18 CLI Mixed stdout/stderr output ordering Open (E2E)

12. UX Improvement Roadmap

Phase 1: Quick Wins (Low Effort, High Impact)

  1. Fix H23: Add missing --kubeconfig flag to bundle command ✅ MERGED (PR #29)
  2. Fix M27: Update deployments/2-job.yaml to use correct image registry ✅ MERGED (PR #35)
  3. Fix H16: Return error instead of silent fallback for ConfigMap writes ✅ MERGED (PR #24)
  4. Fix C1: Add --privileged flag for PSS compliance ✅ MERGED (PR #27)
  5. Fix H13: Default --fail-on-error to true ✅ MERGED (PR #30)
  6. Fix H8: Warn when using base-only config ✅ MERGED (PR #31)
  7. Fix H25: Use SSA for atomic ConfigMap updates ✅ MERGED (PR #32)
  8. Fix M3: Add command examples to help text ✅ MERGED (PR #34)

Phase 2: CLI Consistency

  1. Add short alias for --deploy-agent (H2) ⏸️ WONTFIX
  2. Move format validation to flag parsing (H20) ⏸️ WONTFIX
  3. Move criteria validation to flag parsing (H7) ⏸️ WONTFIX
  4. Add warning when using base-only config (H8) ✅ PR #31
  5. Add command aliases (M1)

Phase 3: Recipe Coverage (C2)

  1. Add A100 overlays - Common existing deployments
  2. Add L40 overlays - Common inference workloads
  3. Add GKE/AKS overlays - Major cloud providers
  4. Add RHEL overlays - Enterprise Linux
  5. Add inference overlays for all GPUs - Complete workload coverage

Phase 4: Agent Improvements

  1. Fix H24: Make ClusterRole names configurable ⏸️ WONTFIX
  2. Fix H25: Use atomic ConfigMap updates ✅ MERGED (PR #32)
  3. Add resource limit flags (M22)
  4. Add labels/annotations flags (L10, L11)

Phase 5: Documentation

  1. Add changelog (M26)
  2. Add quick start guide (L12)
  3. Add troubleshooting guide
  4. Add architecture diagrams to README

Phase 6: Observability

  1. Document exposed metrics (L8)
  2. Add structured telemetry
  3. Add timing information to outputs

13. Appendices

Appendix A: File Reference

Component Key Files
CLI pkg/cli/*.go
Recipe pkg/recipe/*.go, pkg/recipe/data/*.yaml
Bundler pkg/bundler/*.go, pkg/component/*/
Deployer pkg/deployer/provider/*/
Collector pkg/collector/*/
Snapshotter pkg/snapshotter/*.go
Agent pkg/k8s/agent/*.go
Serializer pkg/serializer/*.go
Validator pkg/validator/*.go
K8s Client pkg/k8s/client/*.go

Appendix B: Criteria Values

Service Types:

  • eks - Amazon EKS
  • gke - Google GKE
  • aks - Azure AKS
  • oke - Oracle OKE
  • self-managed - Self-managed Kubernetes

Accelerator Types:

  • h100 - NVIDIA H100
  • gb200 - NVIDIA GB200
  • a100 - NVIDIA A100
  • l40 - NVIDIA L40

Intent Types:

  • training - ML training workloads
  • inference - ML inference workloads

OS Types:

  • ubuntu - Ubuntu Linux
  • rhel - Red Hat Enterprise Linux
  • cos - Container-Optimized OS (GKE)
  • amazonlinux - Amazon Linux

Appendix C: Constraint Path Format

{Type}.{Subtype}.{Key}

Supported Types:
- K8s
- GPU
- OS
- SystemD

Examples:
- K8s.server.version
- GPU.smi.driver-version
- GPU.smi.cuda-version
- GPU.smi.gpu.count
- OS.release.ID
- OS.release.VERSION_ID
- OS.sysctl./proc/sys/kernel/osrelease
- OS.kmod.nvidia
- SystemD.containerd.service.ActiveState

Appendix D: Exit Codes

Code Current Meaning
0 Success (or validation passed, even with failures unless --fail-on-error)
1 Any error

Recommended Enhancement:

Code Proposed Meaning
0 Success
1 User error (invalid flags)
2 Execution error (API failures)
3 Validation failure (with --fail-on-error)

Appendix E: Environment Variables

Variable Used By Default Description
CNS_NAMESPACE snapshot gpu-operator Agent deployment namespace
CNS_IMAGE snapshot ghcr.io/nvidia/cns:latest Agent container image
KUBECONFIG snapshot, recipe, validate ~/.kube/config Kubernetes config path
LOG_LEVEL all info Logging level
NO_COLOR all false Disable colored output

Revision History

Version Date Changes
4.3 2026-01-15 Added PR #34 (M3) and #35 (M27). Total: 58 issues (27 open, 19 fixed, 12 wontfix). Phase 1 complete!
4.2 2026-01-15 Major refresh: All 7 PRs now MERGED (#24, #27, #29, #30, #31, #32, #33)
4.1 2026-01-15 Added L16-L18 from E2E testing
4.0 2026-01-14 Complete fresh analysis with deep context. Added H22-H25, M27-M28

Quick Reference: Issue Status Legend

Symbol Meaning
✅ FIXED Issue resolved and merged to upstream
✅ PARTIALLY FIXED Issue improved but not fully resolved
⏸️ WONTFIX Issue acknowledged but intentionally not fixing
Open Issue confirmed, no fix submitted yet
(E2E) Issue identified during E2E testing

Our Merged PRs (9 total)

PR Issue Description Status
#35 M27 Fix image registry in example Job manifest ✅ MERGED
#34 M3 Add examples to recipe and bundle command help ✅ MERGED
#33 (E2E) Log when CLI flags override snapshot-detected criteria ✅ MERGED
#32 H25 Use SSA for atomic ConfigMap updates ✅ MERGED
#31 H8 Warn when using base-only config ✅ MERGED
#30 H13 Default --fail-on-error to true ✅ MERGED
#29 H23 Enable kubeconfig support for bundle command ✅ MERGED
#27 C1 Add --privileged flag for PSS compliance ✅ MERGED
#24 H16 Return error instead of silent fallback ✅ MERGED

Document generated by Claude Opus 4.5 based on comprehensive codebase analysis. Last synced with upstream: 2026-01-15 (commit a68ee61)

End of Document
