Document Version: 4.3
Generated: 2026-01-14
Last Updated: 2026-01-15
Codebase Branch: main
Upstream Commit: a68ee61
Analyzer: Claude Opus 4.5
This document provides a comprehensive UX analysis of the CNS CLI tool (cnsctl), covering CLI design patterns, agent deployment security, recipe system coverage, bundler functionality, collector subsystems, and developer experience. This v4.0 is a complete fresh analysis with deep context for each issue.
| Priority | Open | Fixed | Wontfix | Total |
|---|---|---|---|---|
| Critical | 1 | 3 | 1 | 5 |
| High | 0 | 12 | 7 | 19 |
| Medium | 8 | 4 | 4 | 16 |
| Low | 18 | 0 | 0 | 18 |
| Total | 27 | 19 | 12 | 58 |
Legend: Open = no action taken, Fixed = merged PR, Wontfix = deliberately not fixing
| PR | Description | Status | Our Work? |
|---|---|---|---|
| #5 | Add OCI Build and Push functionality | OPEN | No |
| PR | Issue | Description | Status |
|---|---|---|---|
| #35 | M27 | Fix image registry in example Job manifest | ✅ MERGED |
| #34 | M3 | Add examples to recipe and bundle command help | ✅ MERGED |
| #33 | (E2E) | Log when CLI flags override snapshot-detected criteria | ✅ MERGED |
| #32 | H25 | Use SSA for atomic ConfigMap updates | ✅ MERGED |
| #31 | H8 | Warn when using base-only config | ✅ MERGED |
| #30 | H13 | Default --fail-on-error to true | ✅ MERGED |
| #29 | H23 | Enable kubeconfig support for bundle command | ✅ MERGED |
| #27 | C1 | Add --privileged flag for PSS compliance | ✅ MERGED |
| #24 | H16 | Return error instead of silent fallback | ✅ MERGED |
| PR | Description | Status |
|---|---|---|
| #28 | Dependency upgrades | ✅ MERGED |
| #26 | Add Flox Env for Dev Tooling | ✅ MERGED |
| #23 | Add --image-pull-secret flag | ✅ MERGED |
| #22 | Add info logging to collectors | ✅ MERGED |
| #21 | Add make image target | ✅ MERGED |
| #20 | Stream agent Job logs during wait | ✅ MERGED |
| #19 | Graceful degradation when D-Bus unavailable | ✅ MERGED |
| #18 | Graceful degradation when nvidia-smi missing | ✅ MERGED |
| #17 | Case-insensitive bundle type with typo suggestions | ✅ MERGED |
| #16 | Improve resource cleanup error handling | ✅ MERGED |
| #15 | Fix agent-deployment.md documentation | ✅ MERGED |
| #14 | Add validation for recipe criteria | ✅ MERGED |
| #12 | Standardize CLI flag aliases | ✅ MERGED |
- Complete fresh analysis with updated codebase (commit 5620b0d)
- Deep context added for every issue explaining why it matters
- Identified 4 new issues (H22-H25) from deep analysis
- Added documentation/Makefile/YAML analysis findings
- Created 6 PRs: #24, #27, #29, #30, #31, #32
- Corrected issue counts: 55 total issues (was incorrectly stated as 54)
- More comprehensive call graphs and architecture diagrams
- Command Architecture
- CLI Flag Analysis
- Call Graphs
- Agent Deployment System
- Recipe System
- Bundler System
- Collector System
- Serializer System
- Documentation Analysis
- Build System Analysis
- Issue Catalog
- UX Improvement Roadmap
- Appendices
cnsctl (root)
├── snapshot - Capture system configuration snapshot
├── recipe - Generate configuration recipe from criteria
├── bundle - Generate artifact bundle from recipe
├── validate - Validate cluster against recipe constraints
├── completion - Shell completion scripts (visible since PR #8)
└── version - Display version information
| Flag | Type | Default | Env Var | Description |
|---|---|---|---|---|
| `--debug` | bool | false | `CNS_DEBUG` | Enable debug logging |
| `--log-json` | bool | false | `CNS_LOG_JSON` | Enable structured JSON logging |
| Flag | Alias | Type | Default | Used By |
|---|---|---|---|---|
| `--output` | `-o` | string | stdout | snapshot, recipe, validate, bundle |
| `--format` | `-f` | string | yaml | snapshot, recipe, validate |
| `--kubeconfig` | `-k` | string | (auto) | snapshot, recipe, validate |
User Request
│
├─► snapshot ─► Collectors (GPU/K8s/OS/SystemD) ─► Serializer ─► Output
│ │
│ └─► [--deploy-agent] ─► K8s Job ─► ConfigMap
│
├─► recipe ─► Criteria ─► Overlay Matcher ─► Merger ─► RecipeResult
│ │
│ └─► [--snapshot] ─► Extract criteria from snapshot
│
├─► validate ─► Load Recipe + Snapshot ─► Constraint Evaluator ─► Result
│
└─► bundle ─► Registry ─► Parallel Bundlers ─► Deployer ─► Files
File: pkg/cli/snapshot.go:19-174
| Flag | Alias | Type | Default | Required | Description |
|---|---|---|---|---|---|
| `--deploy-agent` | - | bool | false | No | Deploy K8s Job for snapshot |
| `--namespace` | - | string | gpu-operator | No | Agent namespace (env: CNS_NAMESPACE) |
| `--image` | - | string | ghcr.io/nvidia/cns:latest | No | Agent image (env: CNS_IMAGE) |
| `--image-pull-secret` | - | []string | [] | No | Image pull secrets for private registries |
| `--job-name` | - | string | cns | No | K8s Job name |
| `--service-account-name` | - | string | cns | No | ServiceAccount name |
| `--node-selector` | - | []string | [] | No | Node selectors (key=value) |
| `--toleration` | - | []string | [] | No | Tolerations (key=value:effect). Default: all taints tolerated |
| `--timeout` | - | duration | 5m | No | Job completion timeout |
| `--cleanup` | - | bool | true | No | Remove resources after completion |
| `--output` | `-o` | string | stdout | No | Output destination |
| `--format` | `-f` | string | yaml | No | Output format (yaml, json, table) |
| `--kubeconfig` | `-k` | string | (auto) | No | Path to kubeconfig file |
Key Observations:
- The `--cleanup` flag defaults to `true` since PR/commit fixing it
- `--toleration`, when empty, uses universal toleration (`operator: Exists`)
- `--kubeconfig` flag is present but not used in local snapshot mode (only agent mode)
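
The `key=value:effect` toleration format and the empty-list default described above can be sketched as follows. This is a minimal illustration, not the cnsctl implementation: `Toleration` is a simplified stand-in for the Kubernetes type and `parseTolerations` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// Toleration is a simplified stand-in for corev1.Toleration (assumption).
type Toleration struct {
	Key      string
	Operator string // "Equal" or "Exists"
	Value    string
	Effect   string
}

// parseTolerations is a hypothetical helper that parses "key=value:effect"
// strings. An empty input yields one universal toleration (operator: Exists).
func parseTolerations(specs []string) ([]Toleration, error) {
	if len(specs) == 0 {
		return []Toleration{{Operator: "Exists"}}, nil
	}
	out := make([]Toleration, 0, len(specs))
	for _, spec := range specs {
		kv, effect, ok := strings.Cut(spec, ":")
		if !ok || effect == "" {
			return nil, fmt.Errorf("invalid toleration %q: want key=value:effect", spec)
		}
		key, value, ok := strings.Cut(kv, "=")
		if !ok || key == "" {
			return nil, fmt.Errorf("invalid toleration %q: want key=value:effect", spec)
		}
		out = append(out, Toleration{Key: key, Operator: "Equal", Value: value, Effect: effect})
	}
	return out, nil
}

func main() {
	ts, err := parseTolerations([]string{"nvidia.com/gpu=present:NoSchedule"})
	fmt.Println(ts, err)
}
```

Rejecting malformed entries up front, rather than skipping them, keeps a typo from silently widening or narrowing where the agent Job can be scheduled.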
File: pkg/cli/recipe.go:21-145
| Flag | Alias | Type | Default | Required | Description |
|---|---|---|---|---|---|
| `--service` | - | string | - | No | K8s service type (eks, gke, aks, oke) |
| `--accelerator` | `--gpu` | string | - | No | GPU type (h100, gb200, a100, l40) |
| `--intent` | - | string | - | No | Workload intent (training, inference) |
| `--os` | - | string | - | No | OS type (ubuntu, rhel, cos, amazonlinux) |
| `--nodes` | - | int | 0 | No | Number of GPU nodes |
| `--snapshot` | `-s` | string | - | No | Path/URI to snapshot |
| `--output` | `-o` | string | stdout | No | Output destination |
| `--format` | `-f` | string | yaml | No | Output format |
| `--kubeconfig` | `-k` | string | (auto) | No | Kubeconfig for ConfigMap access |
Key Observations:
- Either criteria flags OR `--snapshot` should be provided
- If `--snapshot` is provided, criteria are extracted from it
- CLI criteria flags override snapshot-extracted values
- Validation added in PR #14: at least one criteria required
File: pkg/cli/bundle.go:25-202
| Flag | Alias | Type | Default | Required | Description |
|---|---|---|---|---|---|
| `--recipe` | `-r` | string | - | Yes | Path/URI to recipe |
| `--bundlers` | `-b` | []string | [] | No | Bundler types to execute |
| `--output` | `-o` | string | `.` | No | Output directory |
| `--set` | - | []string | [] | No | Value overrides (bundler:path=value) |
| `--system-node-selector` | - | []string | [] | No | System component node selectors |
| `--system-node-toleration` | - | []string | [] | No | System component tolerations |
| `--accelerated-node-selector` | - | []string | [] | No | GPU node selectors |
| `--accelerated-node-toleration` | - | []string | [] | No | GPU node tolerations |
| `--deployer` | - | string | script | No | Deployment method (script, argocd, flux) |
Key Observations:
- `--output` is a directory here vs file for other commands (H19)
- Has `--kubeconfig` flag defined but never used in code (see H23)
- `--bundlers` is case-insensitive since PR #17
- When `--bundlers` is empty, all registered bundlers execute
File: pkg/cli/validate.go:20-167
| Flag | Alias | Type | Default | Required | Description |
|---|---|---|---|---|---|
| `--recipe` | `-r` | string | - | Yes | Path/URI to recipe |
| `--snapshot` | `-s` | string | - | Yes | Path/URI to snapshot |
| `--fail-on-error` | - | bool | false | No | Exit non-zero on validation failure |
| `--output` | `-o` | string | stdout | No | Output destination |
| `--format` | `-f` | string | yaml | No | Output format |
| `--kubeconfig` | `-k` | string | (auto) | No | Kubeconfig for ConfigMap access |
Key Observations:
- Both `--recipe` and `--snapshot` are required
- Without `--fail-on-error`, validation failures return exit code 0 (H13)
- Supports ConfigMap URIs for both inputs
| Aspect | Commands | Status | Issue |
|---|---|---|---|
| `--format` alias `-f` | All with format | ✅ Consistent | Fixed in PR #12 |
| `--output` alias `-o` | All | ✅ Consistent | - |
| `--output` meaning | bundle=dir, others=file | ❌ Inconsistent | H19 (WONTFIX) |
| `--kubeconfig` alias `-k` | snapshot, recipe, validate | ✅ Consistent | - |
| `--kubeconfig` on bundle | Defined but unused | ❌ Dead code | H23 (NEW) |
| Short alias for `--deploy-agent` | snapshot | ❌ Missing | H2 |
| Short alias for `--fail-on-error` | validate | ❌ Missing | - |
snapshotCmd() [pkg/cli/snapshot.go:19]
│
├─► Parse CLI flags [snapshot.go:119-168]
│ ├─ serializer.Format(cmd.String("format"))
│ │ └─ Returns JSON, YAML, or Table format
│ ├─ collector.NewDefaultFactory(collector.WithVersion(version))
│ │ └─ Creates factory for GPU, K8s, OS, SystemD collectors
│ └─ snapshotter.NodeSnapshotter{} initialization
│ ├─ Version, Factory, Serializer configured
│ └─ AgentConfig set if --deploy-agent
│
└─► ns.Measure(ctx) [pkg/snapshotter/snapshot.go:42]
│
├─ IF AgentConfig.Enabled:
│ └─► n.measureWithAgent(ctx) [pkg/snapshotter/agent.go:126-223]
│ ├─► k8sclient.GetKubeClient(kubeconfig)
│ │ └─ Returns cached clientset, restconfig
│ ├─► agent.NewDeployer(clientset, config, opts...)
│ │ └─ Configures namespace, image, nodeSelector, etc.
│ │
│ ├─► deployer.Deploy(ctx) [pkg/k8s/agent/deployer.go:13-47]
│ │ ├─► d.CheckPermissions(ctx) [permissions.go:11-76]
│ │ │ └─ SelfSubjectAccessReview for each required permission
│ │ ├─► d.ensureServiceAccount(ctx) [rbac.go:16-25]
│ │ ├─► d.ensureRole(ctx) [rbac.go:27-49]
│ │ ├─► d.ensureRoleBinding(ctx) [rbac.go:51-77]
│ │ ├─► d.ensureClusterRole(ctx) [rbac.go:79-110]
│ │ │ └─ **HARDCODED name: "cns-node-reader"** (H24)
│ │ ├─► d.ensureClusterRoleBinding(ctx) [rbac.go:112-143]
│ │ │ └─ **HARDCODED name: "cns-node-reader"** (H24)
│ │ └─► d.ensureJob(ctx) [job.go:12-31]
│ │ └─► d.buildJob(ctx) [job.go:33-138]
│ │ ├─ Builds privileged pod spec
│ │ ├─ Sets nodeSelector, tolerations
│ │ └─ Adds volume mounts for /run/systemd
│ │
│ ├─► deployer.WaitForJobCompletion(ctx, timeout) [wait.go:13-93]
│ │ ├─ Watch Job status
│ │ ├─► WaitForPodReady(ctx) [wait.go:96-147]
│ │ │ └─ Detect pod errors: CrashLoopBackOff, ImagePullBackOff, etc.
│ │ └─► StreamLogs(ctx) [wait.go:150-195]
│ │ └─ Stream pod logs with [agent] prefix
│ │
│ ├─► deployer.GetSnapshot(ctx) [deployer.go:119-166]
│ │ └─ Read from ConfigMap, parse YAML
│ │
│ └─ defer: deployer.Cleanup(ctx, opts) [deployer.go:72-117]
│ └─ Delete Job, SA, Role, RoleBinding, ClusterRole, ClusterRoleBinding
│
└─ ELSE (local mode):
└─► n.measure(ctx) [pkg/snapshotter/snapshot.go:53-193]
├─ errgroup.WithContext(ctx)
│
├─ g.Go: metadata collection
│ └─ Hostname, timestamp, version
│
├─ g.Go: k8sCollector.Collect(gctx)
│ └─ [pkg/collector/k8s/k8s.go]
│ ├─ Server version from /version
│ ├─ Pod images from all namespaces
│ ├─ ClusterPolicy from nvidia.com
│ └─ Node info (first node)
│
├─ g.Go: systemdCollector.Collect(gctx)
│ └─ [pkg/collector/systemd/systemd.go]
│ └─ D-Bus queries for containerd, docker, kubelet
│ └─ **Graceful degradation** if D-Bus unavailable (PR #19)
│
├─ g.Go: osCollector.Collect(gctx)
│ └─ [pkg/collector/os/os.go]
│ ├─ /proc/cmdline (grub params)
│ ├─ /proc/modules (kmod)
│ ├─ /proc/sys/* (sysctl)
│ └─ /etc/os-release
│
├─ g.Go: gpuCollector.Collect(gctx)
│ └─ [pkg/collector/gpu/gpu.go]
│ └─ nvidia-smi -q -x
│ └─ **Graceful degradation** if nvidia-smi missing (PR #18)
│
├─ g.Wait()
│ └─ Fail-fast on first error
│
└─► n.Serializer.Serialize(ctx, snap)
└─ Output to file, ConfigMap, or stdout
recipeCmd() [pkg/cli/recipe.go:21]
│
├─► Parse CLI flags
│ └─ serializer.Format(cmd.String("format"))
│
├─► recipe.NewBuilder(recipe.WithVersion(version))
│
├─ IF --snapshot provided:
│ │
│ ├─► serializer.FromFileWithKubeconfig[Snapshot](path, kubeconfig)
│ │ └─ Supports: file path, HTTP/HTTPS URL, cm://namespace/name
│ │
│ ├─► extractCriteriaFromSnapshot(snap) [recipe.go:170-268]
│ │ │
│ │ ├─ TypeK8s → Service detection
│ │ │ ├─ Check K8s.server.version for "-eks-", "-gke", "-aks"
│ │ │ └─ Map to CriteriaServiceEKS, etc.
│ │ │
│ │ ├─ TypeGPU → Accelerator detection
│ │ │ └─ Check gpu.model for "h100", "gb200", "a100", "l40"
│ │ │
│ │ └─ TypeOS → OS detection
│ │ └─ Check OS.release.ID
│ │
│ └─► applyCriteriaOverrides(cmd, criteria) [recipe.go:270-304]
│ └─ CLI flags override snapshot-extracted values
│
├─ ELSE:
│ └─► buildCriteriaFromCmd(cmd) [recipe.go:148-168]
│ └─► recipe.BuildCriteria(opts...)
│ └─ Validation: at least one criteria required (PR #14)
│
└─► builder.BuildFromCriteria(ctx, criteria) [pkg/recipe/builder.go:42-95]
│
├─► loadMetadataStore(ctx) [pkg/recipe/metadata_store.go:39-135]
│ ├─ fs.WalkDir(metadataFS, "data")
│ │ └─ Embedded files: base.yaml + overlay/*.yaml
│ ├─ Parse base.yaml
│ │ └─ Default components, constraints
│ └─ Parse overlay/*.yaml files
│ └─ Environment-specific configurations
│
└─► store.BuildRecipeResult(ctx, criteria) [metadata_store.go:169-225]
│
├─► store.FindMatchingOverlays(criteria)
│ └─ For each overlay:
│ └─ overlay.Spec.Criteria.Matches(criteria)
│ └─ Specificity scoring (0-5 points)
│
├─ Merge base with overlays (specificity order)
│ └─ Lower specificity first, then higher
│
├─ mergedSpec.ValidateDependencies()
│ └─ Check all dependencyRefs resolve
│
├─ mergedSpec.TopologicalSort()
│ └─ Order by deploymentOrder
│
└─ Return RecipeResult with:
├─ Criteria (input + detected)
├─ ComponentRefs (with values, overrides)
├─ Constraints (validation rules)
└─ Metadata (appliedOverlays, version)
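
The `ValidateDependencies()` and `TopologicalSort()` steps in the call graph above can be illustrated with Kahn's algorithm. This is a sketch under assumptions: `ComponentRef` is reduced to the two fields needed here, and `topoSort` is a hypothetical name, not the function in `pkg/recipe`.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// ComponentRef is a reduced, illustrative version of a recipe component.
type ComponentRef struct {
	Name           string
	DependencyRefs []string
}

// topoSort orders components so that every dependency precedes its
// dependents (Kahn's algorithm). Ties are broken alphabetically so the
// result is deterministic.
func topoSort(components []ComponentRef) ([]string, error) {
	indegree := make(map[string]int, len(components))
	dependents := make(map[string][]string)
	for _, c := range components {
		indegree[c.Name] = 0
	}
	for _, c := range components {
		for _, dep := range c.DependencyRefs {
			if _, ok := indegree[dep]; !ok {
				return nil, fmt.Errorf("component %q depends on unknown component %q", c.Name, dep)
			}
			indegree[c.Name]++
			dependents[dep] = append(dependents[dep], c.Name)
		}
	}
	var ready []string
	for name, n := range indegree {
		if n == 0 {
			ready = append(ready, name)
		}
	}
	sort.Strings(ready)
	order := make([]string, 0, len(components))
	for len(ready) > 0 {
		name := ready[0]
		ready = ready[1:]
		order = append(order, name)
		for _, d := range dependents[name] {
			indegree[d]--
			if indegree[d] == 0 {
				ready = append(ready, d)
			}
		}
		sort.Strings(ready)
	}
	if len(order) != len(components) {
		return nil, errors.New("dependency cycle detected")
	}
	return order, nil
}

func main() {
	order, err := topoSort([]ComponentRef{
		{Name: "gpu-operator", DependencyRefs: []string{"cert-manager"}},
		{Name: "cert-manager"},
	})
	fmt.Println(order, err)
}
```

Checking for unknown references before sorting mirrors the order shown in the call graph, where dependency validation runs ahead of the sort.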
bundleCmd() [pkg/cli/bundle.go:25]
│
├─► Parse CLI flags
│ ├─► config.ParseValueOverrides(--set flags)
│ │ └─ Parse "bundler:path.to.field=value" format
│ ├─► snapshotter.ParseNodeSelectors()
│ ├─► snapshotter.ParseTolerations()
│ └─ Validate deployer type: script, argocd, flux
│
├─► serializer.FromFile[RecipeResult](recipePath)
│ └─ **NOTE: Does NOT use kubeconfig flag** (H23)
│
├─► registry.NewFromGlobal(config) [pkg/bundler/registry/registry.go]
│ └─ Auto-registered bundlers via init():
│ ├─ certmanager [pkg/component/certmanager/]
│ ├─ gpuoperator [pkg/component/gpuoperator/]
│ ├─ networkoperator [pkg/component/networkoperator/]
│ ├─ nvsentinel [pkg/component/nvsentinel/]
│ └─ skyhook [pkg/component/skyhook/]
│
├─► bundler.New(opts...) [pkg/bundler/bundler.go:136-173]
│ └─ Apply overrides, node selectors, tolerations
│
└─► b.Make(ctx, recipe, outputDir) [bundler.go:180-244]
│
├─ Validate input (non-nil recipe)
│
├─ Create output directory
│
├─► b.selectBundlers(input, types) [bundler.go:389-425]
│ └─ If types empty, select all registered bundlers
│
├─► b.makeParallel(ctx, input, dir, bundlers) [bundler.go:248-334]
│ └─ errgroup.WithContext(ctx)
│ └─ For each bundler (concurrent):
│ └─► b.executeBundler(ctx, type, bundler, input, dir)
│ ├─► bundler.Validate(ctx, input)
│ │ └─ Check component exists in recipe
│ └─► bundler.Make(ctx, input, dir)
│ ├─ GetComponentRef(name)
│ ├─ GetValuesForComponent(name)
│ │ └─ Merge: base → valuesFile → overrides → CLI --set
│ ├─ CreateBundleDir(subdir)
│ ├─ GenerateFileFromTemplate(values.yaml)
│ ├─ GenerateFileFromTemplate(install.sh)
│ ├─ GenerateFileFromTemplate(uninstall.sh)
│ ├─ GenerateFileFromTemplate(README.md)
│ └─ GenerateResult() with checksums
│
└─► b.createRootArtifacts(ctx, input, dir) [bundler.go:430-461]
├─► b.writeRecipeFile(recipe, dir)
│ └─ Copy recipe.yaml to output
└─► deployer.Generate(ctx, recipe, dir)
├─ ArgoCD: app-of-apps.yaml + Application CRs per component
├─ Flux: kustomization.yaml + HelmRelease CRs with dependsOn
└─ Script: README.md with helm install commands
validateCmd() [pkg/cli/validate.go:20]
│
├─► Parse CLI flags
│ └─ serializer.Format(cmd.String("format"))
│
├─► serializer.FromFileWithKubeconfig[RecipeResult](recipePath, kubeconfig)
│ └─ Supports: file, HTTP/HTTPS URL, cm://namespace/name
│
├─► serializer.FromFileWithKubeconfig[Snapshot](snapshotPath, kubeconfig)
│ └─ Supports: file, HTTP/HTTPS URL, cm://namespace/name
│
├─► validator.New(validator.WithVersion(version))
│
└─► v.Validate(ctx, recipe, snapshot) [pkg/validator/validator.go:49-108]
│
├─► NewValidationResult()
│
├─ For each recipe.Constraints:
│ └─► v.evaluateConstraint(constraint, snap) [validator.go:111-185]
│ │
│ ├─► ParseConstraintPath(constraint.Name)
│ │ └─ Split "{Type}.{Subtype}.{Key}"
│ │
│ ├─► path.ExtractValue(snap)
│ │ └─ Find matching measurement.subtype.data[key]
│ │
│ ├─► ParseConstraintExpression(constraint.Value)
│ │ └─ Parse operators: >=, <=, ==, !=, >, <
│ │
│ └─► parsed.Evaluate(actual)
│ └─ Version comparison or string match
│
├─ Calculate summary:
│ ├─ Passed count
│ ├─ Failed count
│ ├─ Skipped count (missing data)
│ └─ Overall status: pass/fail/partial
│
└─ Return ValidationResult
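
The constraint evaluation step above (operator parsing followed by a version comparison or string match) can be sketched as below. The function names are hypothetical and the version comparison handles only dotted numeric versions such as `1.32`; a real validator would also need to cope with suffixes such as vendor build tags.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Two-character operators come first so ">=" is not read as ">".
var operators = []string{">=", "<=", "==", "!=", ">", "<"}

// parseExpr splits "<op> <value>"; a bare value means exact match.
func parseExpr(expr string) (op, want string) {
	expr = strings.TrimSpace(expr)
	for _, o := range operators {
		if strings.HasPrefix(expr, o) {
			return o, strings.TrimSpace(expr[len(o):])
		}
	}
	return "==", expr
}

// compareVersions compares dotted numeric versions and returns -1, 0 or 1.
// ok is false when either side has a non-numeric segment.
func compareVersions(a, b string) (cmp int, ok bool) {
	as := strings.Split(strings.TrimPrefix(a, "v"), ".")
	bs := strings.Split(strings.TrimPrefix(b, "v"), ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var x, y int
		var err error
		if i < len(as) {
			if x, err = strconv.Atoi(as[i]); err != nil {
				return 0, false
			}
		}
		if i < len(bs) {
			if y, err = strconv.Atoi(bs[i]); err != nil {
				return 0, false
			}
		}
		if x < y {
			return -1, true
		}
		if x > y {
			return 1, true
		}
	}
	return 0, true
}

// evaluate reports whether actual satisfies expr.
func evaluate(expr, actual string) (bool, error) {
	op, want := parseExpr(expr)
	cmp, numeric := compareVersions(actual, want)
	if !numeric {
		switch op {
		case "==":
			return actual == want, nil
		case "!=":
			return actual != want, nil
		}
		return false, fmt.Errorf("operator %q needs numeric versions, got %q and %q", op, actual, want)
	}
	switch op {
	case ">=":
		return cmp >= 0, nil
	case "<=":
		return cmp <= 0, nil
	case "==":
		return cmp == 0, nil
	case "!=":
		return cmp != 0, nil
	case ">":
		return cmp > 0, nil
	}
	return cmp < 0, nil // "<"
}

func main() {
	fmt.Println(evaluate(">= 1.32", "1.33"))
	fmt.Println(evaluate("ubuntu", "ubuntu"))
}
```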
┌─────────────────────────────────────────────────────────────────┐
│ User Workstation │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ cnsctl snapshot --deploy-agent ││
│ │ │ ││
│ │ ├─► CheckPermissions() ─ SelfSubjectAccessReview ││
│ │ │ └─ Verifies: create configmaps, get nodes, etc. ││
│ │ │ ││
│ │ ├─► Deploy() ─ Create RBAC + Job ││
│ │ │ ├─ ServiceAccount, Role, RoleBinding (namespaced) ││
│ │ │ ├─ ClusterRole, ClusterRoleBinding (cluster-scoped) ││
│ │ │ │ └─ **HARDCODED names** (H24) ││
│ │ │ └─ Job with privileged pod ││
│ │ │ ││
│ │ ├─► WaitForJobCompletion() ─ Watch Job status ││
│ │ │ ├─ WaitForPodReady() with error detection ││
│ │ │ │ └─ CrashLoopBackOff, ImagePullBackOff, etc. ││
│ │ │ └─ StreamLogs() with [agent] prefix ││
│ │ │ ││
│ │ ├─► GetSnapshot() ─ Read from ConfigMap ││
│ │ │ └─ Parse YAML from data.snapshot.yaml ││
│ │ │ ││
│ │ └─► Cleanup() ─ Delete resources (if --cleanup=true) ││
│ │ └─ Attempts all deletions, reports errors ││
│ └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Namespace: gpu-operator (default) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐ │ │
│ │ │ServiceAcct │ │ Role │ │ RoleBinding │ │ │
│ │ │ "cns" │ │ "cns" │ │ "cns" │ │ │
│ │ └─────────────┘ └─────────────┘ └──────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ Job "cns" │ │ │
│ │ │ ┌───────────────────────────────────────────────┐ │ │ │
│ │ │ │ Pod (privileged, hostPID/Net/IPC, root) │ │ │ │
│ │ │ │ ├─ GPU Collector (nvidia-smi) │ │ │ │
│ │ │ │ ├─ K8s Collector (API client) │ │ │ │
│ │ │ │ ├─ OS Collector (/proc, /etc) │ │ │ │
│ │ │ │ └─ SystemD Collector (D-Bus) │ │ │ │
│ │ │ └───────────────────────────────────────────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ ConfigMap "cns-snapshot" │ │ │
│ │ │ labels: │ │ │
│ │ │ app.kubernetes.io/name: cns │ │ │
│ │ │ app.kubernetes.io/component: snapshot │ │ │
│ │ │ app.kubernetes.io/version: <version> │ │ │
│ │ │ data: │ │ │
│ │ │ snapshot.yaml: "<YAML content>" │ │ │
│ │ │ format: yaml │ │ │
│ │ │ timestamp: "2026-01-14T10:30:00Z" │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Cluster-Scoped Resources │ │
│ │ ┌───────────────────┐ ┌─────────────────────────────┐ │ │
│ │ │ ClusterRole │ │ ClusterRoleBinding │ │ │
│ │ │ "cns-node-reader" │ │ "cns-node-reader" │ │ │
│ │ │ (HARDCODED!) │ │ (HARDCODED!) │ │ │
│ │ └───────────────────┘ └─────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
| Resource | Verbs | Purpose |
|---|---|---|
| configmaps | create, get, update, patch | Store snapshot data |
| pods | get, list | Monitor Job pod status |
| pods/log | get | Stream pod logs |
| Resource | API Group | Verbs | Purpose |
|---|---|---|---|
| nodes | "" | get, list | Query node info |
| pods | "" | get, list | List all pods |
| services | "" | get, list | List all services |
| clusterpolicies | nvidia.com | get, list | NVIDIA GPU policies |
Issue H24: ClusterRole/ClusterRoleBinding names are hardcoded to "cns-node-reader"
- Cannot customize via `--job-name` or `--service-account-name`
- Multiple concurrent deployments in different namespaces share same cluster resources
- Cleanup in one namespace may affect another
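
One possible direction for H24, shown purely as an illustration, is to derive the cluster-scoped names from the namespace and Job name so that deployments in different namespaces do not share a ClusterRole. `clusterRoleName` is a hypothetical helper and the naming scheme is an assumption, not something the codebase does today.

```go
package main

import "fmt"

// clusterRoleName is a hypothetical helper: it scopes the cluster-wide
// RBAC object name to one deployment instead of the fixed
// "cns-node-reader".
func clusterRoleName(namespace, jobName string) string {
	return fmt.Sprintf("%s-%s-node-reader", namespace, jobName)
}

func main() {
	fmt.Println(clusterRoleName("gpu-operator", "cns"))
	fmt.Println(clusterRoleName("team-b", "cns"))
}
```

Simple concatenation can still collide when the parts themselves contain hyphens, so a production scheme would likely need a stricter separator or a hash suffix.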
| Setting | Value | Security Implication |
|---|---|---|
| `runAsUser` | 0 (root) | Full system access |
| `privileged` | true | Bypass container isolation |
| `hostPID` | true | See all host processes |
| `hostNetwork` | true | Access host network |
| `hostIPC` | true | Access host IPC |
| `capabilities` | SYS_ADMIN, SYS_CHROOT | System-level operations |
Why Privileged is Required:
- nvidia-smi: Requires access to GPU devices
- D-Bus: Requires access to system D-Bus socket
- /proc files: Requires host PID namespace
- SystemD properties: Requires host IPC namespace
Our PR #27 adds a `--privileged` flag so the agent can run in unprivileged mode on PSS-restricted clusters.
| Resource | Request | Limit |
|---|---|---|
| CPU | 1 | 2 |
| Memory | 4Gi | 8Gi |
| Ephemeral Storage | 2Gi | 4Gi |
Issue M22: These values are hardcoded, no flags to customize.
The wait logic now detects these pod failure conditions:
- `ImagePullBackOff`
- `ErrImagePull`
- `InvalidImageName`
- `CrashLoopBackOff`
- `CreateContainerError`
- `CreateContainerConfigError`
- `RunContainerError`
Each returns a clear error message with the reason.
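
A minimal sketch of this kind of check, assuming the wait loop can see each container's waiting reason. `ContainerState` is a simplified stand-in for the Kubernetes container status type and `podFailure` is a hypothetical function name.

```go
package main

import "fmt"

// fatalWaitingReasons holds the waiting reasons listed above.
var fatalWaitingReasons = map[string]bool{
	"ImagePullBackOff":           true,
	"ErrImagePull":               true,
	"InvalidImageName":           true,
	"CrashLoopBackOff":           true,
	"CreateContainerError":       true,
	"CreateContainerConfigError": true,
	"RunContainerError":          true,
}

// ContainerState is a simplified stand-in for a container status
// (assumption); WaitingReason is empty when the container is not waiting.
type ContainerState struct {
	Name          string
	WaitingReason string
	Message       string
}

// podFailure returns an error naming the first container that is stuck in
// one of the fatal waiting reasons, or nil if none is.
func podFailure(states []ContainerState) error {
	for _, s := range states {
		if fatalWaitingReasons[s.WaitingReason] {
			return fmt.Errorf("container %q failed: %s: %s", s.Name, s.WaitingReason, s.Message)
		}
	}
	return nil
}

func main() {
	err := podFailure([]ContainerState{
		{Name: "cns", WaitingReason: "ImagePullBackOff", Message: "back-off pulling image"},
	})
	fmt.Println(err)
}
```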
Attack Surface:
- Privileged container - Can escape container to host
- Host namespace access - Can observe all system activity
- Root execution - Full node access
- RBAC persistence - Cluster-scoped resources persist if cleanup fails
Mitigations:
- Permission check before deployment (`CheckPermissions()`)
- Automatic cleanup on completion (default: enabled)
- Resource limits prevent DoS
- Single execution (BackoffLimit: 0)
- Hard timeout (ActiveDeadlineSeconds: 18000 = 5 hours)
- Pod error detection with clear messages
kind: RecipeResult
apiVersion: cns.nvidia.com/v1alpha1
metadata:
  generatedAt: "2026-01-14T10:00:00Z"
  version: "v0.19.0"
  appliedOverlays:
    - gb200-eks-ubuntu-training
criteria:
  service: eks
  accelerator: gb200
  intent: training
  os: ubuntu
  nodes: 8
componentRefs:
  - name: cert-manager
    type: Helm
    chart: cert-manager
    version: v1.16.2
    repository: https://charts.jetstack.io
    namespace: cert-manager
    deploymentOrder: 1
    valuesFile: components/cert-manager/values.yaml
    overrides:
      installCRDs: true
  - name: gpu-operator
    type: Helm
    chart: gpu-operator
    version: v25.3.4
    repository: https://helm.ngc.nvidia.com/nvidia
    namespace: gpu-operator
    deploymentOrder: 2
    valuesFile: components/gpu-operator/eks-gb200-training.yaml
    dependencyRefs:
      - cert-manager
constraints:
  - name: K8s.server.version
    value: ">= 1.32"
  - name: OS.release.ID
    value: ubuntu
deploymentOrder:
  - cert-manager
  - gpu-operator
  - network-operator
  - nvsentinel
  - skyhook

File Structure:
pkg/recipe/data/
├── base.yaml # Default components and settings
└── *.yaml # Overlay files (3 currently)
├── gb200-eks-ubuntu-training.yaml
├── h100-eks-ubuntu-training.yaml
└── h100-ubuntu-inference.yaml
Matching Algorithm:
- Load all overlays from `pkg/recipe/data/*.yaml`
- For each overlay, check if criteria matches request
- Collect all matching overlays
- Sort by specificity score (ascending)
- Merge: base → less specific → more specific
Specificity Scoring:
- Each non-"any" field adds 1 point
- Fields: service, accelerator, intent, os (nodes is optional)
- Score range: 0-4 (or 0-5 with nodes)
Example:
Query: { service: eks, accelerator: gb200, os: ubuntu, intent: training }
Overlay 1: { service: eks } → Score 1, MATCH
Overlay 2: { service: eks, accelerator: gb200 } → Score 2, MATCH
Overlay 3: { accelerator: h100 } → Score 1, NO MATCH
Merge order: base → Overlay 1 → Overlay 2
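
The matching and scoring rules above can be sketched in a few lines. The type and function names here are illustrative rather than the ones in `pkg/recipe`, and treating an empty field the same as `any` is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// Criteria holds the four scored fields; "" or "any" acts as a wildcard.
type Criteria struct{ Service, Accelerator, Intent, OS string }

func (c Criteria) fields() []string {
	return []string{c.Service, c.Accelerator, c.Intent, c.OS}
}

func isAny(v string) bool { return v == "" || v == "any" }

// Specificity counts the non-wildcard fields (0-4).
func (c Criteria) Specificity() int {
	score := 0
	for _, f := range c.fields() {
		if !isAny(f) {
			score++
		}
	}
	return score
}

// Matches reports whether every non-wildcard field of the overlay
// criteria c equals the corresponding field of the query q.
func (c Criteria) Matches(q Criteria) bool {
	of, qf := c.fields(), q.fields()
	for i := range of {
		if !isAny(of[i]) && of[i] != qf[i] {
			return false
		}
	}
	return true
}

// Overlay pairs a name with the criteria it applies to.
type Overlay struct {
	Name     string
	Criteria Criteria
}

// mergeOrder returns "base" followed by the matching overlays, least
// specific first, so later entries override earlier ones.
func mergeOrder(overlays []Overlay, q Criteria) []string {
	var matched []Overlay
	for _, o := range overlays {
		if o.Criteria.Matches(q) {
			matched = append(matched, o)
		}
	}
	sort.SliceStable(matched, func(i, j int) bool {
		return matched[i].Criteria.Specificity() < matched[j].Criteria.Specificity()
	})
	names := []string{"base"}
	for _, o := range matched {
		names = append(names, o.Name)
	}
	return names
}

func main() {
	overlays := []Overlay{
		{Name: "Overlay 2", Criteria: Criteria{Service: "eks", Accelerator: "gb200"}},
		{Name: "Overlay 1", Criteria: Criteria{Service: "eks"}},
		{Name: "Overlay 3", Criteria: Criteria{Accelerator: "h100"}},
	}
	q := Criteria{Service: "eks", Accelerator: "gb200", OS: "ubuntu", Intent: "training"}
	fmt.Println(mergeOrder(overlays, q))
}
```

A stable sort keeps overlays with equal scores in their load order, which makes the merge result reproducible between runs.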
Supported Criteria Values:
| Criteria | Values | Count |
|---|---|---|
| Services | eks, gke, aks, oke | 4 |
| Accelerators | h100, gb200, a100, l40 | 4 |
| Intents | training, inference | 2 |
| OS | ubuntu, rhel, cos, amazonlinux | 4 |
Total Specific Combinations: 4 × 4 × 2 × 4 = 128
Current Overlays (3 files):
| Overlay | Service | Accelerator | OS | Intent | Specificity |
|---|---|---|---|---|---|
| gb200-eks-ubuntu-training | eks | gb200 | ubuntu | training | 4/4 |
| h100-eks-ubuntu-training | eks | h100 | ubuntu | training | 4/4 |
| h100-ubuntu-inference | any | h100 | ubuntu | inference | 3/4 |
Coverage: 3/128 = 2.34% (Issue C2)
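
The arithmetic behind this figure is reproduced below as a small sketch. It counts each overlay as covering one combination, as the figure above does; the function names are illustrative.

```go
package main

import "fmt"

// totalCombinations multiplies the number of values per criterion.
func totalCombinations(counts ...int) int {
	total := 1
	for _, n := range counts {
		total *= n
	}
	return total
}

// coveragePercent returns covered/total as a percentage.
func coveragePercent(covered, total int) float64 {
	return 100 * float64(covered) / float64(total)
}

func main() {
	// services × accelerators × intents × OS
	total := totalCombinations(4, 4, 2, 4)
	fmt.Printf("%d/%d = %.2f%%\n", 3, total, coveragePercent(3, total))
}
```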
| Gap Category | Missing Combinations | Count |
|---|---|---|
| A100 accelerator | All A100 combinations | 32 |
| L40 accelerator | All L40 combinations | 32 |
| GB200 non-EKS | gke/aks/oke + gb200 | 24 |
| GB200 inference | Any service + gb200 + inference | 16 |
| Non-Ubuntu OS | rhel/cos/amazonlinux + any | 96 |
| H100 training non-EKS | gke/aks/oke + h100 + training | 3 |
| GKE service | All GKE combinations | 32 |
| AKS service | All AKS combinations | 32 |
| OKE service | All OKE combinations | 32 |
Impact: Most user queries fall back to base configuration only, missing environment-specific optimizations.
| Bundler | Bundle Type | Key Outputs |
|---|---|---|
| cert-manager | `cert-manager` | values.yaml, install.sh, README |
| gpu-operator | `gpu-operator` | values.yaml, clusterpolicy.yaml, scripts |
| network-operator | `network-operator` | values.yaml, scripts, README |
| skyhook | `skyhook` | values.yaml, customization CRs, scripts |
| nvsentinel | `nvsentinel` | values.yaml, scripts, README |
Each bundler self-registers via init():
// pkg/component/gpuoperator/bundler.go
func init() {
	registry.MustRegister(Name, NewBundler())
}

const Name = types.BundleType("gpu-operator")

Bundler Interface:
type Bundler interface {
	Type() BundleType
	Make(ctx context.Context, input *recipe.RecipeResult, outputDir string) (*Result, error)
}

type ValidatableBundler interface {
	Bundler
	Validate(ctx context.Context, input *recipe.RecipeResult) error
}

Format: `--set bundler:path.to.field=value`
Merge Precedence (lowest to highest):
- Base values (from recipe data)
- valuesFile content
- Recipe overrides field
- CLI `--set` flags
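
The precedence list above amounts to a layered deep merge in which later layers win. The sketch below shows one way to express that with plain maps; `mergeValues` is a hypothetical name and the real bundler code may merge differently (for example, in how it treats lists).

```go
package main

import "fmt"

// mergeValues deep-merges layers in order; later layers override earlier
// ones. Nested maps are merged key by key, everything else is replaced.
func mergeValues(layers ...map[string]any) map[string]any {
	out := map[string]any{}
	for _, layer := range layers {
		mergeInto(out, layer)
	}
	return out
}

func mergeInto(dst, src map[string]any) {
	for k, v := range src {
		srcMap, ok := v.(map[string]any)
		if !ok {
			dst[k] = v // scalars and lists are replaced wholesale
			continue
		}
		dstMap, ok := dst[k].(map[string]any)
		if !ok {
			// Start from a fresh map so input layers are never mutated.
			dstMap = map[string]any{}
			dst[k] = dstMap
		}
		mergeInto(dstMap, srcMap)
	}
}

func main() {
	base := map[string]any{"driver": map[string]any{"version": "550", "enabled": true}}
	cli := map[string]any{"driver": map[string]any{"version": "570.133.20"}}
	fmt.Println(mergeValues(base, cli))
}
```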
Example Paths by Bundler:
gpuoperator:operator.nodeSelector=key=value
gpuoperator:daemonsets.nodeSelector=key=value
gpuoperator:dcgmExporter.config.create=true
gpuoperator:gds.enabled=true
gpuoperator:driver.version=570.133.20
gpuoperator:cdi.enabled=true
gpuoperator:mig.strategy=mixed
networkoperator:operator.repository=myregistry.com
networkoperator:ofedDriver.version=23.04
networkoperator:ofedDriver.deploy=true
networkoperator:rdma.enabled=true
networkoperator:sriov.enabled=true
certmanager:installCRDs=true
certmanager:nodeSelector=key=value
certmanager:tolerations=...
certmanager:webhook.nodeSelector=key=value
skyhook:manager.resources.cpu.limit=2
skyhook:manager.resources.memory.limit=2Gi
skyhook:customization=ubuntu
skyhook:controllerManager.selectors=key=value
nvsentinel:namespace=nvsentinel
nvsentinel:sentinel.enabled=true
nvsentinel:sentinel.logLevel=info
nvsentinel:global.systemNodeSelector=key=value
Limitations:
- No array index override syntax (e.g., `tolerations[0].key=value`)
- No wildcard paths
- Type conversion is automatic (strings → bool/int where appropriate)
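
A minimal sketch of parsing the `bundler:path.to.field=value` format with the automatic type conversion described above. `Override`, `parseOverride`, and `convert` are hypothetical names, and the conversion rules (integers first, then the literal strings `true` and `false`) are an assumption about what "where appropriate" means.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Override is an illustrative parsed form of one --set argument.
type Override struct {
	Bundler string
	Path    []string
	Value   any
}

// parseOverride splits on the first ":" and the first "=" only, so values
// such as "key=value" survive intact.
func parseOverride(s string) (Override, error) {
	bundler, rest, ok := strings.Cut(s, ":")
	if !ok || bundler == "" {
		return Override{}, fmt.Errorf("invalid override %q: want bundler:path=value", s)
	}
	path, raw, ok := strings.Cut(rest, "=")
	if !ok || path == "" {
		return Override{}, fmt.Errorf("invalid override %q: want bundler:path=value", s)
	}
	return Override{Bundler: bundler, Path: strings.Split(path, "."), Value: convert(raw)}, nil
}

// convert turns integers and the literals "true"/"false" into typed
// values; anything else (for example "570.133.20" or "2Gi") stays a string.
func convert(raw string) any {
	if n, err := strconv.Atoi(raw); err == nil {
		return n
	}
	switch raw {
	case "true":
		return true
	case "false":
		return false
	}
	return raw
}

func main() {
	for _, arg := range []string{
		"gpuoperator:gds.enabled=true",
		"gpuoperator:operator.nodeSelector=key=value",
		"skyhook:manager.resources.cpu.limit=2",
	} {
		o, err := parseOverride(arg)
		fmt.Printf("%+v %v\n", o, err)
	}
}
```

Keeping dotted version strings as strings avoids turning a value like `23.04` into a float and losing the trailing zero when it is written back to YAML.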
| Type | Outputs | Use Case |
|---|---|---|
| `script` | README with helm commands, install.sh | Manual deployment |
| `argocd` | app-of-apps.yaml, Application CRs | GitOps with ArgoCD |
| `flux` | kustomization.yaml, HelmRelease CRs | GitOps with Flux |
Deployment Order Handling:
- Script: Documents order in README
- ArgoCD: Uses `argocd.argoproj.io/sync-wave` annotations
- Flux: Uses `spec.dependsOn` fields
File: pkg/collector/factory.go
type Factory interface {
	CreateSystemDCollector() Collector
	CreateOSCollector() Collector
	CreateKubernetesCollector() Collector
	CreateGPUCollector() Collector
}

| Collector | Data Sources | Key Outputs | Graceful Degradation |
|---|---|---|---|
| GPU | `nvidia-smi -q -x` | driver version, CUDA, GPU model, memory, count | Yes (PR #18) - returns gpu.count=0 |
| K8s | Kubernetes API | server version, images, policies, node info | No - requires API access |
| OS | /proc, /etc | kernel, OS release, sysctl, modules | No - requires /proc access |
| SystemD | D-Bus | service status (containerd, docker, kubelet) | Yes (PR #19) - empty if D-Bus unavailable |
Data Collection:
- Execute `nvidia-smi -q -x`
- Parse XML output
- Extract:
  - Driver version
  - CUDA version
  - GPU count
  - GPU model (per-GPU)
  - Memory info
  - MIG configuration
Graceful Degradation (since PR #18):
if errors.Is(err, exec.ErrNotFound) || os.IsNotExist(err) {
	slog.Warn("nvidia-smi not found, returning empty GPU measurements")
	return &measurement.Measurement{
		Type: measurement.TypeGPU,
		Subtypes: []measurement.Subtype{{
			Name: "smi",
			Data: map[string]measurement.Reading{
				"gpu.count": measurement.Int(0),
			},
		}},
	}, nil
}

Data Collection:
- Get server version from /version endpoint
- List all pods, extract unique images
- Get ClusterPolicy CRDs (nvidia.com/v1)
- Get node info (first node only)
Limitations:
- Only collects first node info (scalability issue for multi-node clusters)
- Lists ALL pods across ALL namespaces (can be slow on large clusters)
- No pagination for pod listing
Data Collection:
- `/proc/cmdline` → GRUB boot parameters
- `/proc/modules` → Loaded kernel modules
- `/proc/sys/*` → Sysctl parameters
- `/etc/os-release` → OS identification
Platform Assumptions:
- Hardcoded Linux paths
- Won't work on Windows or macOS (intentional - GPU nodes are Linux)
Data Collection:
- Connect to system D-Bus
- Query properties for:
  - containerd.service
  - docker.service
  - kubelet.service
Graceful Degradation (since PR #19):
- Returns empty measurements if D-Bus unavailable
- Logs warning but doesn't fail
type: GPU # or K8s, OS, SystemD
subtypes:
  - name: smi
    data:
      driver-version: "570.133.20"
      cuda-version: "12.8"
      gpu.count: 8
      gpu.model: "NVIDIA H100"
    context:
      driver-version: "NVIDIA driver version installed on the system"

| Format | Extension | Description | Read Support | Write Support |
|---|---|---|---|---|
| `json` | .json | Pretty-printed JSON | Yes | Yes |
| `yaml` | .yaml | YAML with 2-space indent | Yes | Yes |
| `table` | - | Flattened key-value table | No | Yes (write only) |
| Destination | Format | Example | Implementation |
|---|---|---|---|
| File | path | `/tmp/snapshot.yaml` | writer.go:NewFileWriter |
| ConfigMap | `cm://namespace/name` | `cm://default/cns-snapshot` | configmap.go |
| HTTP URL | `https://...` | `https://example.com/snap.yaml` | http.go (read only) |
| Stdout | `-` or empty | Default | writer.go:NewStdoutWriter |
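
The destination formats in the table above can be told apart by their prefix. The sketch below shows one way to do that and to reject malformed ConfigMap URIs with an error instead of falling back to another destination. `Destination` and `classify` are hypothetical names, not the serializer's actual API.

```go
package main

import (
	"fmt"
	"strings"
)

// Destination is an illustrative result type for a parsed output target.
type Destination struct {
	Kind      string // "stdout", "configmap", "http" or "file"
	Namespace string // ConfigMap only
	Name      string // ConfigMap only
	Target    string // URL or file path
}

// classify maps a destination string to one of the four kinds above.
func classify(uri string) (Destination, error) {
	switch {
	case uri == "" || uri == "-":
		return Destination{Kind: "stdout"}, nil
	case strings.HasPrefix(uri, "cm://"):
		ns, name, ok := strings.Cut(strings.TrimPrefix(uri, "cm://"), "/")
		if !ok || ns == "" || name == "" || strings.Contains(name, "/") {
			return Destination{}, fmt.Errorf("invalid ConfigMap URI %q: want cm://namespace/name", uri)
		}
		return Destination{Kind: "configmap", Namespace: ns, Name: name}, nil
	case strings.HasPrefix(uri, "http://") || strings.HasPrefix(uri, "https://"):
		return Destination{Kind: "http", Target: uri}, nil
	default:
		return Destination{Kind: "file", Target: uri}, nil
	}
}

func main() {
	for _, u := range []string{"-", "cm://default/cns-snapshot", "https://example.com/snap.yaml", "/tmp/snapshot.yaml"} {
		d, err := classify(u)
		fmt.Printf("%+v %v\n", d, err)
	}
}
```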
Write Flow (after PR #32 - Server-Side Apply):
func (w *ConfigMapWriter) Serialize(ctx context.Context, data any) error {
	// 1. Marshal data to YAML/JSON
	content, err := serializeYAML(data)
	if err != nil {
		return err
	}
	// 2. Build ConfigMap apply configuration
	configMap := accorev1.ConfigMap(w.name, w.namespace).
		WithLabels(map[string]string{
			"app.kubernetes.io/name":      "cns",
			"app.kubernetes.io/component": "snapshot",
			"app.kubernetes.io/version":   version,
		}).
		WithData(map[string]string{
			"snapshot.yaml": string(content),
			"format":        string(w.format),
			"timestamp":     time.Now().UTC().Format(time.RFC3339),
		})
	// 3. Atomic Server-Side Apply (creates or updates)
	_, err = client.CoreV1().ConfigMaps(w.namespace).Apply(
		ctx,
		configMap,
		metav1.ApplyOptions{FieldManager: "cnsctl"},
	)
	return err
}

Key improvement: PR #32 replaced the race-prone Get-then-Create/Update pattern with atomic Server-Side Apply (SSA), eliminating data loss in concurrent writes.
Issues (Status):
- ~~Race condition: Get-then-Create is not atomic~~ ✅ Fixed by PR #32 (SSA)
- ~~No context timeout: Long-running writes can block indefinitely~~ (has 30s timeout)
- ~~Silent fallback: Invalid paths silently fall back to stdout~~ ✅ Fixed by PR #24 (merged)
func ParseURI(uri string) (scheme, namespace, name string, err error) {
// Supports:
// - cm://namespace/name → ConfigMap
// - https://example.com/... → HTTP
// - /path/to/file → File
// - - → Stdout
}

docs/
├── OVERVIEW.md # High-level product overview
├── architecture/
│ ├── README.md # Architecture overview (1264 lines!)
│ ├── api-server.md # API server architecture
│ ├── cli.md # CLI architecture
│ ├── component.md # Bundler component guide
│ └── data.md # Recipe data architecture
├── demos/
│ ├── e2e.md # End-to-end demo
│ └── s3c.md # S3C demo
├── integration/
│ ├── api-reference.md # API reference (695 lines)
│ ├── automation.md # CI/CD integration
│ ├── data-flow.md # Data flow documentation
│ ├── kubernetes-deployment.md # K8s deployment guide
│ └── recipe-development.md # Recipe development guide
└── user-guide/
├── agent-deployment.md # Agent deployment guide (900 lines)
├── api-reference.md # User-facing API reference
├── cli-reference.md # CLI reference (900 lines)
└── installation.md # Installation guide
| Document | Lines | Quality | Notes |
|---|---|---|---|
| architecture/README.md | 1264 | Excellent | Very comprehensive, good diagrams |
| user-guide/cli-reference.md | 900 | Excellent | Complete flag documentation |
| user-guide/agent-deployment.md | 900 | Good | Fixed in PR #15 |
| integration/api-reference.md | 695 | Good | Complete API documentation |
| architecture/data.md | 865 | Excellent | Detailed overlay system explanation |
| integration/recipe-development.md | 650 | Good | Helpful for contributors |
Strengths:
- Comprehensive CLI reference with all flags documented
- Good architecture documentation with mermaid diagrams
- Clear examples in most documents
- Recipe data architecture well explained
Gaps:
- No changelog (M26)
- No quick start guide (L12)
- No troubleshooting guide beyond basic tips
- Some documents reference draft features
File: Makefile (145 lines)
| Target | Description | Dependencies |
|---|---|---|
| info | Print project info | - |
| tidy | Update Go modules | - |
| upgrade | Upgrade all dependencies | - |
| lint | Lint Go and YAML | lint-go, lint-yaml |
| lint-go | Run golangci-lint | - |
| lint-yaml | Run yamllint | - |
| test | Run unit tests with race detector | - |
| e2e | Run integration tests | tools/e2e |
| scan | Vulnerability scan (go vet + grype) | - |
| qualify | Full qualification | test, lint, e2e, scan |
| server | Start development server | - |
| docs | Serve Go documentation | - |
| build | Build release binaries | tidy |
| image | Build and push container image | - |
| release | Run goreleaser | - |
| bump-major/minor/patch | Version bumping | tools/bump |
| clean | Clean directories | - |
| help | Show available targets | - |
Go Version: Uses go env GOVERSION (documented in info)
Linting:
- golangci-lint with .golangci.yaml config
- yamllint with .yamllint.yaml config
Release:
- goreleaser with .goreleaser.yaml config
- Multi-platform binaries (darwin/linux, amd64/arm64)
Container Images:
- Built with ko
- Registry: ghcr.io/nvidia (configurable via IMAGE_REGISTRY)
- Tag: latest (configurable via IMAGE_TAG)
Location: deployments/cns-agent/
| File | Purpose |
|---|---|
| 1-deps.yaml | RBAC resources (SA, Role, RoleBinding, ClusterRole, ClusterRoleBinding) |
| 2-job.yaml | Job manifest for agent deployment |
1-deps.yaml Analysis:
- Creates namespace-scoped RBAC (cns service account, role, rolebinding)
- Creates cluster-scoped RBAC (cns-node-reader clusterrole, clusterrolebinding)
- Includes secret list permission (potential security concern)
2-job.yaml Analysis:
- Uses hardcoded nodeSelector: nodeGroup: customer-gpu
- Uses specific tolerations for dedicated=user-workload
- Image: ghcr.io/mchmarny/cns:latest (should be ghcr.io/nvidia/cns:latest)
- Privileged security context
Issues Found:
- Image points to mchmarny fork instead of nvidia (likely test config; tracked as M27, fixed by PR #35)
- NodeSelector is environment-specific
- Tolerations are environment-specific
Status: ✅ FIXED (PR #27)
File: pkg/k8s/agent/job.go:62-70
Impact: Cannot deploy on PSS-restricted clusters without exemption
Context: The agent Job requires privileged: true security context to:
- Access nvidia-smi for GPU metrics
- Read D-Bus socket for SystemD service status
- Access /proc files with host PID namespace
Why It Matters: Many enterprise Kubernetes clusters enforce Pod Security Standards (PSS) at "restricted" or "baseline" level, which prohibit privileged containers. This prevents CNS agent deployment without cluster policy exceptions.
Fix (PR #27): Adds --privileged flag (default: true) allowing --privileged=false for PSS-restricted environments. In unprivileged mode, GPU and SystemD collectors return empty/degraded results.
Status: Open
File: pkg/recipe/data/*.yaml
Impact: Most configurations use base-only settings
Context: With only 3 overlay files covering 3/128 possible criteria combinations, most user queries fall through to the base configuration without environment-specific optimizations.
Why It Matters: The value proposition of CNS is hardware-aware, environment-specific configuration generation. Without overlays for A100, L40, GKE, AKS, OKE, or non-Ubuntu OS, users get generic configurations that may not be optimal for their environment.
Missing Coverage:
- A100 GPUs (common in existing deployments)
- L40 GPUs (common for inference)
- GKE, AKS, OKE platforms (major cloud providers)
- RHEL, COS, Amazon Linux (common enterprise OSes)
- Inference workloads on GB200
Status: ✅ FIXED (PR #16)
File: pkg/k8s/agent/deployer.go:72-117
Fix: Now attempts all deletions and reports errors
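The "attempt everything, then report" pattern can be expressed with errors.Join from the Go standard library. This is a sketch of the pattern only: the step type, the resource names, and cleanupAll are hypothetical and do not reproduce deployer.go.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a hypothetical named cleanup action, e.g. deleting one resource.
type step struct {
	name string
	run  func() error
}

// errNotFound stands in for an API error in this sketch.
var errNotFound = errors.New("not found")

// cleanupAll runs every step even if earlier ones fail, then returns all
// failures joined into one error (nil when every step succeeded).
func cleanupAll(steps []step) error {
	var errs []error
	for _, s := range steps {
		if err := s.run(); err != nil {
			errs = append(errs, fmt.Errorf("delete %s: %w", s.name, err))
		}
	}
	return errors.Join(errs...)
}

func main() {
	err := cleanupAll([]step{
		{"job", func() error { return nil }},
		{"rolebinding", func() error { return errNotFound }},
		{"serviceaccount", func() error { return nil }},
	})
	fmt.Println(err)
}
```

The joined error still works with errors.Is, so callers can check for a specific failure without parsing the message.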
Status: ✅ FIXED (PR #14)
File: pkg/cli/recipe.go:110-112
Fix: Validates that at least one criterion is provided
Status: ⏸️ WONTFIX
Rationale: Input validation exists; output validation adds complexity without clear benefit
Status: ✅ FIXED (PR #12)
Fix: Now uses -f consistently
Status: ⏸️ WONTFIX
Rationale: Verbosity is intentional for safety. --deploy-agent has significant side effects (creates K8s Job, RBAC, runs containers). Short aliases like -a or -d make accidental deployment too easy. The flag is typically used in scripts where verbosity doesn't hurt UX.
Status: ✅ PARTIALLY FIXED (PR #20)
Fix: Log streaming now provides real-time output with [agent] prefix
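Prefixing a streamed log can be done line by line with bufio.Scanner. The sketch below assumes the log source is any io.Reader (in a cluster this would be the pod log stream); streamWithPrefix and prefixLines are hypothetical names, not functions from the codebase.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// streamWithPrefix copies src to dst line by line, prepending prefix to each line.
func streamWithPrefix(dst io.Writer, src io.Reader, prefix string) error {
	scanner := bufio.NewScanner(src)
	for scanner.Scan() {
		if _, err := fmt.Fprintf(dst, "%s %s\n", prefix, scanner.Text()); err != nil {
			return err
		}
	}
	return scanner.Err()
}

// prefixLines is a convenience wrapper that works on in-memory strings.
func prefixLines(input, prefix string) (string, error) {
	var b strings.Builder
	err := streamWithPrefix(&b, strings.NewReader(input), prefix)
	return b.String(), err
}

func main() {
	// strings.NewReader stands in for the agent Job's log stream.
	logs := strings.NewReader("collecting GPU data\nsnapshot written\n")
	if err := streamWithPrefix(os.Stdout, logs, "[agent]"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```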
Status: ⏸️ WONTFIX
Rationale: Without --snapshot, validation is already immediate. With --snapshot, the snapshot must be loaded anyway to extract criteria. Moving enum validation to flag parsing requires custom flag types in urfave/cli v3, adding significant complexity for a narrow edge case.
Status: ✅ FIXED (PR #31)
File: pkg/recipe/metadata_store.go:199-205
Fix: Added slog.Warn() when no overlays match criteria. Warning includes criteria used and hint about potential optimization gap.
Example output: no environment-specific overlays matched, using base configuration only
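A sketch of how such a warning could be emitted with log/slog. The Criteria and Overlay types, the wildcard rule for empty fields, and the overlay name are simplified assumptions; only the warning text is taken from the example output above.

```go
package main

import (
	"log/slog"
	"os"
)

// Criteria and Overlay are simplified, hypothetical stand-ins for the recipe types.
type Criteria struct{ Service, Accelerator, Intent, OS string }

type Overlay struct {
	Name  string
	Match Criteria // assumption: empty fields act as wildcards
}

var logger = slog.New(slog.NewTextHandler(os.Stderr, nil))

func fieldMatches(want, got string) bool { return want == "" || want == got }

// matchOverlays returns the overlays whose non-empty fields all equal the query
// and warns when nothing matches, so base-only results are visible to the user.
func matchOverlays(all []Overlay, q Criteria) []Overlay {
	var out []Overlay
	for _, o := range all {
		m := o.Match
		if fieldMatches(m.Service, q.Service) && fieldMatches(m.Accelerator, q.Accelerator) &&
			fieldMatches(m.Intent, q.Intent) && fieldMatches(m.OS, q.OS) {
			out = append(out, o)
		}
	}
	if len(out) == 0 {
		logger.Warn("no environment-specific overlays matched, using base configuration only",
			"service", q.Service, "accelerator", q.Accelerator, "intent", q.Intent, "os", q.OS)
	}
	return out
}

func main() {
	overlays := []Overlay{
		{Name: "h100-eks-training", Match: Criteria{Service: "eks", Accelerator: "h100", Intent: "training"}},
	}
	matched := matchOverlays(overlays, Criteria{Service: "gke", Accelerator: "a100", Intent: "inference", OS: "cos"})
	logger.Info("overlay resolution finished", "matched", len(matched))
}
```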
Status: ✅ FIXED (PR #17)
Fix: Now case-insensitive with typo suggestions
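One common way to implement case-insensitive matching with typo suggestions is to lowercase the input and fall back to edit distance. The sketch below uses Levenshtein distance with a threshold of 2; the threshold, the error wording, and the example type names are assumptions, and PR #17 may use a different approach.

```go
package main

import (
	"fmt"
	"strings"
)

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the byte-wise edit distance between a and b.
func levenshtein(a, b string) int {
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur := make([]int, len(b)+1)
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(b)]
}

// resolve returns the canonical name for input, or an error that suggests the
// closest valid name when the input looks like a typo.
func resolve(input string, valid []string) (string, error) {
	in := strings.ToLower(strings.TrimSpace(input))
	best, bestDist := "", -1
	for _, v := range valid {
		if in == v {
			return v, nil
		}
		if d := levenshtein(in, v); bestDist < 0 || d < bestDist {
			best, bestDist = v, d
		}
	}
	if bestDist >= 0 && bestDist <= 2 {
		return "", fmt.Errorf("unknown type %q, did you mean %q?", input, best)
	}
	return "", fmt.Errorf("unknown type %q, valid values: %s", input, strings.Join(valid, ", "))
}

func main() {
	valid := []string{"gpu-operator", "network-operator"} // hypothetical bundle types
	fmt.Println(resolve("GPU-Operator", valid))
	fmt.Println(resolve("gpu-operatr", valid))
}
```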
Status: ⏸️ WONTFIX
Rationale: Constraint failures are environment-specific; generic suggestions would be misleading
Status: ✅ FIXED (PR #30)
File: pkg/cli/validate.go
Fix: Changed --fail-on-error to default to true. Users can opt-out with --fail-on-error=false for informational mode.
Status: ✅ FIXED (PR #18)
Fix: Degrades gracefully when nvidia-smi is missing and returns gpu-count=0
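A sketch of this kind of degradation, assuming the collector shells out to nvidia-smi and counts the GPU lines of its -L listing. gpuCount and countGPULines are hypothetical helpers; the real collector may query nvidia-smi differently.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// countGPULines counts the lines of `nvidia-smi -L` output that describe a GPU.
func countGPULines(out string) int {
	n := 0
	for _, line := range strings.Split(out, "\n") {
		if strings.HasPrefix(strings.TrimSpace(line), "GPU ") {
			n++
		}
	}
	return n
}

// gpuCount returns the number of GPUs reported by binary. A missing binary is
// treated as "no GPUs" rather than as a failure.
func gpuCount(binary string) (int, error) {
	path, err := exec.LookPath(binary)
	if err != nil {
		// Expected on nodes without the NVIDIA driver: degrade, do not fail.
		return 0, nil
	}
	out, err := exec.Command(path, "-L").Output()
	if err != nil {
		return 0, fmt.Errorf("run %s -L: %w", binary, err)
	}
	return countGPULines(string(out)), nil
}

func main() {
	n, err := gpuCount("nvidia-smi")
	fmt.Printf("gpu-count=%d err=%v\n", n, err)
}
```

The sketch only swallows the "binary not found" case; a binary that exists but fails still surfaces an error, which keeps real driver problems visible.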
Status: ✅ FIXED (PR #19)
Fix: Graceful degradation when D-Bus unavailable
Status: ✅ FIXED (PR #24)
File: pkg/serializer/writer.go:34-67
Fix: Returns an error instead of silent fallback when ConfigMap URI is invalid or inaccessible.
Status: ✅ FIXED (PR #15)
Status: ⏸️ WONTFIX
Rationale: Changing this would break existing workflows; documented behavior
Status: ⏸️ WONTFIX
File: pkg/cli/snapshot.go:121-124 (and similar)
Impact: Late error discovery
Context: Format validation (yaml, json, table) happens after the command starts executing, not during flag parsing.
Rationale: Validation happens as the first operation in Action handlers, so the practical impact is minimal. No expensive operations run before format validation.
Status: ✅ FIXED (PR #20)
Fix: Logs now streamed with [agent] prefix
Status: ⏸️ WONTFIX
File: pkg/cli/recipe.go:84-90
Impact: Inconsistent URI support
Context: The recipe command's --snapshot flag supports file paths, HTTP/HTTPS URLs, and ConfigMap URIs.
Rationale: Issue overstated - the flag documentation already clearly states "Supports: file paths, HTTP/HTTPS URLs, or ConfigMap URIs" and error messages are reasonably specific.
Status: ✅ FIXED (PR #29)
File: pkg/cli/bundle.go
Fix: Added kubeconfigFlag to the bundle command and uses FromFileWithKubeconfig to load recipes. Enables loading recipes from ConfigMap URIs.
Status: ⏸️ WONTFIX
Rationale: ClusterRole/ClusterRoleBinding are cluster-scoped and intentionally shared. Having a single "cns-node-reader" role is simpler and avoids role proliferation. The permissions are read-only and safe to share across namespaces.
Status: ✅ FIXED (PR #32)
File: pkg/serializer/configmap.go:109-132
Fix: Replaced Get-then-Create/Update with Kubernetes Server-Side Apply (SSA). Single atomic operation handles both create and update. Field ownership tracked via FieldManager: "cnsctl".
| ID | Category | Issue | Status |
|---|---|---|---|
| M1 | CLI | No command aliases (e.g., snap for snapshot) | Open |
| M2 | CLI | Help text formatting inconsistent | Open |
| M3 | CLI | Command help lacks examples (recipe, bundle) | ✅ FIXED (PR #34) |
| M4 | CLI | Error messages don't suggest fixes | Open |
| M5 | CLI | No progress output for long operations | ✅ PARTIALLY FIXED (PR #22) |
| M6 | CLI | --kubeconfig shown for all commands but not always used | ⏸️ WONTFIX (inaccurate - all commands that have it use it; bundle missing it is H23) |
| M7 | CLI | Completion command hidden | ✅ FIXED (PR #8) |
| M8 | Recipe | Overlay files not validated at load time | Open |
| M9 | Recipe | No dry-run mode | Open |
| M11 | Bundle | No component dependency visualization | Open |
| M18 | Collector | OS collector assumes Linux paths | ⏸️ WONTFIX (Linux-only is intentional - tool is for Linux GPU nodes) |
| M21 | Agent | Job name collisions possible | ⏸️ WONTFIX |
| M22 | Agent | No resource limit customization flags | Open |
| M26 | Docs | No changelog | Open |
| M27 | Build | deployments/cns-agent/2-job.yaml uses fork image registry | ✅ FIXED (PR #35) |
| M28 | K8s Collector | | ⏸️ WONTFIX (by design - collects current node via NODE_NAME env var) |
| ID | Category | Issue | Status |
|---|---|---|---|
| L1 | CLI | Version output format not customizable | Open |
| L2 | CLI | No shell completion for flag values | Open |
| L3 | CLI | Debug output very verbose | Open |
| L4 | Recipe | Component versions hardcoded in overlays | Open |
| L5 | Bundle | README templates not customizable | Open |
| L6 | Bundle | Script templates assume bash | Open |
| L7 | Validate | No constraint grouping in output | Open |
| L8 | Collector | Metrics exposed but not documented | Open |
| L9 | Serializer | No compression option | Open |
| L10 | Agent | Labels not customizable | Open |
| L11 | Agent | No annotations support | Open |
| L12 | Docs | No quick start guide | Open |
| L13 | Docs | No comparison with alternatives | Open |
| L14 | Docs | No video tutorials | Open |
| L15 | CLI | No quiet mode | Open |
| L16 | Bundle | Silently overwrites existing output directory | Open (E2E) |
| L17 | CLI | Local snapshot on macOS doesn't suggest --deploy-agent | Open (E2E) |
| L18 | CLI | Mixed stdout/stderr output ordering | Open (E2E) |
- Fix H23: Add missing --kubeconfig flag to bundle command ✅ MERGED (PR #29)
- Fix M27: Update deployments/cns-agent/2-job.yaml to use correct image registry ✅ MERGED (PR #35)
- Fix H16: Return error instead of silent fallback for ConfigMap writes ✅ MERGED (PR #24)
- Fix C1: Add --privileged flag for PSS compliance ✅ MERGED (PR #27)
- Fix H13: Default --fail-on-error to true ✅ MERGED (PR #30)
- Fix H8: Warn when using base-only config ✅ MERGED (PR #31)
- Fix H25: Use SSA for atomic ConfigMap updates ✅ MERGED (PR #32)
- Fix M3: Add command examples to help text ✅ MERGED (PR #34)

- Add short alias for --deploy-agent (H2) ⏸️ WONTFIX
- Move format validation to flag parsing (H20) ⏸️ WONTFIX
- Move criteria validation to flag parsing (H7) ⏸️ WONTFIX
- Add warning when using base-only config (H8) ✅ PR #31
- Add command aliases (M1)
- Add A100 overlays - Common existing deployments
- Add L40 overlays - Common inference workloads
- Add GKE/AKS overlays - Major cloud providers
- Add RHEL overlays - Enterprise Linux
- Add inference overlays for all GPUs - Complete workload coverage
- Fix H24: Make ClusterRole names configurable ⏸️ WONTFIX
- Fix H25: Use atomic ConfigMap updates ✅ MERGED (PR #32)
- Add resource limit flags (M22)
- Add labels/annotations flags (L10, L11)
- Add changelog (M26)
- Add quick start guide (L12)
- Add troubleshooting guide
- Add architecture diagrams to README
- Document exposed metrics (L8)
- Add structured telemetry
- Add timing information to outputs
| Component | Key Files |
|---|---|
| CLI | pkg/cli/*.go |
| Recipe | pkg/recipe/*.go, pkg/recipe/data/*.yaml |
| Bundler | pkg/bundler/*.go, pkg/component/*/ |
| Deployer | pkg/deployer/provider/*/ |
| Collector | pkg/collector/*/ |
| Snapshotter | pkg/snapshotter/*.go |
| Agent | pkg/k8s/agent/*.go |
| Serializer | pkg/serializer/*.go |
| Validator | pkg/validator/*.go |
| K8s Client | pkg/k8s/client/*.go |
Service Types:
- eks - Amazon EKS
- gke - Google GKE
- aks - Azure AKS
- oke - Oracle OKE
- self-managed - Self-managed Kubernetes
Accelerator Types:
- h100 - NVIDIA H100
- gb200 - NVIDIA GB200
- a100 - NVIDIA A100
- l40 - NVIDIA L40
Intent Types:
- training - ML training workloads
- inference - ML inference workloads
OS Types:
- ubuntu - Ubuntu Linux
- rhel - Red Hat Enterprise Linux
- cos - Container-Optimized OS (GKE)
- amazonlinux - Amazon Linux
{Type}.{Subtype}.{Key}
Supported Types:
- K8s
- GPU
- OS
- SystemD
Examples:
- K8s.server.version
- GPU.smi.driver-version
- GPU.smi.cuda-version
- GPU.smi.gpu.count
- OS.release.ID
- OS.release.VERSION_ID
- OS.sysctl./proc/sys/kernel/osrelease
- OS.kmod.nvidia
- SystemD.containerd.service.ActiveState
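A minimal sketch of parsing this path format, assuming the first two dots separate Type and Subtype and everything after them is the Key (keys such as gpu.count and /proc/sys/kernel/osrelease contain dots or slashes). Path and parsePath are hypothetical names. Note the assumption breaks for subtypes that themselves contain a dot, such as containerd.service, so the real parser presumably handles those differently.

```go
package main

import (
	"fmt"
	"strings"
)

// Path is a hypothetical parsed measurement path.
type Path struct{ Type, Subtype, Key string }

var supportedTypes = map[string]bool{"K8s": true, "GPU": true, "OS": true, "SystemD": true}

// parsePath splits on the first two dots only, so the key may contain dots.
func parsePath(s string) (Path, error) {
	parts := strings.SplitN(s, ".", 3)
	if len(parts) != 3 || parts[1] == "" || parts[2] == "" {
		return Path{}, fmt.Errorf("invalid path %q: want {Type}.{Subtype}.{Key}", s)
	}
	if !supportedTypes[parts[0]] {
		return Path{}, fmt.Errorf("unsupported type %q in path %q", parts[0], s)
	}
	return Path{Type: parts[0], Subtype: parts[1], Key: parts[2]}, nil
}

func main() {
	for _, s := range []string{"GPU.smi.gpu.count", "OS.sysctl./proc/sys/kernel/osrelease", "Net.iface.mtu"} {
		p, err := parsePath(s)
		fmt.Printf("%q -> %+v (err: %v)\n", s, p, err)
	}
}
```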
| Code | Current Meaning |
|---|---|
| 0 | Success (validation failures also return 0 when --fail-on-error=false; the flag defaults to true since PR #30) |
| 1 | Any error |
Recommended Enhancement:
| Code | Proposed Meaning |
|---|---|
| 0 | Success |
| 1 | User error (invalid flags) |
| 2 | Execution error (API failures) |
| 3 | Validation failure (with --fail-on-error) |
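The proposed scheme could be implemented by classifying errors with sentinel values and errors.Is. This is a sketch of the proposal, not existing behavior: the sentinel errors and the exitCode function are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Sentinel errors (hypothetical) that commands would wrap with %w.
var (
	ErrUsage      = errors.New("usage error")
	ErrValidation = errors.New("validation failed")
	// errAPI stands in for any execution failure, such as an API call error.
	errAPI = errors.New("API server unreachable")
)

// exitCode maps an error onto the proposed exit codes.
func exitCode(err error) int {
	switch {
	case err == nil:
		return 0
	case errors.Is(err, ErrUsage):
		return 1
	case errors.Is(err, ErrValidation):
		return 3
	default:
		return 2 // anything unclassified is treated as an execution error
	}
}

func main() {
	err := fmt.Errorf("constraint GPU.smi.driver-version: %w", ErrValidation)
	fmt.Fprintln(os.Stderr, err)
	os.Exit(exitCode(err))
}
```

Distinct codes let CI scripts tell a failed validation (3) apart from a broken cluster connection (2) without parsing log output.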
| Variable | Used By | Default | Description |
|---|---|---|---|
| CNS_NAMESPACE | snapshot | gpu-operator | Agent deployment namespace |
| CNS_IMAGE | snapshot | ghcr.io/nvidia/cns:latest | Agent container image |
| KUBECONFIG | snapshot, recipe, validate | ~/.kube/config | Kubernetes config path |
| LOG_LEVEL | all | info | Logging level |
| NO_COLOR | all | false | Disable colored output |
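A sketch of resolving these variables with their documented defaults. The Settings struct and loadSettings are hypothetical, and treating any non-empty NO_COLOR value as "disable color" is an assumption about how the flag is interpreted. The lookup function is injected so the logic can be exercised without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
)

// Settings is a hypothetical holder for environment-derived configuration.
type Settings struct {
	Namespace string
	Image     string
	LogLevel  string
	NoColor   bool
}

// lookupFunc has the same shape as os.LookupEnv.
type lookupFunc func(key string) (string, bool)

// valueOr returns the variable's value, or def when it is unset or empty.
func valueOr(lookup lookupFunc, key, def string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return def
}

// loadSettings applies the defaults from the table above.
func loadSettings(lookup lookupFunc) Settings {
	noColor, _ := lookup("NO_COLOR")
	return Settings{
		Namespace: valueOr(lookup, "CNS_NAMESPACE", "gpu-operator"),
		Image:     valueOr(lookup, "CNS_IMAGE", "ghcr.io/nvidia/cns:latest"),
		LogLevel:  valueOr(lookup, "LOG_LEVEL", "info"),
		NoColor:   noColor != "", // assumption: any non-empty value disables color
	}
}

func main() {
	fmt.Printf("%+v\n", loadSettings(os.LookupEnv))
}
```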
| Version | Date | Changes |
|---|---|---|
| 4.3 | 2026-01-15 | Added PR #34 (M3) and #35 (M27). Total: 58 issues (27 open, 19 fixed, 12 wontfix). Phase 1 complete! |
| 4.2 | 2026-01-15 | Major refresh: All 7 PRs now MERGED (#24, #27, #29, #30, #31, #32, #33) |
| 4.1 | 2026-01-15 | Added L16-L18 from E2E testing |
| 4.0 | 2026-01-14 | Complete fresh analysis with deep context. Added H22-H25, M27-M28 |
| Symbol | Meaning |
|---|---|
| ✅ FIXED | Issue resolved and merged to upstream |
| ✅ PARTIALLY FIXED | Issue improved but not fully resolved |
| ⏸️ WONTFIX | Issue acknowledged but intentionally not fixing |
| Open | Issue confirmed, no fix submitted yet |
| (E2E) | Issue identified during E2E testing |
Document generated by Claude Opus 4.5 based on comprehensive codebase analysis. Last synced with upstream: 2026-01-15 (commit a68ee61)
End of Document