@dims
Created January 15, 2026 12:43
CNS (Cloud Native Stack) CLI UX Analysis - v4.0

  • Document Version: 4.3
  • Generated: 2026-01-14
  • Last Updated: 2026-01-15
  • Codebase Branch: main
  • Upstream Commit: a68ee61
  • Analyzer: Claude Opus 4.5


Executive Summary

This document provides a comprehensive UX analysis of the CNS CLI tool (cnsctl), covering CLI design patterns, agent deployment security, recipe system coverage, bundler functionality, collector subsystems, and developer experience. This v4.0 is a complete fresh analysis with deep context for each issue.

Key Findings Summary

| Priority | Open | Fixed | Wontfix | Total |
|----------|------|-------|---------|-------|
| Critical | 1    | 3     | 1       | 5     |
| High     | 0    | 12    | 7       | 19    |
| Medium   | 8    | 4     | 4       | 16    |
| Low      | 18   | 0     | 0       | 18    |
| Total    | 27   | 19    | 12      | 58    |

Legend: Open = no action taken, Fixed = merged PR, Wontfix = deliberately not fixing

Open PRs (as of 2026-01-15)

| PR | Description | Status | Our Work? |
|----|-------------|--------|-----------|
| #5 | Add OCI Build and Push functionality | OPEN | No |

Merged PRs from This Analysis (Our Work)

| PR | Issue | Description | Status |
|----|-------|-------------|--------|
| #35 | M27 | Fix image registry in example Job manifest | ✅ MERGED |
| #34 | M3 | Add examples to recipe and bundle command help | ✅ MERGED |
| #33 | (E2E) | Log when CLI flags override snapshot-detected criteria | ✅ MERGED |
| #32 | H25 | Use SSA for atomic ConfigMap updates | ✅ MERGED |
| #31 | H8 | Warn when using base-only config | ✅ MERGED |
| #30 | H13 | Default --fail-on-error to true | ✅ MERGED |
| #29 | H23 | Enable kubeconfig support for bundle command | ✅ MERGED |
| #27 | C1 | Add --privileged flag for PSS compliance | ✅ MERGED |
| #24 | H16 | Return error instead of silent fallback | ✅ MERGED |

Other Merged PRs (Reference)

| PR | Description | Status |
|----|-------------|--------|
| #28 | Dependency upgrades | ✅ MERGED |
| #26 | Add Flox Env for Dev Tooling | ✅ MERGED |
| #23 | Add --image-pull-secret flag | ✅ MERGED |
| #22 | Add info logging to collectors | ✅ MERGED |
| #21 | Add make image target | ✅ MERGED |
| #20 | Stream agent Job logs during wait | ✅ MERGED |
| #19 | Graceful degradation when D-Bus unavailable | ✅ MERGED |
| #18 | Graceful degradation when nvidia-smi missing | ✅ MERGED |
| #17 | Case-insensitive bundle type with typo suggestions | ✅ MERGED |
| #16 | Improve resource cleanup error handling | ✅ MERGED |
| #15 | Fix agent-deployment.md documentation | ✅ MERGED |
| #14 | Add validation for recipe criteria | ✅ MERGED |
| #12 | Standardize CLI flag aliases | ✅ MERGED |

What's New in v4.0

  • Complete fresh analysis with updated codebase (commit 5620b0d)
  • Deep context added for every issue explaining why it matters
  • Identified 4 new issues (H22-H25) from deep analysis
  • Added documentation/Makefile/YAML analysis findings
  • Created 6 PRs: #24, #27, #29, #30, #31, #32
  • Corrected issue counts: 55 total issues (was incorrectly stated as 54)
  • More comprehensive call graphs and architecture diagrams

Table of Contents

  1. Command Architecture
  2. CLI Flag Analysis
  3. Call Graphs
  4. Agent Deployment System
  5. Recipe System
  6. Bundler System
  7. Collector System
  8. Serializer System
  9. Documentation Analysis
  10. Build System Analysis
  11. Issue Catalog
  12. UX Improvement Roadmap
  13. Appendices

1. Command Architecture

1.1 Command Hierarchy

cnsctl (root)
├── snapshot    - Capture system configuration snapshot
├── recipe      - Generate configuration recipe from criteria
├── bundle      - Generate artifact bundle from recipe
├── validate    - Validate cluster against recipe constraints
├── completion  - Shell completion scripts (visible since PR #8)
└── version     - Display version information

1.2 Global Flags

| Flag | Type | Default | Env Var | Description |
|------|------|---------|---------|-------------|
| --debug | bool | false | CNS_DEBUG | Enable debug logging |
| --log-json | bool | false | CNS_LOG_JSON | Enable structured JSON logging |

1.3 Shared Flags

| Flag | Alias | Type | Default | Used By |
|------|-------|------|---------|---------|
| --output | -o | string | stdout | snapshot, recipe, validate, bundle |
| --format | -f | string | yaml | snapshot, recipe, validate |
| --kubeconfig | -k | string | (auto) | snapshot, recipe, validate |

1.4 Command Flow Overview

User Request
    │
    ├─► snapshot ─► Collectors (GPU/K8s/OS/SystemD) ─► Serializer ─► Output
    │       │
    │       └─► [--deploy-agent] ─► K8s Job ─► ConfigMap
    │
    ├─► recipe ─► Criteria ─► Overlay Matcher ─► Merger ─► RecipeResult
    │       │
    │       └─► [--snapshot] ─► Extract criteria from snapshot
    │
    ├─► validate ─► Load Recipe + Snapshot ─► Constraint Evaluator ─► Result
    │
    └─► bundle ─► Registry ─► Parallel Bundlers ─► Deployer ─► Files

2. CLI Flag Analysis

2.1 Snapshot Command Flags

File: pkg/cli/snapshot.go:19-174

| Flag | Alias | Type | Default | Required | Description |
|------|-------|------|---------|----------|-------------|
| --deploy-agent | - | bool | false | No | Deploy K8s Job for snapshot |
| --namespace | - | string | gpu-operator | No | Agent namespace (env: CNS_NAMESPACE) |
| --image | - | string | ghcr.io/nvidia/cns:latest | No | Agent image (env: CNS_IMAGE) |
| --image-pull-secret | - | []string | [] | No | Image pull secrets for private registries |
| --job-name | - | string | cns | No | K8s Job name |
| --service-account-name | - | string | cns | No | ServiceAccount name |
| --node-selector | - | []string | [] | No | Node selectors (key=value) |
| --toleration | - | []string | [] | No | Tolerations (key=value:effect). Default: all taints tolerated |
| --timeout | - | duration | 5m | No | Job completion timeout |
| --cleanup | - | bool | true | No | Remove resources after completion |
| --output | -o | string | stdout | No | Output destination |
| --format | -f | string | yaml | No | Output format (yaml, json, table) |
| --kubeconfig | -k | string | (auto) | No | Path to kubeconfig file |

Key Observations:

  • The --cleanup flag now defaults to true (changed by an earlier fix), so agent resources are removed unless explicitly retained
  • --toleration when empty uses universal toleration (operator: Exists)
  • --kubeconfig flag is present but not used in local snapshot mode (only agent mode)

2.2 Recipe Command Flags

File: pkg/cli/recipe.go:21-145

| Flag | Alias | Type | Default | Required | Description |
|------|-------|------|---------|----------|-------------|
| --service | - | string | - | No | K8s service type (eks, gke, aks, oke) |
| --accelerator | --gpu | string | - | No | GPU type (h100, gb200, a100, l40) |
| --intent | - | string | - | No | Workload intent (training, inference) |
| --os | - | string | - | No | OS type (ubuntu, rhel, cos, amazonlinux) |
| --nodes | - | int | 0 | No | Number of GPU nodes |
| --snapshot | -s | string | - | No | Path/URI to snapshot |
| --output | -o | string | stdout | No | Output destination |
| --format | -f | string | yaml | No | Output format |
| --kubeconfig | -k | string | (auto) | No | Kubeconfig for ConfigMap access |

Key Observations:

  • Either criteria flags OR --snapshot should be provided
  • If --snapshot provided, criteria are extracted from it
  • CLI criteria flags override snapshot-extracted values
  • Validation added in PR #14: at least one criteria required

2.3 Bundle Command Flags

File: pkg/cli/bundle.go:25-202

| Flag | Alias | Type | Default | Required | Description |
|------|-------|------|---------|----------|-------------|
| --recipe | -r | string | - | Yes | Path/URI to recipe |
| --bundlers | -b | []string | [] | No | Bundler types to execute |
| --output | -o | string | . | No | Output directory |
| --set | - | []string | [] | No | Value overrides (bundler:path=value) |
| --system-node-selector | - | []string | [] | No | System component node selectors |
| --system-node-toleration | - | []string | [] | No | System component tolerations |
| --accelerated-node-selector | - | []string | [] | No | GPU node selectors |
| --accelerated-node-toleration | - | []string | [] | No | GPU node tolerations |
| --deployer | - | string | script | No | Deployment method (script, argocd, flux) |

Key Observations:

  • --output is a directory here vs file for other commands (H19)
  • Has --kubeconfig flag defined but never used in code (see H23)
  • --bundlers is case-insensitive since PR #17
  • When --bundlers empty, all registered bundlers execute

2.4 Validate Command Flags

File: pkg/cli/validate.go:20-167

| Flag | Alias | Type | Default | Required | Description |
|------|-------|------|---------|----------|-------------|
| --recipe | -r | string | - | Yes | Path/URI to recipe |
| --snapshot | -s | string | - | Yes | Path/URI to snapshot |
| --fail-on-error | - | bool | false | No | Exit non-zero on validation failure |
| --output | -o | string | stdout | No | Output destination |
| --format | -f | string | yaml | No | Output format |
| --kubeconfig | -k | string | (auto) | No | Kubeconfig for ConfigMap access |

Key Observations:

  • Both --recipe and --snapshot are required
  • Without --fail-on-error, validation failures return exit code 0 (H13)
  • Supports ConfigMap URIs for both inputs

2.5 Flag Consistency Analysis

| Aspect | Commands | Status | Issue |
|--------|----------|--------|-------|
| --format alias -f | All with format | ✅ Consistent | Fixed in PR #12 |
| --output alias -o | All | ✅ Consistent | - |
| --output meaning | bundle=dir, others=file | ❌ Inconsistent | H19 (WONTFIX) |
| --kubeconfig alias -k | snapshot, recipe, validate | ✅ Consistent | - |
| --kubeconfig on bundle | Defined but unused | ❌ Dead code | H23 (NEW) |
| Short alias for --deploy-agent | snapshot | ❌ Missing | H2 |
| Short alias for --fail-on-error | validate | ❌ Missing | - |

3. Call Graphs

3.1 Snapshot Command Call Graph

snapshotCmd() [pkg/cli/snapshot.go:19]
│
├─► Parse CLI flags [snapshot.go:119-168]
│   ├─ serializer.Format(cmd.String("format"))
│   │   └─ Returns JSON, YAML, or Table format
│   ├─ collector.NewDefaultFactory(collector.WithVersion(version))
│   │   └─ Creates factory for GPU, K8s, OS, SystemD collectors
│   └─ snapshotter.NodeSnapshotter{} initialization
│       ├─ Version, Factory, Serializer configured
│       └─ AgentConfig set if --deploy-agent
│
└─► ns.Measure(ctx) [pkg/snapshotter/snapshot.go:42]
    │
    ├─ IF AgentConfig.Enabled:
    │  └─► n.measureWithAgent(ctx) [pkg/snapshotter/agent.go:126-223]
    │      ├─► k8sclient.GetKubeClient(kubeconfig)
    │      │   └─ Returns cached clientset, restconfig
    │      ├─► agent.NewDeployer(clientset, config, opts...)
    │      │   └─ Configures namespace, image, nodeSelector, etc.
    │      │
    │      ├─► deployer.Deploy(ctx) [pkg/k8s/agent/deployer.go:13-47]
    │      │   ├─► d.CheckPermissions(ctx) [permissions.go:11-76]
    │      │   │   └─ SelfSubjectAccessReview for each required permission
    │      │   ├─► d.ensureServiceAccount(ctx) [rbac.go:16-25]
    │      │   ├─► d.ensureRole(ctx) [rbac.go:27-49]
    │      │   ├─► d.ensureRoleBinding(ctx) [rbac.go:51-77]
    │      │   ├─► d.ensureClusterRole(ctx) [rbac.go:79-110]
    │      │   │   └─ **HARDCODED name: "cns-node-reader"** (H24)
    │      │   ├─► d.ensureClusterRoleBinding(ctx) [rbac.go:112-143]
    │      │   │   └─ **HARDCODED name: "cns-node-reader"** (H24)
    │      │   └─► d.ensureJob(ctx) [job.go:12-31]
    │      │       └─► d.buildJob(ctx) [job.go:33-138]
    │      │           ├─ Builds privileged pod spec
    │      │           ├─ Sets nodeSelector, tolerations
    │      │           └─ Adds volume mounts for /run/systemd
    │      │
    │      ├─► deployer.WaitForJobCompletion(ctx, timeout) [wait.go:13-93]
    │      │   ├─ Watch Job status
    │      │   ├─► WaitForPodReady(ctx) [wait.go:96-147]
    │      │   │   └─ Detect pod errors: CrashLoopBackOff, ImagePullBackOff, etc.
    │      │   └─► StreamLogs(ctx) [wait.go:150-195]
    │      │       └─ Stream pod logs with [agent] prefix
    │      │
    │      ├─► deployer.GetSnapshot(ctx) [deployer.go:119-166]
    │      │   └─ Read from ConfigMap, parse YAML
    │      │
    │      └─ defer: deployer.Cleanup(ctx, opts) [deployer.go:72-117]
    │          └─ Delete Job, SA, Role, RoleBinding, ClusterRole, ClusterRoleBinding
    │
    └─ ELSE (local mode):
       └─► n.measure(ctx) [pkg/snapshotter/snapshot.go:53-193]
           ├─ errgroup.WithContext(ctx)
           │
           ├─ g.Go: metadata collection
           │   └─ Hostname, timestamp, version
           │
           ├─ g.Go: k8sCollector.Collect(gctx)
           │   └─ [pkg/collector/k8s/k8s.go]
           │       ├─ Server version from /version
           │       ├─ Pod images from all namespaces
           │       ├─ ClusterPolicy from nvidia.com
           │       └─ Node info (first node)
           │
           ├─ g.Go: systemdCollector.Collect(gctx)
           │   └─ [pkg/collector/systemd/systemd.go]
           │       └─ D-Bus queries for containerd, docker, kubelet
           │       └─ **Graceful degradation** if D-Bus unavailable (PR #19)
           │
           ├─ g.Go: osCollector.Collect(gctx)
           │   └─ [pkg/collector/os/os.go]
           │       ├─ /proc/cmdline (grub params)
           │       ├─ /proc/modules (kmod)
           │       ├─ /proc/sys/* (sysctl)
           │       └─ /etc/os-release
           │
           ├─ g.Go: gpuCollector.Collect(gctx)
           │   └─ [pkg/collector/gpu/gpu.go]
           │       └─ nvidia-smi -q -x
           │       └─ **Graceful degradation** if nvidia-smi missing (PR #18)
           │
           ├─ g.Wait()
           │   └─ Fail-fast on first error
           │
           └─► n.Serializer.Serialize(ctx, snap)
               └─ Output to file, ConfigMap, or stdout

3.2 Recipe Command Call Graph

recipeCmd() [pkg/cli/recipe.go:21]
│
├─► Parse CLI flags
│   └─ serializer.Format(cmd.String("format"))
│
├─► recipe.NewBuilder(recipe.WithVersion(version))
│
├─ IF --snapshot provided:
│  │
│  ├─► serializer.FromFileWithKubeconfig[Snapshot](path, kubeconfig)
│  │   └─ Supports: file path, HTTP/HTTPS URL, cm://namespace/name
│  │
│  ├─► extractCriteriaFromSnapshot(snap) [recipe.go:170-268]
│  │   │
│  │   ├─ TypeK8s → Service detection
│  │   │   ├─ Check K8s.server.version for "-eks-", "-gke", "-aks"
│  │   │   └─ Map to CriteriaServiceEKS, etc.
│  │   │
│  │   ├─ TypeGPU → Accelerator detection
│  │   │   └─ Check gpu.model for "h100", "gb200", "a100", "l40"
│  │   │
│  │   └─ TypeOS → OS detection
│  │       └─ Check OS.release.ID
│  │
│  └─► applyCriteriaOverrides(cmd, criteria) [recipe.go:270-304]
│      └─ CLI flags override snapshot-extracted values
│
├─ ELSE:
│  └─► buildCriteriaFromCmd(cmd) [recipe.go:148-168]
│      └─► recipe.BuildCriteria(opts...)
│          └─ Validation: at least one criteria required (PR #14)
│
└─► builder.BuildFromCriteria(ctx, criteria) [pkg/recipe/builder.go:42-95]
    │
    ├─► loadMetadataStore(ctx) [pkg/recipe/metadata_store.go:39-135]
    │   ├─ fs.WalkDir(metadataFS, "data")
    │   │   └─ Embedded files: base.yaml + overlay/*.yaml
    │   ├─ Parse base.yaml
    │   │   └─ Default components, constraints
    │   └─ Parse overlay/*.yaml files
    │       └─ Environment-specific configurations
    │
    └─► store.BuildRecipeResult(ctx, criteria) [metadata_store.go:169-225]
        │
        ├─► store.FindMatchingOverlays(criteria)
        │   └─ For each overlay:
        │       └─ overlay.Spec.Criteria.Matches(criteria)
        │           └─ Specificity scoring (0-5 points)
        │
        ├─ Merge base with overlays (specificity order)
        │   └─ Lower specificity first, then higher
        │
        ├─ mergedSpec.ValidateDependencies()
        │   └─ Check all dependencyRefs resolve
        │
        ├─ mergedSpec.TopologicalSort()
        │   └─ Order by deploymentOrder
        │
        └─ Return RecipeResult with:
            ├─ Criteria (input + detected)
            ├─ ComponentRefs (with values, overrides)
            ├─ Constraints (validation rules)
            └─ Metadata (appliedOverlays, version)

3.3 Bundle Command Call Graph

bundleCmd() [pkg/cli/bundle.go:25]
│
├─► Parse CLI flags
│   ├─► config.ParseValueOverrides(--set flags)
│   │   └─ Parse "bundler:path.to.field=value" format
│   ├─► snapshotter.ParseNodeSelectors()
│   ├─► snapshotter.ParseTolerations()
│   └─ Validate deployer type: script, argocd, flux
│
├─► serializer.FromFile[RecipeResult](recipePath)
│   └─ **NOTE: Does NOT use kubeconfig flag** (H23)
│
├─► registry.NewFromGlobal(config) [pkg/bundler/registry/registry.go]
│   └─ Auto-registered bundlers via init():
│       ├─ certmanager [pkg/component/certmanager/]
│       ├─ gpuoperator [pkg/component/gpuoperator/]
│       ├─ networkoperator [pkg/component/networkoperator/]
│       ├─ nvsentinel [pkg/component/nvsentinel/]
│       └─ skyhook [pkg/component/skyhook/]
│
├─► bundler.New(opts...) [pkg/bundler/bundler.go:136-173]
│   └─ Apply overrides, node selectors, tolerations
│
└─► b.Make(ctx, recipe, outputDir) [bundler.go:180-244]
    │
    ├─ Validate input (non-nil recipe)
    │
    ├─ Create output directory
    │
    ├─► b.selectBundlers(input, types) [bundler.go:389-425]
    │   └─ If types empty, select all registered bundlers
    │
    ├─► b.makeParallel(ctx, input, dir, bundlers) [bundler.go:248-334]
    │   └─ errgroup.WithContext(ctx)
    │       └─ For each bundler (concurrent):
    │           └─► b.executeBundler(ctx, type, bundler, input, dir)
    │               ├─► bundler.Validate(ctx, input)
    │               │   └─ Check component exists in recipe
    │               └─► bundler.Make(ctx, input, dir)
    │                   ├─ GetComponentRef(name)
    │                   ├─ GetValuesForComponent(name)
    │                   │   └─ Merge: base → valuesFile → overrides → CLI --set
    │                   ├─ CreateBundleDir(subdir)
    │                   ├─ GenerateFileFromTemplate(values.yaml)
    │                   ├─ GenerateFileFromTemplate(install.sh)
    │                   ├─ GenerateFileFromTemplate(uninstall.sh)
    │                   ├─ GenerateFileFromTemplate(README.md)
    │                   └─ GenerateResult() with checksums
    │
    └─► b.createRootArtifacts(ctx, input, dir) [bundler.go:430-461]
        ├─► b.writeRecipeFile(recipe, dir)
        │   └─ Copy recipe.yaml to output
        └─► deployer.Generate(ctx, recipe, dir)
            ├─ ArgoCD: app-of-apps.yaml + Application CRs per component
            ├─ Flux: kustomization.yaml + HelmRelease CRs with dependsOn
            └─ Script: README.md with helm install commands

3.4 Validate Command Call Graph

validateCmd() [pkg/cli/validate.go:20]
│
├─► Parse CLI flags
│   └─ serializer.Format(cmd.String("format"))
│
├─► serializer.FromFileWithKubeconfig[RecipeResult](recipePath, kubeconfig)
│   └─ Supports: file, HTTP/HTTPS URL, cm://namespace/name
│
├─► serializer.FromFileWithKubeconfig[Snapshot](snapshotPath, kubeconfig)
│   └─ Supports: file, HTTP/HTTPS URL, cm://namespace/name
│
├─► validator.New(validator.WithVersion(version))
│
└─► v.Validate(ctx, recipe, snapshot) [pkg/validator/validator.go:49-108]
    │
    ├─► NewValidationResult()
    │
    ├─ For each recipe.Constraints:
    │   └─► v.evaluateConstraint(constraint, snap) [validator.go:111-185]
    │       │
    │       ├─► ParseConstraintPath(constraint.Name)
    │       │   └─ Split "{Type}.{Subtype}.{Key}"
    │       │
    │       ├─► path.ExtractValue(snap)
    │       │   └─ Find matching measurement.subtype.data[key]
    │       │
    │       ├─► ParseConstraintExpression(constraint.Value)
    │       │   └─ Parse operators: >=, <=, ==, !=, >, <
    │       │
    │       └─► parsed.Evaluate(actual)
    │           └─ Version comparison or string match
    │
    ├─ Calculate summary:
    │   ├─ Passed count
    │   ├─ Failed count
    │   ├─ Skipped count (missing data)
    │   └─ Overall status: pass/fail/partial
    │
    └─ Return ValidationResult

4. Agent Deployment System

4.1 Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    User Workstation                              │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │  cnsctl snapshot --deploy-agent                              ││
│  │    │                                                        ││
│  │    ├─► CheckPermissions() ─ SelfSubjectAccessReview         ││
│  │    │   └─ Verifies: create configmaps, get nodes, etc.      ││
│  │    │                                                        ││
│  │    ├─► Deploy() ─ Create RBAC + Job                         ││
│  │    │   ├─ ServiceAccount, Role, RoleBinding (namespaced)    ││
│  │    │   ├─ ClusterRole, ClusterRoleBinding (cluster-scoped)  ││
│  │    │   │   └─ **HARDCODED names** (H24)                     ││
│  │    │   └─ Job with privileged pod                           ││
│  │    │                                                        ││
│  │    ├─► WaitForJobCompletion() ─ Watch Job status            ││
│  │    │   ├─ WaitForPodReady() with error detection            ││
│  │    │   │   └─ CrashLoopBackOff, ImagePullBackOff, etc.      ││
│  │    │   └─ StreamLogs() with [agent] prefix                  ││
│  │    │                                                        ││
│  │    ├─► GetSnapshot() ─ Read from ConfigMap                  ││
│  │    │   └─ Parse YAML from data.snapshot.yaml                ││
│  │    │                                                        ││
│  │    └─► Cleanup() ─ Delete resources (if --cleanup=true)     ││
│  │        └─ Attempts all deletions, reports errors            ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                   Kubernetes Cluster                             │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Namespace: gpu-operator (default)                         │  │
│  │                                                            │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐       │  │
│  │  │ServiceAcct  │  │    Role     │  │ RoleBinding  │       │  │
│  │  │   "cns"     │  │   "cns"     │  │    "cns"     │       │  │
│  │  └─────────────┘  └─────────────┘  └──────────────┘       │  │
│  │                                                            │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │                   Job "cns"                          │  │  │
│  │  │  ┌───────────────────────────────────────────────┐  │  │  │
│  │  │  │  Pod (privileged, hostPID/Net/IPC, root)     │  │  │  │
│  │  │  │    ├─ GPU Collector (nvidia-smi)             │  │  │  │
│  │  │  │    ├─ K8s Collector (API client)             │  │  │  │
│  │  │  │    ├─ OS Collector (/proc, /etc)             │  │  │  │
│  │  │  │    └─ SystemD Collector (D-Bus)              │  │  │  │
│  │  │  └───────────────────────────────────────────────┘  │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  │                         │                                  │  │
│  │                         ▼                                  │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │         ConfigMap "cns-snapshot"                     │  │  │
│  │  │  labels:                                             │  │  │
│  │  │    app.kubernetes.io/name: cns                       │  │  │
│  │  │    app.kubernetes.io/component: snapshot             │  │  │
│  │  │    app.kubernetes.io/version: <version>              │  │  │
│  │  │  data:                                               │  │  │
│  │  │    snapshot.yaml: "<YAML content>"                   │  │  │
│  │  │    format: yaml                                      │  │  │
│  │  │    timestamp: "2026-01-14T10:30:00Z"                 │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Cluster-Scoped Resources                                  │  │
│  │  ┌───────────────────┐  ┌─────────────────────────────┐   │  │
│  │  │    ClusterRole    │  │    ClusterRoleBinding       │   │  │
│  │  │ "cns-node-reader" │  │   "cns-node-reader"         │   │  │
│  │  │   (HARDCODED!)    │  │     (HARDCODED!)            │   │  │
│  │  └───────────────────┘  └─────────────────────────────┘   │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

4.2 RBAC Permissions

Namespace-Scoped Role (pkg/k8s/agent/rbac.go:27-49)

| Resource | Verbs | Purpose |
|----------|-------|---------|
| configmaps | create, get, update, patch | Store snapshot data |
| pods | get, list | Monitor Job pod status |
| pods/log | get | Stream pod logs |

Cluster-Scoped ClusterRole (pkg/k8s/agent/rbac.go:79-110)

| Resource | API Group | Verbs | Purpose |
|----------|-----------|-------|---------|
| nodes | "" | get, list | Query node info |
| pods | "" | get, list | List all pods |
| services | "" | get, list | List all services |
| clusterpolicies | nvidia.com | get, list | NVIDIA GPU policies |

Issue H24: ClusterRole/ClusterRoleBinding names are hardcoded to "cns-node-reader"

  • Cannot customize via --job-name or --service-account-name
  • Multiple concurrent deployments in different namespaces share same cluster resources
  • Cleanup in one namespace may affect another

4.3 Security Context (pkg/k8s/agent/job.go:80-138)

| Setting | Value | Security Implication |
|---------|-------|----------------------|
| runAsUser | 0 (root) | Full system access |
| privileged | true | Bypass container isolation |
| hostPID | true | See all host processes |
| hostNetwork | true | Access host network |
| hostIPC | true | Access host IPC |
| capabilities | SYS_ADMIN, SYS_CHROOT | System-level operations |

Why Privileged is Required:

  • nvidia-smi: Requires access to GPU devices
  • D-Bus: Requires access to system D-Bus socket
  • /proc files: Requires host PID namespace
  • SystemD properties: Requires host IPC namespace
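For reference, the settings in the table above correspond to a pod spec fragment along these lines. This is an illustrative reconstruction, not the literal output of buildJob() in pkg/k8s/agent/job.go:

```yaml
# Illustrative agent pod security settings (see section 4.3).
spec:
  hostPID: true
  hostNetwork: true
  hostIPC: true
  containers:
    - name: agent
      securityContext:
        runAsUser: 0
        privileged: true
        capabilities:
          add: ["SYS_ADMIN", "SYS_CHROOT"]
```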

Our PR #27 adds --privileged flag to allow unprivileged mode for PSS-restricted clusters.

4.4 Resource Requirements (pkg/k8s/agent/job.go:97-107)

| Resource | Request | Limit |
|----------|---------|-------|
| CPU | 1 | 2 |
| Memory | 4Gi | 8Gi |
| Ephemeral Storage | 2Gi | 4Gi |

Issue M22: These values are hardcoded, no flags to customize.

4.5 Pod Error Detection (pkg/k8s/agent/wait.go:96-147)

The wait logic now detects these pod failure conditions:

  • ImagePullBackOff
  • ErrImagePull
  • InvalidImageName
  • CrashLoopBackOff
  • CreateContainerError
  • CreateContainerConfigError
  • RunContainerError

Each returns a clear error message with the reason.
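The essence of this check is mapping a container's waiting reason to a hard error instead of letting the wait loop run until the timeout. A sketch with illustrative names, not those used in pkg/k8s/agent/wait.go:

```go
package main

import (
	"errors"
	"fmt"
)

// fatalWaitingReasons mirrors the failure states listed above.
var fatalWaitingReasons = map[string]bool{
	"ImagePullBackOff":           true,
	"ErrImagePull":               true,
	"InvalidImageName":           true,
	"CrashLoopBackOff":           true,
	"CreateContainerError":       true,
	"CreateContainerConfigError": true,
	"RunContainerError":          true,
}

// checkWaitingReason returns an error for terminal pod states and nil for
// transient ones (e.g. ContainerCreating), so the caller can fail fast.
func checkWaitingReason(reason string) error {
	if fatalWaitingReasons[reason] {
		return errors.New("agent pod failed: " + reason)
	}
	return nil
}

func main() {
	fmt.Println(checkWaitingReason("ImagePullBackOff"))
	fmt.Println(checkWaitingReason("ContainerCreating"))
}
```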

4.6 Security Considerations

Attack Surface:

  1. Privileged container - Can escape container to host
  2. Host namespace access - Can observe all system activity
  3. Root execution - Full node access
  4. RBAC persistence - Cluster-scoped resources persist if cleanup fails

Mitigations:

  1. Permission check before deployment (CheckPermissions())
  2. Automatic cleanup on completion (default: enabled)
  3. Resource limits prevent DoS
  4. Single execution (BackoffLimit: 0)
  5. Hard timeout (ActiveDeadlineSeconds: 18000 = 5 hours)
  6. Pod error detection with clear messages

5. Recipe System

5.1 Recipe Structure (RecipeResult)

kind: RecipeResult
apiVersion: cns.nvidia.com/v1alpha1
metadata:
  generatedAt: "2026-01-14T10:00:00Z"
  version: "v0.19.0"
  appliedOverlays:
    - gb200-eks-ubuntu-training
criteria:
  service: eks
  accelerator: gb200
  intent: training
  os: ubuntu
  nodes: 8
componentRefs:
  - name: cert-manager
    type: Helm
    chart: cert-manager
    version: v1.16.2
    repository: https://charts.jetstack.io
    namespace: cert-manager
    deploymentOrder: 1
    valuesFile: components/cert-manager/values.yaml
    overrides:
      installCRDs: true
  - name: gpu-operator
    type: Helm
    chart: gpu-operator
    version: v25.3.4
    repository: https://helm.ngc.nvidia.com/nvidia
    namespace: gpu-operator
    deploymentOrder: 2
    valuesFile: components/gpu-operator/eks-gb200-training.yaml
    dependencyRefs:
      - cert-manager
constraints:
  - name: K8s.server.version
    value: ">= 1.32"
  - name: OS.release.ID
    value: ubuntu
deploymentOrder:
  - cert-manager
  - gpu-operator
  - network-operator
  - nvsentinel
  - skyhook

5.2 Overlay System

File Structure:

pkg/recipe/data/
├── base.yaml                      # Default components and settings
├── gb200-eks-ubuntu-training.yaml # Overlay
├── h100-eks-ubuntu-training.yaml  # Overlay
└── h100-ubuntu-inference.yaml     # Overlay (3 overlay files currently)

Matching Algorithm:

  1. Load all overlays from pkg/recipe/data/*.yaml
  2. For each overlay, check if criteria matches request
  3. Collect all matching overlays
  4. Sort by specificity score (ascending)
  5. Merge: base → less specific → more specific

Specificity Scoring:

  • Each non-"any" field adds 1 point
  • Fields: service, accelerator, intent, os (nodes is optional)
  • Score range: 0-4 (or 0-5 with nodes)

Example:

Query: { service: eks, accelerator: gb200, os: ubuntu, intent: training }

Overlay 1: { service: eks }                          → Score 1, MATCH
Overlay 2: { service: eks, accelerator: gb200 }      → Score 2, MATCH
Overlay 3: { accelerator: h100 }                     → Score 1, NO MATCH

Merge order: base → Overlay 1 → Overlay 2
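The matching and scoring rules above can be sketched as follows; the criteria type and matches helper are illustrative, not the recipe package's own identifiers:

```go
package main

import "fmt"

// criteria holds the four scored fields; "any" (or empty) acts as a wildcard.
type criteria struct {
	Service, Accelerator, Intent, OS string
}

// matches reports whether an overlay's criteria accept the query and, if so,
// its specificity score: one point per concrete (non-wildcard) field.
func matches(overlay, query criteria) (bool, int) {
	score := 0
	for _, pair := range [][2]string{
		{overlay.Service, query.Service},
		{overlay.Accelerator, query.Accelerator},
		{overlay.Intent, query.Intent},
		{overlay.OS, query.OS},
	} {
		want, got := pair[0], pair[1]
		if want == "" || want == "any" {
			continue // wildcard: matches anything, scores nothing
		}
		if want != got {
			return false, 0
		}
		score++
	}
	return true, score
}

func main() {
	q := criteria{Service: "eks", Accelerator: "gb200", Intent: "training", OS: "ubuntu"}
	ok, s := matches(criteria{Service: "eks", Accelerator: "gb200"}, q)
	fmt.Println(ok, s) // true 2
}
```

Running this for Overlay 2 from the example ({service: eks, accelerator: gb200}) against the query yields a match with score 2, matching the table above.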

5.3 Coverage Analysis

Supported Criteria Values:

| Criteria | Values | Count |
|----------|--------|-------|
| Services | eks, gke, aks, oke | 4 |
| Accelerators | h100, gb200, a100, l40 | 4 |
| Intents | training, inference | 2 |
| OS | ubuntu, rhel, cos, amazonlinux | 4 |

Total Specific Combinations: 4 × 4 × 2 × 4 = 128

Current Overlays (3 files):

| Overlay | Service | Accelerator | OS | Intent | Specificity |
|---------|---------|-------------|----|--------|-------------|
| gb200-eks-ubuntu-training | eks | gb200 | ubuntu | training | 4/4 |
| h100-eks-ubuntu-training | eks | h100 | ubuntu | training | 4/4 |
| h100-ubuntu-inference | any | h100 | ubuntu | inference | 3/4 |

Coverage: 3/128 = 2.34% (Issue C2)

5.4 Coverage Gaps

| Gap Category | Missing Combinations | Count |
|--------------|----------------------|-------|
| A100 accelerator | All A100 combinations | 32 |
| L40 accelerator | All L40 combinations | 32 |
| GB200 non-EKS | gke/aks/oke + gb200 | 24 |
| GB200 inference | Any service + gb200 + inference | 16 |
| Non-Ubuntu OS | rhel/cos/amazonlinux + any | 96 |
| H100 training non-EKS | gke/aks/oke + h100 + training | 3 |
| GKE service | All GKE combinations | 32 |
| AKS service | All AKS combinations | 32 |
| OKE service | All OKE combinations | 32 |

Impact: Most user queries fall back to base configuration only, missing environment-specific optimizations.


6. Bundler System

6.1 Component Bundlers

| Bundler | Bundle Type | Key Outputs |
|---------|-------------|-------------|
| cert-manager | cert-manager | values.yaml, install.sh, README |
| gpu-operator | gpu-operator | values.yaml, clusterpolicy.yaml, scripts |
| network-operator | network-operator | values.yaml, scripts, README |
| skyhook | skyhook | values.yaml, customization CRs, scripts |
| nvsentinel | nvsentinel | values.yaml, scripts, README |

6.2 Bundler Registration Pattern

Each bundler self-registers via init():

// pkg/component/gpuoperator/bundler.go
func init() {
    registry.MustRegister(Name, NewBundler())
}

const Name = types.BundleType("gpu-operator")

Bundler Interface:

type Bundler interface {
    Type() BundleType
    Make(ctx context.Context, input *recipe.RecipeResult, outputDir string) (*Result, error)
}

type ValidatableBundler interface {
    Bundler
    Validate(ctx context.Context, input *recipe.RecipeResult) error
}

6.3 Value Override System (--set)

Format: --set bundler:path.to.field=value

Merge Precedence (lowest to highest):

  1. Base values (from recipe data)
  2. valuesFile content
  3. Recipe overrides field
  4. CLI --set flags
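This precedence chain behaves like repeated map overlays where later layers win. A flat-map sketch of the idea; real Helm values are nested and merged recursively, so this is an assumption-laden simplification:

```go
package main

import "fmt"

// mergeValues overlays each later map onto the accumulated result, so callers
// pass layers in precedence order: base, valuesFile, recipe overrides, --set.
func mergeValues(layers ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, layer := range layers {
		for k, v := range layer {
			out[k] = v // later layers win
		}
	}
	return out
}

func main() {
	base := map[string]string{"driver.version": "570.133.20", "cdi.enabled": "false"}
	recipeOverrides := map[string]string{"cdi.enabled": "true"}
	cliSet := map[string]string{"mig.strategy": "mixed"}
	fmt.Println(mergeValues(base, recipeOverrides, cliSet))
}
```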

Example Paths by Bundler:

GPU Operator

gpuoperator:operator.nodeSelector=key=value
gpuoperator:daemonsets.nodeSelector=key=value
gpuoperator:dcgmExporter.config.create=true
gpuoperator:gds.enabled=true
gpuoperator:driver.version=570.133.20
gpuoperator:cdi.enabled=true
gpuoperator:mig.strategy=mixed

Network Operator

networkoperator:operator.repository=myregistry.com
networkoperator:ofedDriver.version=23.04
networkoperator:ofedDriver.deploy=true
networkoperator:rdma.enabled=true
networkoperator:sriov.enabled=true

Cert-Manager

certmanager:installCRDs=true
certmanager:nodeSelector=key=value
certmanager:tolerations=...
certmanager:webhook.nodeSelector=key=value

Skyhook

skyhook:manager.resources.cpu.limit=2
skyhook:manager.resources.memory.limit=2Gi
skyhook:customization=ubuntu
skyhook:controllerManager.selectors=key=value

NVSentinel

nvsentinel:namespace=nvsentinel
nvsentinel:sentinel.enabled=true
nvsentinel:sentinel.logLevel=info
nvsentinel:global.systemNodeSelector=key=value

Limitations:

  • No array index override syntax (e.g., tolerations[0].key=value)
  • No wildcard paths
  • Type conversion is automatic (strings → bool/int where appropriate)
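Within these limits, the basic --set syntax splits cleanly on the first ':' and the first '='. An illustrative sketch; the real parser is config.ParseValueOverrides, and parseSet here is a hypothetical stand-in:

```go
package main

import (
	"fmt"
	"strings"
)

// parseSet splits a --set argument "bundler:path.to.field=value" into its
// three parts. Cutting at the first '=' keeps values like "key=value" intact.
func parseSet(arg string) (bundler, path, value string, err error) {
	bundler, rest, ok := strings.Cut(arg, ":")
	if !ok {
		return "", "", "", fmt.Errorf("missing bundler prefix in %q", arg)
	}
	path, value, ok = strings.Cut(rest, "=")
	if !ok {
		return "", "", "", fmt.Errorf("missing '=value' in %q", arg)
	}
	return bundler, path, value, nil
}

func main() {
	b, p, v, err := parseSet("gpuoperator:driver.version=570.133.20")
	if err != nil {
		panic(err)
	}
	fmt.Println(b, p, v) // gpuoperator driver.version 570.133.20
}
```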

6.4 Deployer Types

| Type | Outputs | Use Case |
|------|---------|----------|
| script | README with helm commands, install.sh | Manual deployment |
| argocd | app-of-apps.yaml, Application CRs | GitOps with ArgoCD |
| flux | kustomization.yaml, HelmRelease CRs | GitOps with Flux |

Deployment Order Handling:

  • Script: Documents order in README
  • ArgoCD: Uses argocd.argoproj.io/sync-wave annotations
  • Flux: Uses spec.dependsOn fields

7. Collector System

7.1 Factory Pattern

File: pkg/collector/factory.go

type Factory interface {
    CreateSystemDCollector() Collector
    CreateOSCollector() Collector
    CreateKubernetesCollector() Collector
    CreateGPUCollector() Collector
}

7.2 Collector Details

Collector Data Sources Key Outputs Graceful Degradation
GPU nvidia-smi -q -x driver version, CUDA, GPU model, memory, count Yes (PR #18) - returns gpu.count=0
K8s Kubernetes API server version, images, policies, node info No - requires API access
OS /proc, /etc kernel, OS release, sysctl, modules No - requires /proc access
SystemD D-Bus service status (containerd, docker, kubelet) Yes (PR #19) - empty if D-Bus unavailable

7.3 GPU Collector Details (pkg/collector/gpu/gpu.go)

Data Collection:

  1. Execute nvidia-smi -q -x
  2. Parse XML output
  3. Extract:
    • Driver version
    • CUDA version
    • GPU count
    • GPU model (per-GPU)
    • Memory info
    • MIG configuration

Graceful Degradation (since PR #18):

if errors.Is(err, exec.ErrNotFound) || os.IsNotExist(err) {
    slog.Warn("nvidia-smi not found, returning empty GPU measurements")
    return &measurement.Measurement{
        Type: measurement.TypeGPU,
        Subtypes: []measurement.Subtype{{
            Name: "smi",
            Data: map[string]measurement.Reading{
                "gpu.count": measurement.Int(0),
            },
        }},
    }, nil
}

7.4 K8s Collector Details (pkg/collector/k8s/k8s.go)

Data Collection:

  1. Get server version from /version endpoint
  2. List all pods, extract unique images
  3. Get ClusterPolicy CRDs (nvidia.com/v1)
  4. Get node info (first node only)

Limitations:

  • Collects info for only one node (per M28, this is by design: the agent reports the current node via the NODE_NAME env var)
  • Lists ALL pods across ALL namespaces (can be slow on large clusters)
  • No pagination for pod listing

7.5 OS Collector Details (pkg/collector/os/os.go)

Data Collection:

  1. /proc/cmdline → GRUB boot parameters
  2. /proc/modules → Loaded kernel modules
  3. /proc/sys/* → Sysctl parameters
  4. /etc/os-release → OS identification

Platform Assumptions:

  • Hardcoded Linux paths
  • Won't work on Windows or macOS (intentional - GPU nodes are Linux)

7.6 SystemD Collector Details (pkg/collector/systemd/systemd.go)

Data Collection:

  1. Connect to system D-Bus
  2. Query properties for:
    • containerd.service
    • docker.service
    • kubelet.service

Graceful Degradation (since PR #19):

  • Returns empty measurements if D-Bus unavailable
  • Logs warning but doesn't fail

7.7 Measurement Structure

type: GPU  # or K8s, OS, SystemD
subtypes:
  - name: smi
    data:
      driver-version: "570.133.20"
      cuda-version: "12.8"
      gpu.count: 8
      gpu.model: "NVIDIA H100"
    context:
      driver-version: "NVIDIA driver version installed on the system"

8. Serializer System

8.1 Output Formats

Format Extension Description Read Support Write Support
json .json Pretty-printed JSON Yes Yes
yaml .yaml YAML with 2-space indent Yes Yes
table - Flattened key-value table No Yes

8.2 Output Destinations

Destination Format Example Implementation
File path /tmp/snapshot.yaml writer.go:NewFileWriter
ConfigMap cm://namespace/name cm://default/cns-snapshot configmap.go
HTTP URL https://... https://example.com/snap.yaml http.go (read only)
Stdout - or empty Default writer.go:NewStdoutWriter

8.3 ConfigMap Storage (pkg/serializer/configmap.go)

Write Flow (after PR #32 - Server-Side Apply):

func (w *ConfigMapWriter) Serialize(ctx context.Context, data any) error {
    // 1. Marshal data to YAML/JSON
    content, err := serializeYAML(data)
    if err != nil {
        return err
    }

    // 2. Build ConfigMap apply configuration
    configMap := accorev1.ConfigMap(w.name, w.namespace).
        WithLabels(map[string]string{
            "app.kubernetes.io/name":      "cns",
            "app.kubernetes.io/component": "snapshot",
            "app.kubernetes.io/version":   version,
        }).
        WithData(map[string]string{
            "snapshot.yaml": string(content),
            "format":        string(w.format),
            "timestamp":     time.Now().UTC().Format(time.RFC3339),
        })

    // 3. Atomic Server-Side Apply (creates or updates)
    _, err = client.CoreV1().ConfigMaps(w.namespace).Apply(
        ctx,
        configMap,
        metav1.ApplyOptions{FieldManager: "cnsctl"},
    )
    return err
}

Key improvement: PR #32 replaced the race-prone Get-then-Create/Update pattern with atomic Server-Side Apply (SSA), eliminating the lost-update risk under concurrent writes.

Issues (Status):

  • Race condition: Get-then-Create was not atomic ✅ Fixed by PR #32 (SSA)
  • Context timeout: writes are bounded by a 30-second timeout, so they cannot block indefinitely
  • Silent fallback: invalid paths silently fell back to stdout ✅ Fixed by PR #24 (now returns an error)

8.4 URI Parsing

func ParseURI(uri string) (scheme, namespace, name string, err error) {
    // Supports:
    // - cm://namespace/name     → ConfigMap
    // - https://example.com/... → HTTP
    // - /path/to/file           → File
    // - -                       → Stdout
}

9. Documentation Analysis

9.1 Documentation Structure

docs/
├── OVERVIEW.md                    # High-level product overview
├── architecture/
│   ├── README.md                  # Architecture overview (1264 lines!)
│   ├── api-server.md              # API server architecture
│   ├── cli.md                     # CLI architecture
│   ├── component.md               # Bundler component guide
│   └── data.md                    # Recipe data architecture
├── demos/
│   ├── e2e.md                     # End-to-end demo
│   └── s3c.md                     # S3C demo
├── integration/
│   ├── api-reference.md           # API reference (695 lines)
│   ├── automation.md              # CI/CD integration
│   ├── data-flow.md               # Data flow documentation
│   ├── kubernetes-deployment.md   # K8s deployment guide
│   └── recipe-development.md      # Recipe development guide
└── user-guide/
    ├── agent-deployment.md        # Agent deployment guide (900 lines)
    ├── api-reference.md           # User-facing API reference
    ├── cli-reference.md           # CLI reference (900 lines)
    └── installation.md            # Installation guide

9.2 Documentation Quality Assessment

Document Lines Quality Issues
architecture/README.md 1264 Excellent Very comprehensive, good diagrams
user-guide/cli-reference.md 900 Excellent Complete flag documentation
user-guide/agent-deployment.md 900 Good Fixed in PR #15
integration/api-reference.md 695 Good Complete API documentation
architecture/data.md 865 Excellent Detailed overlay system explanation
integration/recipe-development.md 650 Good Helpful for contributors

9.3 Documentation Findings

Strengths:

  • Comprehensive CLI reference with all flags documented
  • Good architecture documentation with mermaid diagrams
  • Clear examples in most documents
  • Recipe data architecture well explained

Gaps:

  • No changelog (M26)
  • No quick start guide (L12)
  • No troubleshooting guide beyond basic tips
  • Some documents reference draft features

10. Build System Analysis

10.1 Makefile Targets

File: Makefile (145 lines)

Target Description Dependencies
info Print project info -
tidy Update Go modules -
upgrade Upgrade all dependencies -
lint Lint Go and YAML lint-go, lint-yaml
lint-go Run golangci-lint -
lint-yaml Run yamllint -
test Run unit tests with race detector -
e2e Run integration tests tools/e2e
scan Vulnerability scan (go vet + grype) -
qualify Full qualification test, lint, e2e, scan
server Start development server -
docs Serve Go documentation -
build Build release binaries tidy
image Build and push container image -
release Run goreleaser -
bump-major/minor/patch Version bumping tools/bump
clean Clean directories -
help Show available targets -

10.2 Build Configuration

Go Version: Uses go env GOVERSION (documented in info)

Linting:

  • golangci-lint with .golangci.yaml config
  • yamllint with .yamllint.yaml config

Release:

  • goreleaser with .goreleaser.yaml config
  • Multi-platform binaries (darwin/linux, amd64/arm64)

Container Images:

  • Built with ko
  • Registry: ghcr.io/nvidia (configurable via IMAGE_REGISTRY)
  • Tag: latest (configurable via IMAGE_TAG)

10.3 Deployment YAMLs

Location: deployments/cns-agent/

File Purpose
1-deps.yaml RBAC resources (SA, Role, RoleBinding, ClusterRole, ClusterRoleBinding)
2-job.yaml Job manifest for agent deployment

1-deps.yaml Analysis:

  • Creates namespace-scoped RBAC (cns service account, role, rolebinding)
  • Creates cluster-scoped RBAC (cns-node-reader clusterrole, clusterrolebinding)
  • Includes secret list permission (potential security concern)

2-job.yaml Analysis:

  • Uses hardcoded nodeSelector: nodeGroup: customer-gpu
  • Uses specific tolerations for dedicated=user-workload
  • Image: ghcr.io/mchmarny/cns:latest (corrected to ghcr.io/nvidia/cns:latest by PR #35)
  • Privileged security context

Issues Found:

  • Image pointed to the mchmarny fork instead of nvidia ✅ Fixed by PR #35 (M27)
  • NodeSelector is environment-specific
  • Tolerations are environment-specific

11. Issue Catalog

11.1 Critical Issues (1 open, 3 fixed, 1 wontfix)

C1: Privileged Container Required for Snapshot

Status: ✅ FIXED (PR #27) File: pkg/k8s/agent/job.go:62-70 Impact: Cannot deploy on PSS-restricted clusters without exemption

Context: The agent Job requires privileged: true security context to:

  1. Access nvidia-smi for GPU metrics
  2. Read D-Bus socket for SystemD service status
  3. Access /proc files with host PID namespace

Why It Matters: Many enterprise Kubernetes clusters enforce Pod Security Standards (PSS) at "restricted" or "baseline" level, which prohibit privileged containers. This prevents CNS agent deployment without cluster policy exceptions.

Fix (PR #27): Adds --privileged flag (default: true) allowing --privileged=false for PSS-restricted environments. In unprivileged mode, GPU and SystemD collectors return empty/degraded results.


C2: Only 2.34% Overlay Coverage

Status: Open File: pkg/recipe/data/*.yaml Impact: Most configurations use base-only settings

Context: With only 3 overlay files covering 3/128 possible criteria combinations, most user queries fall through to the base configuration without environment-specific optimizations.

Why It Matters: The value proposition of CNS is hardware-aware, environment-specific configuration generation. Without overlays for A100, L40, GKE, AKS, OKE, or non-Ubuntu OS, users get generic configurations that may not be optimal for their environment.

Missing Coverage:

  • A100 GPUs (common in existing deployments)
  • L40 GPUs (common for inference)
  • GKE, AKS, OKE platforms (major cloud providers)
  • RHEL, COS, Amazon Linux (common enterprise OSes)
  • Inference workloads on GB200

C3: RBAC Cleanup May Fail Silently

Status: ✅ FIXED (PR #16) File: pkg/k8s/agent/deployer.go:72-117 Fix: Now attempts all deletions and reports errors


C4: No Validation for Unsupported Criteria Combos

Status: ✅ FIXED (PR #14) File: pkg/cli/recipe.go:110-112 Fix: Validates at least one criteria is provided


C5: No Bundle Validation Before Write

Status: ⏸️ WONTFIX Rationale: Input validation exists; output validation adds complexity without clear benefit


11.2 High Priority Issues (0 open, 12 fixed, 7 wontfix)

H1: --format Had Unintuitive -t Alias

Status: ✅ FIXED (PR #12) Fix: Now uses -f consistently


H2: No Short Alias for --deploy-agent

Status: ⏸️ WONTFIX Rationale: Verbosity is intentional for safety. --deploy-agent has significant side effects (creates K8s Job, RBAC, runs containers). Short aliases like -a or -d make accidental deployment too easy. The flag is typically used in scripts where verbosity doesn't hurt UX.


H6: No Progress Indicator During Job Wait

Status: ✅ PARTIALLY FIXED (PR #20) Fix: Log streaming now provides real-time output with [agent] prefix


H7: Criteria Validation Happens Late

Status: ⏸️ WONTFIX Rationale: Without --snapshot, validation is already immediate. With --snapshot, the snapshot must be loaded anyway to extract criteria. Moving enum validation to flag parsing requires custom flag types in urfave/cli v3, adding significant complexity for a narrow edge case.


H8: No Warning When Using Base-Only Config

Status: ✅ FIXED (PR #31) File: pkg/recipe/metadata_store.go:199-205

Fix: Added slog.Warn() when no overlays match criteria. Warning includes criteria used and hint about potential optimization gap. Example output: no environment-specific overlays matched, using base configuration only


H11: Bundler Name Case-Sensitive

Status: ✅ FIXED (PR #17) Fix: Now case-insensitive with typo suggestions


H12: No Suggestions for Failed Constraints

Status: ⏸️ WONTFIX Rationale: Constraint failures are environment-specific; generic suggestions would be misleading


H13: Exit Code Always 0 Unless --fail-on-error

Status: ✅ FIXED (PR #30) File: pkg/cli/validate.go

Fix: Changed --fail-on-error to default to true. Users can opt-out with --fail-on-error=false for informational mode.


H14: GPU Collector Fails Silently if nvidia-smi Missing

Status: ✅ FIXED (PR #18) Fix: Graceful degradation, returns gpu.count=0


H15: SystemD Collector Requires D-Bus Access

Status: ✅ FIXED (PR #19) Fix: Graceful degradation when D-Bus unavailable


H16: ConfigMap Write Silently Falls Back to Stdout

Status: ✅ FIXED (PR #24) File: pkg/serializer/writer.go:34-67

Fix: Returns an error instead of silent fallback when ConfigMap URI is invalid or inaccessible.


H17: agent-deployment.md Had Inaccuracies

Status: ✅ FIXED (PR #15)


H19: Output Flag -o Means File for Some Commands, Directory for Others

Status: ⏸️ WONTFIX Rationale: Changing this would break existing workflows; documented behavior


H20: Format Validation Happens in Action, Not Flag

Status: ⏸️ WONTFIX File: pkg/cli/snapshot.go:121-124 (and similar) Impact: Late error discovery

Context: Format validation (yaml, json, table) happens after the command starts executing, not during flag parsing.

Rationale: Validation happens as the first operation in Action handlers, so the practical impact is minimal. No expensive operations run before format validation.


H21: Job Logs Not Streamed During Wait

Status: ✅ FIXED (PR #20) Fix: Logs now streamed with [agent] prefix


H22: Recipe Command --snapshot Doesn't Support All URI Types

Status: ⏸️ WONTFIX File: pkg/cli/recipe.go:84-90 Impact: Inconsistent URI support

Context: The recipe command's --snapshot flag already supports file paths, HTTP/HTTPS URLs, and ConfigMap URIs.

Rationale: Issue overstated - the flag documentation already clearly states "Supports: file paths, HTTP/HTTPS URLs, or ConfigMap URIs" and error messages are reasonably specific.


H23: Bundle Command Missing --kubeconfig Flag

Status: ✅ FIXED (PR #29) File: pkg/cli/bundle.go

Fix: Added kubeconfigFlag to the bundle command and uses FromFileWithKubeconfig to load recipes. Enables loading recipes from ConfigMap URIs.


H24: ClusterRole/ClusterRoleBinding Names Hardcoded

Status: ⏸️ WONTFIX Rationale: ClusterRole/ClusterRoleBinding are cluster-scoped and intentionally shared. Having a single "cns-node-reader" role is simpler and avoids role proliferation. The permissions are read-only and safe to share across namespaces.


H25: ConfigMap Race Condition

Status: ✅ FIXED (PR #32) File: pkg/serializer/configmap.go:109-132

Fix: Replaced Get-then-Create/Update with Kubernetes Server-Side Apply (SSA). Single atomic operation handles both create and update. Field ownership tracked via FieldManager: "cnsctl".


11.3 Medium Priority Issues (8 open, 4 fixed, 4 wontfix)

ID Category Issue Status
M1 CLI No command aliases (e.g., snap for snapshot) Open
M2 CLI Help text formatting inconsistent Open
M3 CLI No examples in command help ✅ FIXED (PR #34)
M4 CLI Error messages don't suggest fixes Open
M5 CLI No progress output for long operations ✅ PARTIALLY FIXED (PR #22)
M6 CLI --kubeconfig shown for all commands but not always used ⏸️ WONTFIX (inaccurate - all commands that have it use it; bundle missing it is H23)
M7 CLI Completion command hidden ✅ FIXED (PR #8)
M8 Recipe Overlay files not validated at load time Open
M9 Recipe No dry-run mode Open
M11 Bundle No component dependency visualization Open
M18 Collector OS collector assumes Linux paths ⏸️ WONTFIX (Linux-only is intentional - tool is for Linux GPU nodes)
M21 Agent Job name collisions possible ⏸️ WONTFIX
M22 Agent No resource limit customization flags Open
M26 Docs No changelog Open
M27 Build deployments/cns-agent/2-job.yaml uses fork image registry ✅ FIXED (PR #35)
M28 K8s Collector Only collects first node info ⏸️ WONTFIX (by design - collects current node via NODE_NAME env var)

11.4 Low Priority Issues (18 open, 0 fixed, 0 wontfix)

ID Category Issue Status
L1 CLI Version output format not customizable Open
L2 CLI No shell completion for flag values Open
L3 CLI Debug output very verbose Open
L4 Recipe Component versions hardcoded in overlays Open
L5 Bundle README templates not customizable Open
L6 Bundle Script templates assume bash Open
L7 Validate No constraint grouping in output Open
L8 Collector Metrics exposed but not documented Open
L9 Serializer No compression option Open
L10 Agent Labels not customizable Open
L11 Agent No annotations support Open
L12 Docs No quick start guide Open
L13 Docs No comparison with alternatives Open
L14 Docs No video tutorials Open
L15 CLI No quiet mode Open
L16 Bundle Silently overwrites existing output directory Open (E2E)
L17 CLI Local snapshot on macOS doesn't suggest --deploy-agent Open (E2E)
L18 CLI Mixed stdout/stderr output ordering Open (E2E)

12. UX Improvement Roadmap

Phase 1: Quick Wins (Low Effort, High Impact)

  1. Fix H23: Add missing --kubeconfig flag to bundle command ✅ MERGED (PR #29)
  2. Fix M27: Update deployments/2-job.yaml to use correct image registry ✅ MERGED (PR #35)
  3. Fix H16: Return error instead of silent fallback for ConfigMap writes ✅ MERGED (PR #24)
  4. Fix C1: Add --privileged flag for PSS compliance ✅ MERGED (PR #27)
  5. Fix H13: Default --fail-on-error to true ✅ MERGED (PR #30)
  6. Fix H8: Warn when using base-only config ✅ MERGED (PR #31)
  7. Fix H25: Use SSA for atomic ConfigMap updates ✅ MERGED (PR #32)
  8. Fix M3: Add command examples to help text ✅ MERGED (PR #34)

Phase 2: CLI Consistency

  1. Add short alias for --deploy-agent (H2) ⏸️ WONTFIX
  2. Move format validation to flag parsing (H20) ⏸️ WONTFIX
  3. Move criteria validation to flag parsing (H7) ⏸️ WONTFIX
  4. Add warning when using base-only config (H8) ✅ PR #31
  5. Add command aliases (M1)

Phase 3: Recipe Coverage (C2)

  1. Add A100 overlays - Common existing deployments
  2. Add L40 overlays - Common inference workloads
  3. Add GKE/AKS overlays - Major cloud providers
  4. Add RHEL overlays - Enterprise Linux
  5. Add inference overlays for all GPUs - Complete workload coverage

Phase 4: Agent Improvements

  1. Fix H24: Make ClusterRole names configurable ⏸️ WONTFIX
  2. Fix H25: Use atomic ConfigMap updates ✅ MERGED (PR #32)
  3. Add resource limit flags (M22)
  4. Add labels/annotations flags (L10, L11)

Phase 5: Documentation

  1. Add changelog (M26)
  2. Add quick start guide (L12)
  3. Add troubleshooting guide
  4. Add architecture diagrams to README

Phase 6: Observability

  1. Document exposed metrics (L8)
  2. Add structured telemetry
  3. Add timing information to outputs

13. Appendices

Appendix A: File Reference

Component Key Files
CLI pkg/cli/*.go
Recipe pkg/recipe/*.go, pkg/recipe/data/*.yaml
Bundler pkg/bundler/*.go, pkg/component/*/
Deployer pkg/deployer/provider/*/
Collector pkg/collector/*/
Snapshotter pkg/snapshotter/*.go
Agent pkg/k8s/agent/*.go
Serializer pkg/serializer/*.go
Validator pkg/validator/*.go
K8s Client pkg/k8s/client/*.go

Appendix B: Criteria Values

Service Types:

  • eks - Amazon EKS
  • gke - Google GKE
  • aks - Azure AKS
  • oke - Oracle OKE
  • self-managed - Self-managed Kubernetes

Accelerator Types:

  • h100 - NVIDIA H100
  • gb200 - NVIDIA GB200
  • a100 - NVIDIA A100
  • l40 - NVIDIA L40

Intent Types:

  • training - ML training workloads
  • inference - ML inference workloads

OS Types:

  • ubuntu - Ubuntu Linux
  • rhel - Red Hat Enterprise Linux
  • cos - Container-Optimized OS (GKE)
  • amazonlinux - Amazon Linux

Appendix C: Constraint Path Format

{Type}.{Subtype}.{Key}

Supported Types:
- K8s
- GPU
- OS
- SystemD

Examples:
- K8s.server.version
- GPU.smi.driver-version
- GPU.smi.cuda-version
- GPU.smi.gpu.count
- OS.release.ID
- OS.release.VERSION_ID
- OS.sysctl./proc/sys/kernel/osrelease
- OS.kmod.nvidia
- SystemD.containerd.service.ActiveState

Appendix D: Exit Codes

Code Current Meaning
0 Success (or validation passed, even with failures unless --fail-on-error)
1 Any error

Recommended Enhancement:

Code Proposed Meaning
0 Success
1 User error (invalid flags)
2 Execution error (API failures)
3 Validation failure (with --fail-on-error)

Appendix E: Environment Variables

Variable Used By Default Description
CNS_NAMESPACE snapshot gpu-operator Agent deployment namespace
CNS_IMAGE snapshot ghcr.io/nvidia/cns:latest Agent container image
KUBECONFIG snapshot, recipe, validate ~/.kube/config Kubernetes config path
LOG_LEVEL all info Logging level
NO_COLOR all false Disable colored output

Revision History

Version Date Changes
4.3 2026-01-15 Added PR #34 (M3) and #35 (M27). Total: 58 issues (27 open, 19 fixed, 12 wontfix). Phase 1 complete!
4.2 2026-01-15 Major refresh: All 7 PRs now MERGED (#24, #27, #29, #30, #31, #32, #33)
4.1 2026-01-15 Added L16-L18 from E2E testing
4.0 2026-01-14 Complete fresh analysis with deep context. Added H22-H25, M27-M28

Quick Reference: Issue Status Legend

Symbol Meaning
✅ FIXED Issue resolved and merged to upstream
✅ PARTIALLY FIXED Issue improved but not fully resolved
⏸️ WONTFIX Issue acknowledged but intentionally not fixing
Open Issue confirmed, no fix submitted yet
(E2E) Issue identified during E2E testing

Our Merged PRs (9 total)

PR Issue Description Status
#35 M27 Fix image registry in example Job manifest ✅ MERGED
#34 M3 Add examples to recipe and bundle command help ✅ MERGED
#33 (E2E) Log when CLI flags override snapshot-detected criteria ✅ MERGED
#32 H25 Use SSA for atomic ConfigMap updates ✅ MERGED
#31 H8 Warn when using base-only config ✅ MERGED
#30 H13 Default --fail-on-error to true ✅ MERGED
#29 H23 Enable kubeconfig support for bundle command ✅ MERGED
#27 C1 Add --privileged flag for PSS compliance ✅ MERGED
#24 H16 Return error instead of silent fallback ✅ MERGED

Document generated by Claude Opus 4.5 based on comprehensive codebase analysis. Last synced with upstream: 2026-01-15 (commit a68ee61)

End of Document
