
Cloud Native Stack (CNS) End-to-End Demo Report

Date: January 13, 2026
Cluster: AWS EKS (us-east-1)
CNS Version: v0.17.2-next
Author: Claude Opus 4.5 Analysis


Table of Contents

  1. Executive Summary
  2. What is Cloud Native Stack?
  3. The Four-Stage Workflow
  4. Environment Discovery and Parameter Selection
  5. Stage 1: Snapshot - Capturing System State
  6. Stage 2: Recipe - Generating Configuration Recommendations
  7. Stage 3: Validate - Checking Compatibility
  8. Stage 4: Bundle - Creating Deployment Artifacts
  9. What Happens After Stage 4?
  10. Key Architectural Concepts
  11. Appendix A: ConfigMap Contents
  12. Appendix B: Complete Command Reference
  13. Appendix C: Generated Bundle Files

1. Executive Summary

This report documents the end-to-end workflow of NVIDIA's Cloud Native Stack (CNS), a suite of tooling designed to take the complexity out of deploying GPU-accelerated Kubernetes infrastructure. We executed all four stages of the CNS workflow on a live AWS EKS cluster with H100 GPUs and successfully:

  • Stage 1: Captured a comprehensive system snapshot from GPU nodes
  • Stage 2: Generated an optimized recipe based on the detected environment
  • Stage 3: Validated the cluster configuration against recipe constraints
  • Stage 4: Generated ArgoCD deployment artifacts ready for GitOps deployment

All stages completed successfully with 4 out of 4 validation constraints passing.


2. What is Cloud Native Stack?

The Problem CNS Solves

Running NVIDIA-accelerated Kubernetes clusters reliably is challenging. Small differences in:

  • Kernel versions
  • GPU drivers
  • Container runtimes
  • Kubernetes releases
  • Operating system configurations

...can cause failures that are difficult to diagnose and expensive to reproduce.

Historically, this expertise lived in internal validation pipelines, playbooks, and tribal knowledge. Cloud Native Stack externalizes that experience, making validated configurations visible, repeatable, and reusable.

What CNS Is

CNS is a source of validated configuration knowledge for NVIDIA-accelerated Kubernetes environments. It includes:

| Component | Description |
|---|---|
| cnsctl | Command-line tool implementing all four workflow stages |
| cnsd | API server for programmatic access to recipes and bundles |
| Agent | Kubernetes Job that captures snapshots on GPU nodes |
| Recipe Data | Embedded database of validated configurations |
| Bundlers | Plugins that generate deployment artifacts (GPU Operator, NVSentinel, Skyhook, cert-manager) |
| Deployers | GitOps integration (ArgoCD, Flux, or shell scripts) |

What CNS Is NOT

  • Not a Kubernetes distribution
  • Not a cluster provisioning or lifecycle management system
  • Not a managed control plane or hosted service
  • Not a replacement for cloud provider platforms

3. The Four-Stage Workflow

CNS operates through a logical four-stage workflow that transforms raw system state into deployable packages:

┌──────────────┐      ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Snapshot   │─────▶│    Recipe    │─────▶│   Validate   │─────▶│    Bundle    │
└──────────────┘      └──────────────┘      └──────────────┘      └──────────────┘
     Stage 1              Stage 2               Stage 3              Stage 4
   (Capture)            (Optimize)             (Check)             (Package)

| Stage | Purpose | Input | Output |
|---|---|---|---|
| Snapshot | Capture system configuration | Live cluster/node | YAML snapshot (file or ConfigMap) |
| Recipe | Generate optimized configuration | Snapshot or query parameters | Recipe with component versions |
| Validate | Check cluster compatibility | Recipe + Snapshot | Validation results (pass/fail) |
| Bundle | Create deployment artifacts | Recipe | Helm values, manifests, scripts |

4. Environment Discovery and Parameter Selection

Before executing the CNS workflow, I needed to understand the cluster environment to select appropriate parameters. Here's exactly how I discovered the necessary information:

4.1 Cluster Discovery Commands

# 1. Verify cluster connectivity
kubectl cluster-info
# Result: EKS cluster in us-east-1

# 2. List all nodes
kubectl get nodes -o wide
# Found 9 nodes with various roles

# 3. Find GPU nodes specifically
kubectl get nodes -l nvidia.com/gpu.present=true -o wide
# Found 2 nodes: ip-10-0-180-238 and ip-10-0-182-128

4.2 GPU Node Analysis

I examined the GPU nodes in detail:

kubectl describe node ip-10-0-180-238.ec2.internal | grep -E "(nvidia|gpu)" -i

Key findings from node labels:

  • nvidia.com/gpu.present=true - Confirms GPU presence
  • nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3 - H100 GPUs
  • nvidia.com/gpu.count=8 - 8 GPUs per node
  • nvidia.com/cuda.driver-version.full=570.133.20 - Driver version
  • nodeGroup=customer-gpu - This became my node selector
  • dedicated=user-workload - Node taint key/value

4.3 Parameter Selection Rationale

Based on the cluster discovery, I selected these parameters:

| Parameter | Value | How I Found It |
|---|---|---|
| --namespace | gpu-operator | Standard namespace for GPU components |
| --node-selector | nodeGroup=customer-gpu | Label on the GPU nodes |
| --accelerated-node-toleration | dedicated=user-workload:NoSchedule | Taint on the GPU nodes |
| --system-node-selector | nodeGroup=system-pool | Label on non-GPU nodes (from the system-cpu role) |
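
For reference, the selector and toleration values above can be double-checked with ordinary kubectl queries (nothing CNS-specific; the node name is the one found in section 4.1):

# Show the nodeGroup label for every node (customer-gpu vs system-pool)
kubectl get nodes -L nodeGroup

# Show the taints on a GPU node (expect dedicated=user-workload:NoSchedule)
kubectl get node ip-10-0-180-238.ec2.internal -o jsonpath='{.spec.taints}'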

5. Stage 1: Snapshot - Capturing System State

5.1 What the Snapshot Command Does

The snapshot command deploys a Kubernetes Job to a GPU node that:

  1. Creates RBAC resources (ServiceAccount, Role, RoleBinding)
  2. Runs a pod on a GPU node
  3. Collects comprehensive system information
  4. Writes results to a ConfigMap
  5. Cleans up the Job (with --cleanup flag)
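
While the agent Job is running, its progress can be followed with plain kubectl. The Job and ConfigMap names below (cns, cns-snapshot) are the ones reported in the CLI output in section 5.4; note the Job itself disappears once --cleanup runs:

# Watch the agent pod scheduled by the Job (Job pods carry the job-name label)
kubectl -n gpu-operator get pods -l job-name=cns -w

# Block until the Job completes (mirrors the CLI's 5-minute wait)
kubectl -n gpu-operator wait --for=condition=complete job/cns --timeout=5m

# Confirm the resulting ConfigMap exists
kubectl -n gpu-operator get configmap cns-snapshot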

5.2 The Command I Ran

./dist/cnsctl_darwin_arm64_v8.0/cnsctl snapshot \
    --deploy-agent \
    --namespace gpu-operator \
    --image ghcr.io/mchmarny/cns:latest \
    --node-selector nodeGroup=customer-gpu \
    --cleanup

5.3 Command Breakdown

| Flag | Purpose |
|---|---|
| --deploy-agent | Deploy a Kubernetes Job instead of taking a local snapshot |
| --namespace gpu-operator | Target namespace for the Job and ConfigMap |
| --image ghcr.io/mchmarny/cns:latest | Container image for the agent |
| --node-selector nodeGroup=customer-gpu | Schedule the Job on GPU nodes only |
| --cleanup | Remove the Job after completion (keep the ConfigMap) |

5.4 Output

deploying agent: namespace=gpu-operator
agent deployed successfully
waiting for Job completion: job=cns timeout=5m0s
job completed successfully
snapshot saved to ConfigMap: uri=cm://gpu-operator/cns-snapshot

5.5 What Gets Collected

The snapshot captures four categories of measurements:

| Type | Subtypes | Example Data |
|---|---|---|
| SystemD | containerd.service | CPUAccounting, MemoryAccounting, CgroupVersion |
| OS | grub, kmod, sysctl, release | Kernel parameters, loaded modules, OS version |
| K8s | server, image, policy | Kubernetes version, container images |
| GPU | smi, driver, device | CUDA version, driver version, GPU model |
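
To spot-check what was captured, individual fields can be pulled straight out of the stored snapshot; the keys below (driver-version, cuda-version, gpu.name) are the ones shown in Appendix A.5, and yq is the same tool used elsewhere in this report:

kubectl -n gpu-operator get cm cns-snapshot -o jsonpath='{.data.snapshot\.yaml}' \
    | yq . | grep -E 'driver-version|cuda-version|gpu\.name'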

6. Stage 2: Recipe - Generating Configuration Recommendations

6.1 What the Recipe Command Does

The recipe command:

  1. Loads the snapshot from the ConfigMap
  2. Automatically extracts criteria (service type, GPU type, OS) from the snapshot
  3. Applies CLI overrides (like --intent training)
  4. Matches criteria against embedded overlay rules
  5. Generates optimized component recommendations

6.2 The Command I Ran

./dist/cnsctl_darwin_arm64_v8.0/cnsctl recipe \
    --snapshot cm://gpu-operator/cns-snapshot \
    --intent training \
    --output /tmp/recipe.yaml

6.3 Automatic Criteria Extraction

The recipe command automatically detected from the snapshot:

| Criterion | Detected Value | How It Was Detected |
|---|---|---|
| service | eks | K8s version string contains -eks- |
| accelerator | h100 | GPU model contains H100 |
| os | ubuntu | /etc/os-release ID field |
| intent | training | Specified via the --intent flag |

The CLI log confirms the combined criteria:

building recipe from snapshot: criteria=criteria(service=eks, accelerator=h100, intent=training, os=ubuntu)
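
The same signals can be reproduced by hand. The commands below are only a rough illustration of the string matching described above, not the actual cnsctl detection code:

# service: the API server version string embeds the distribution ("-eks-")
kubectl version -o json | yq '.serverVersion.gitVersion'

# accelerator: the GPU product label contains the model name ("H100")
kubectl get nodes -l nvidia.com/gpu.present=true \
    -o jsonpath="{.items[0].metadata.labels['nvidia\.com/gpu\.product']}"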

6.4 The Generated Recipe

kind: recipeResult
apiVersion: cns.nvidia.com/v1alpha1
metadata:
  generatedAt: 2026-01-13T13:05:58.796166Z
  version: v0.17.2-next
  appliedOverlays:
    - h100-eks-training
criteria:
  service: eks
  accelerator: h100
  intent: training
  os: ubuntu
constraints:
  - name: K8s.server.version
    value: '>= 1.30'
  - name: OS.release.ID
    value: ubuntu
  - name: OS.release.VERSION_ID
    value: "24.04"
  - name: OS.sysctl./proc/sys/kernel/osrelease
    value: '>= 6.8'
componentRefs:
  - name: cert-manager
    version: v1.17.2
  - name: gpu-operator
    version: v25.10.1
    overrides:
      driver:
        version: 570.133.20  # Detected from cluster!
  - name: nvsentinel
    version: v0.6.0
  - name: skyhook
    version: v0.4.0
deploymentOrder:
  - cert-manager
  - gpu-operator
  - nvsentinel
  - skyhook

6.5 Key Observations

  1. Overlay Applied: The h100-eks-training overlay was selected based on criteria matching
  2. Version Lock: Driver version 570.133.20 was detected from the running cluster
  3. Deployment Order: Components are ordered by dependencies (cert-manager first)
  4. Constraints: Recipe includes validation constraints for later verification
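
One quick sanity check on observation 2: the driver version pinned in the recipe should match the version advertised by the GPU nodes (recipe path from section 6.2, label name from section 4.2):

# Version pinned in the recipe
yq '.componentRefs[] | select(.name == "gpu-operator") | .overrides.driver.version' /tmp/recipe.yaml

# Version reported by the GPU nodes
kubectl get nodes -l nvidia.com/gpu.present=true \
    -o jsonpath="{.items[*].metadata.labels['nvidia\.com/cuda\.driver-version\.full']}"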

7. Stage 3: Validate - Checking Compatibility

7.1 What the Validate Command Does

The validate command:

  1. Loads the recipe (constraints)
  2. Loads the snapshot (actual measurements)
  3. Compares each constraint against the actual values
  4. Reports pass/fail/skipped status for each

7.2 The Command I Ran

./dist/cnsctl_darwin_arm64_v8.0/cnsctl validate \
    --recipe /tmp/recipe.yaml \
    --snapshot cm://gpu-operator/cns-snapshot \
    --output /tmp/validation-results.yaml

7.3 Validation Results

summary:
  passed: 4
  failed: 0
  skipped: 0
  total: 4
  status: pass
  duration: 11.5µs
results:
  - name: K8s.server.version
    expected: '>= 1.30'
    actual: v1.34.1-eks-3025e55
    status: passed
  - name: OS.release.ID
    expected: ubuntu
    actual: ubuntu
    status: passed
  - name: OS.release.VERSION_ID
    expected: "24.04"
    actual: "24.04"
    status: passed
  - name: OS.sysctl./proc/sys/kernel/osrelease
    expected: '>= 6.8'
    actual: 6.8.0-1043-aws
    status: passed

7.4 Constraint Types

| Operator | Meaning | Example |
|---|---|---|
| >= | Greater than or equal (version) | >= 1.30 matches v1.34.1 |
| <= | Less than or equal (version) | <= 1.35 |
| == | Exact match | == ubuntu |
| != | Not equal | != rhel |
| (none) | Exact string match | ubuntu |
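
As a rough illustration only (cnsctl performs this comparison internally), a >= version constraint can be checked by hand with sort -V (requires a sort that supports -V, e.g. GNU coreutils):

# Does v1.34.1-eks-3025e55 satisfy ">= 1.30"?
actual="v1.34.1-eks-3025e55"
required="1.30"
lowest=$(printf '%s\n' "${actual#v}" "$required" | sort -V | head -n1)
[ "$lowest" = "$required" ] && echo "constraint '>= $required' satisfied by $actual"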

8. Stage 4: Bundle - Creating Deployment Artifacts

8.1 What the Bundle Command Does

The bundle command:

  1. Loads the recipe
  2. Invokes registered bundlers for each component
  3. Generates Helm values, manifests, and scripts
  4. Applies node selectors and tolerations
  5. Generates deployer-specific artifacts (ArgoCD, Flux, or scripts)

8.2 The Command I Ran

./dist/cnsctl_darwin_arm64_v8.0/cnsctl bundle \
    --recipe /tmp/recipe.yaml \
    --output /tmp/bundles \
    --system-node-selector nodeGroup=system-pool \
    --accelerated-node-selector nodeGroup=customer-gpu \
    --accelerated-node-toleration dedicated=user-workload:NoSchedule \
    --deployer argocd

8.3 Command Breakdown

| Flag | Purpose |
|---|---|
| --recipe | Path to the generated recipe |
| --output | Directory for generated artifacts |
| --system-node-selector | Where to run control-plane components (operator controllers) |
| --accelerated-node-selector | Where to run GPU workloads (device plugin, driver pods) |
| --accelerated-node-toleration | Taints to tolerate on GPU nodes |
| --deployer argocd | Generate ArgoCD Application manifests |

8.4 Generated Structure

bundles/
├── app-of-apps.yaml              # Parent ArgoCD Application
├── recipe.yaml                   # Copy of input recipe
├── README.md                     # Deployment instructions
├── cert-manager/
│   ├── values.yaml               # Helm values
│   ├── argocd/application.yaml   # ArgoCD Application
│   ├── scripts/{install,uninstall}.sh
│   ├── checksums.txt
│   └── README.md
├── gpu-operator/
│   ├── values.yaml               # Helm values with node selectors
│   ├── argocd/application.yaml   # ArgoCD Application (sync-wave: 1)
│   ├── manifests/                # Additional K8s manifests
│   │   ├── clusterpolicy.yaml
│   │   └── dcgm-exporter.yaml
│   ├── scripts/{install,uninstall}.sh
│   └── README.md
├── nvsentinel/
│   └── ... (similar structure)
└── skyhook/
    └── ... (similar structure)

8.5 Node Selector Application

The bundle command applied the node selectors I specified:

GPU Operator values.yaml (excerpt):

daemonsets:
  nodeSelector:
    nodeGroup: customer-gpu        # My --accelerated-node-selector
  tolerations:
    - effect: NoSchedule
      key: dedicated
      operator: Equal
      value: user-workload         # My --accelerated-node-toleration

operator:
  nodeSelector:
    nodeGroup: system-pool         # My --system-node-selector

9. What Happens After Stage 4?

After generating the bundles, the next steps depend on your deployment method:

9.1 ArgoCD Deployment (What We Generated)

# 1. Push bundles to Git repository
cd /tmp/bundles
git init
git add .
git commit -m "Add CNS bundle"
git remote add origin https://github.com/your-org/your-gitops-repo.git
git push -u origin main

# 2. Update Git URLs in ArgoCD manifests
#    (GNU sed shown; on macOS/BSD sed use: sed -i '' ...)
sed -i 's|<YOUR_GIT_REPO>|https://github.com/your-org/your-gitops-repo.git|g' \
    app-of-apps.yaml \
    */argocd/application.yaml

# 3. Deploy the app-of-apps
kubectl apply -f app-of-apps.yaml

# 4. Monitor deployment
argocd app list
argocd app sync cns-bundle

9.2 Manual Script Deployment (Alternative)

If using --deployer script:

# Deploy each component in order
cd bundles/cert-manager && ./scripts/install.sh
cd ../gpu-operator && ./scripts/install.sh
cd ../nvsentinel && ./scripts/install.sh
cd ../skyhook && ./scripts/install.sh

9.3 Verify Deployment

# Check all pods are running
kubectl get pods -A | grep -E "(gpu-operator|nvsentinel|skyhook|cert-manager)"

# Verify GPU nodes are ready
kubectl get nodes -l nvidia.com/gpu.present=true

# Test GPU access (kubectl v1.24+ removed --limits from `kubectl run`, so the
# GPU request is passed via --overrides; on this cluster the pod also needs a
# toleration for the dedicated=user-workload:NoSchedule taint)
kubectl run gpu-test --rm -it --restart=Never \
    --image=nvidia/cuda:12.0.0-base-ubuntu22.04 \
    --overrides='{"apiVersion":"v1","spec":{"tolerations":[{"key":"dedicated","operator":"Equal","value":"user-workload","effect":"NoSchedule"}],"containers":[{"name":"gpu-test","image":"nvidia/cuda:12.0.0-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}},"stdin":true,"tty":true}]}}'

9.4 Ongoing Operations

After initial deployment:

  1. Configuration Drift Detection: Periodically re-run snapshot + validate to detect drift (see the sketch after this list)
  2. Upgrades: Generate new recipe when upgrading components, validate, then deploy new bundle
  3. Scaling: Add new GPU nodes; CNS ensures consistent configuration
  4. Troubleshooting: Use snapshots to compare working vs. broken nodes
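
A minimal drift-check sketch, reusing the commands and file locations from earlier in this report (and the cnsctl alias from Appendix B); only the /tmp/drift-check.yaml output path is new here:

# Re-capture the current state of the GPU nodes
cnsctl snapshot --deploy-agent --namespace gpu-operator \
    --image ghcr.io/mchmarny/cns:latest \
    --node-selector nodeGroup=customer-gpu --cleanup

# Re-validate the fresh snapshot against the recipe that was deployed
cnsctl validate --recipe /tmp/recipe.yaml \
    --snapshot cm://gpu-operator/cns-snapshot \
    --output /tmp/drift-check.yaml

# Alert on anything other than an overall "pass"
if [ "$(yq '.summary.status' /tmp/drift-check.yaml)" != "pass" ]; then
    echo "configuration drift detected:"
    yq '.results[] | select(.status != "passed")' /tmp/drift-check.yaml
fi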

10. Key Architectural Concepts

10.1 ConfigMap URI Scheme

CNS uses a URI scheme for Kubernetes-native storage:

cm://namespace/name

Examples:

  • cm://gpu-operator/cns-snapshot - Snapshot stored in ConfigMap
  • cm://gpu-operator/cns-recipe - Recipe stored in ConfigMap

This enables:

  • No persistent volumes required
  • Agent writes directly to Kubernetes API
  • CLI can read/write across clusters with kubeconfig
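
In practice a cm:// URI is just a namespace/name pair, so it maps directly onto a kubectl read. The parsing below is a shell sketch; the snapshot.yaml data key is the one used by the snapshot ConfigMap (see Appendix A), and other ConfigMaps may use a different key:

uri="cm://gpu-operator/cns-snapshot"
path="${uri#cm://}"          # -> gpu-operator/cns-snapshot
namespace="${path%%/*}"      # -> gpu-operator
name="${path#*/}"            # -> cns-snapshot
kubectl -n "$namespace" get configmap "$name" -o jsonpath='{.data.snapshot\.yaml}'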

10.2 Overlay System

Recipes are built from a base + overlays:

Base Measurements (universal)
    ↓
+ Overlay: service=eks
    ↓
+ Overlay: gpu=h100
    ↓
+ Overlay: intent=training
    ↓
= Final Recipe

10.3 Deployer Pattern

Three deployment methods are available:

| Deployer | Generated Artifacts | Use Case |
|---|---|---|
| script | Shell scripts | Manual deployment, testing |
| argocd | ArgoCD Applications | GitOps with ArgoCD |
| flux | Flux HelmReleases | GitOps with Flux |

10.4 Sync Waves (ArgoCD)

ArgoCD uses sync-wave annotations for ordering:

# cert-manager deploys first (wave 0)
annotations:
  argocd.argoproj.io/sync-wave: "0"

# gpu-operator deploys second (wave 1)
annotations:
  argocd.argoproj.io/sync-wave: "1"
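
To confirm the ordering the bundler produced, the annotations can be listed straight from the generated bundle (paths as shown in sections 8.2 and 8.4):

grep "sync-wave" /tmp/bundles/*/argocd/application.yaml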

Appendix A: ConfigMap Contents

The snapshot ConfigMap contains the complete system state captured from the GPU node.

Viewing the ConfigMap

kubectl -n gpu-operator get cm cns-snapshot -o jsonpath='{.data.snapshot\.yaml}' | yq .

Key Sections of the Snapshot

A.1 Metadata

kind: Snapshot
apiVersion: cns.nvidia.com/v1alpha1
metadata:
    source-node: ip-10-0-182-128.ec2.internal
    timestamp: "2026-01-13T13:02:29Z"
    version: 0.17.2

A.2 OS Release Information

- type: OS
  subtypes:
    - subtype: release
      data:
        ID: ubuntu
        VERSION_ID: "24.04"
        VERSION_CODENAME: noble
        PRETTY_NAME: Ubuntu 24.04.3 LTS

A.3 Kernel Information

    - subtype: sysctl
      data:
        /proc/sys/kernel/osrelease: 6.8.0-1043-aws
        /proc/sys/kernel/hostname: ip-10-0-182-128.ec2.internal

A.4 Kubernetes Server Version

- type: K8s
  subtypes:
    - subtype: server
      data:
        version: v1.34.1-eks-3025e55
        platform: linux/amd64

A.5 GPU Information

- type: GPU
  subtypes:
    - subtype: smi
      data:
        driver-version: "570.133.20"
        cuda-version: "12.8"
        gpu.count: "8"
        gpu.name: NVIDIA H100 80GB HBM3
        gpu.memory: "81559 MiB"

Appendix B: Complete Command Reference

B.1 Building cnsctl from Source

# Prerequisites: Go 1.22+, goreleaser

# Clone repository
git clone https://github.com/mchmarny/cloud-native-stack.git
cd cloud-native-stack

# Build
make build

# Binary location
./dist/cnsctl_darwin_arm64_v8.0/cnsctl -v

B.2 Full E2E Command Sequence

# Set alias for convenience
alias cnsctl='./dist/cnsctl_darwin_arm64_v8.0/cnsctl'

# Stage 1: Snapshot
cnsctl snapshot \
    --deploy-agent \
    --namespace gpu-operator \
    --image ghcr.io/mchmarny/cns:latest \
    --node-selector nodeGroup=customer-gpu \
    --cleanup

# View snapshot
kubectl -n gpu-operator get cm cns-snapshot -o jsonpath='{.data.snapshot\.yaml}' | yq .

# Stage 2: Recipe
cnsctl recipe \
    --snapshot cm://gpu-operator/cns-snapshot \
    --intent training \
    --output recipe.yaml

# Stage 3: Validate
cnsctl validate \
    --recipe recipe.yaml \
    --snapshot cm://gpu-operator/cns-snapshot \
    --output validation-results.yaml

# Stage 4: Bundle
cnsctl bundle \
    --recipe recipe.yaml \
    --output ./bundles \
    --system-node-selector nodeGroup=system-pool \
    --accelerated-node-selector nodeGroup=customer-gpu \
    --accelerated-node-toleration dedicated=user-workload:NoSchedule \
    --deployer argocd

B.3 Alternative: Recipe from Parameters (No Snapshot)

# Generate recipe directly from parameters
cnsctl recipe \
    --service eks \
    --accelerator h100 \
    --os ubuntu \
    --intent training \
    --output recipe.yaml

Appendix C: Generated Bundle Files

C.1 GPU Operator values.yaml (Complete)

# GPU Operator Helm Values
# Generated from Cloud Native Stack Recipe
# Timestamp: 2026-01-13T08:06:32-05:00

cdi:
  default: false
  enabled: true
daemonsets:
  nodeSelector:
    nodeGroup: customer-gpu
  tolerations:
    - effect: NoSchedule
      key: dedicated
      operator: Equal
      value: user-workload
dcgm:
  enabled: true
dcgmExporter:
  config:
    create: true
    name: dcgm-exporter
  serviceMonitor:
    enabled: true
    interval: 60s
devicePlugin:
  env:
    - name: DP_DISABLE_HEALTHCHECKS
      value: "109"
    - name: DEVICE_LIST_STRATEGY
      value: volume-mounts
driver:
  enabled: true
  kernelModuleConfig:
    name: kernel-module-params
  maxParallelUpgrades: 5
  rdma:
    enabled: true
  useOpenKernelModules: true
  version: 570.133.20
gdrcopy:
  enabled: false
  version: v2.5
gfd:
  enabled: true
hostPaths:
  driverInstallDir: /run/nvidia/driver
migManager:
  enabled: true
node-feature-discovery:
  gc:
    nodeSelector:
      nodeGroup: system-pool
    tolerations:
      - operator: Exists
  master:
    nodeSelector:
      nodeGroup: system-pool
    tolerations:
      - operator: Exists
  worker:
    nodeSelector:
      nodeGroup: customer-gpu
    tolerations:
      - effect: NoSchedule
        key: dedicated
        operator: Equal
        value: user-workload
operator:
  nodeSelector:
    nodeGroup: system-pool
  resources:
    limits:
      cpu: 500m
      memory: 700Mi
    requests:
      cpu: 200m
      memory: 300Mi
  tolerations:
    - operator: Exists
  upgradeCRD: true
toolkit:
  enabled: true
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"

C.2 ArgoCD Application (GPU Operator)

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  sources:
    # Helm chart source with values from bundle
    - repoURL: 'https://helm.ngc.nvidia.com/nvidia'
      targetRevision: 'v25.10.1'
      chart: gpu-operator
      helm:
        releaseName: gpu-operator
        valueFiles:
          - $values/gpu-operator/values.yaml
    # Reference to Git repository for values.yaml file
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      ref: values
    # Additional manifests from the component's manifests directory
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: gpu-operator/manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true

C.3 App-of-Apps (Parent Application)

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cns-bundle
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  sources:
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: cert-manager/argocd
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: gpu-operator/argocd
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: nvsentinel/argocd
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: skyhook/argocd
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Summary

Cloud Native Stack provides a robust, automated approach to GPU infrastructure configuration. The four-stage workflow (Snapshot → Recipe → Validate → Bundle) ensures that:

  1. Configuration is captured accurately from running systems
  2. Recommendations are drawn from known-good, validated configurations
  3. Compatibility is verified before deployment
  4. Artifacts are generated ready for GitOps workflows

This approach eliminates manual configuration drift, ensures consistency across environments, and provides a foundation for reliable GPU-accelerated Kubernetes operations.


Report generated by Claude Opus 4.5 analyzing the Cloud Native Stack codebase and executing the E2E demo workflow.
