- Date: January 13, 2026
- Cluster: AWS EKS (us-east-1)
- CNS Version: v0.17.2-next
- Author: Claude Opus 4.5 Analysis
- Executive Summary
- What is Cloud Native Stack?
- The Four-Stage Workflow
- Environment Discovery and Parameter Selection
- Stage 1: Snapshot - Capturing System State
- Stage 2: Recipe - Generating Configuration Recommendations
- Stage 3: Validate - Checking Compatibility
- Stage 4: Bundle - Creating Deployment Artifacts
- What Happens After Stage 4?
- Key Architectural Concepts
- Appendix A: ConfigMap Contents
- Appendix B: Complete Command Reference
- Appendix C: Generated Bundle Files
This report documents the end-to-end workflow of NVIDIA's Cloud Native Stack (CNS) - a suite of tooling designed to automate the complexity of deploying GPU-accelerated Kubernetes infrastructure. We executed all four stages of the CNS workflow on a live AWS EKS cluster with H100 GPUs and successfully:
- Stage 1: Captured a comprehensive system snapshot from GPU nodes
- Stage 2: Generated an optimized recipe based on the detected environment
- Stage 3: Validated the cluster configuration against recipe constraints
- Stage 4: Generated ArgoCD deployment artifacts ready for GitOps deployment
All stages completed successfully with 4 out of 4 validation constraints passing.
Running NVIDIA-accelerated Kubernetes clusters reliably is challenging. Small differences in:
- Kernel versions
- GPU drivers
- Container runtimes
- Kubernetes releases
- Operating system configurations
...can cause failures that are difficult to diagnose and expensive to reproduce.
Historically, this knowledge lived in internal validation pipelines, playbooks, and tribal knowledge. Cloud Native Stack externalizes that experience, making validated configurations visible, repeatable, and reusable.
CNS is a source of validated configuration knowledge for NVIDIA-accelerated Kubernetes environments. It includes:
| Component | Description |
|---|---|
| cnsctl | Command-line tool implementing all four workflow stages |
| cnsd | API server for programmatic access to recipes and bundles |
| Agent | Kubernetes Job that captures snapshots on GPU nodes |
| Recipe Data | Embedded database of validated configurations |
| Bundlers | Plugins that generate deployment artifacts (GPU Operator, NVSentinel, Skyhook, cert-manager) |
| Deployers | GitOps integration (ArgoCD, Flux, or shell scripts) |
What CNS is not:

- Not a Kubernetes distribution
- Not a cluster provisioning or lifecycle management system
- Not a managed control plane or hosted service
- Not a replacement for cloud provider platforms
CNS operates through a logical four-stage workflow that transforms raw system state into deployable packages:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Snapshot │─────▶│ Recipe │─────▶│ Validate │─────▶│ Bundle │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
Stage 1 Stage 2 Stage 3 Stage 4
(Capture) (Optimize) (Check) (Package)
| Stage | Purpose | Input | Output |
|---|---|---|---|
| Snapshot | Capture system configuration | Live cluster/node | YAML snapshot (file or ConfigMap) |
| Recipe | Generate optimized configuration | Snapshot or query parameters | Recipe with component versions |
| Validate | Check cluster compatibility | Recipe + Snapshot | Validation results (pass/fail) |
| Bundle | Create deployment artifacts | Recipe | Helm values, manifests, scripts |
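The same flow can be expressed as the four CLI calls used later in this report (flags abbreviated; `cnsctl` stands for the built binary, see Appendix B for the alias and full flag sets):

```bash
# Stages 1-4 chained end to end
cnsctl snapshot --deploy-agent --namespace gpu-operator \
  --image ghcr.io/mchmarny/cns:latest --node-selector nodeGroup=customer-gpu --cleanup
cnsctl recipe   --snapshot cm://gpu-operator/cns-snapshot --intent training --output recipe.yaml
cnsctl validate --recipe recipe.yaml --snapshot cm://gpu-operator/cns-snapshot --output results.yaml
cnsctl bundle   --recipe recipe.yaml --output ./bundles --deployer argocd
```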
Before executing the CNS workflow, I needed to understand the cluster environment to select appropriate parameters. Here's exactly how I discovered the necessary information:
# 1. Verify cluster connectivity
kubectl cluster-info
# Result: EKS cluster in us-east-1
# 2. List all nodes
kubectl get nodes -o wide
# Found 9 nodes with various roles
# 3. Find GPU nodes specifically
kubectl get nodes -l nvidia.com/gpu.present=true -o wide
# Found 2 nodes: ip-10-0-180-238 and ip-10-0-182-128

I examined the GPU nodes in detail:

kubectl describe node ip-10-0-180-238.ec2.internal | grep -E "(nvidia|gpu)" -i

Key findings from node labels:

- `nvidia.com/gpu.present=true` - Confirms GPU presence
- `nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3` - H100 GPUs
- `nvidia.com/gpu.count=8` - 8 GPUs per node
- `nvidia.com/cuda.driver-version.full=570.133.20` - Driver version
- `nodeGroup=customer-gpu` - This became my node selector
- `dedicated=user-workload` - Node taint key/value
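To confirm the taint key, value, and effect before reusing them as a toleration, the node spec can be inspected directly (standard kubectl; node name taken from the discovery above):

```bash
# Show taints on one of the GPU nodes
kubectl get node ip-10-0-180-238.ec2.internal -o jsonpath='{.spec.taints}{"\n"}'
# Expected something like: [{"effect":"NoSchedule","key":"dedicated","value":"user-workload"}]
```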
Based on the cluster discovery, I selected these parameters:
| Parameter | Value | How I Found It |
|---|---|---|
| `--namespace` | `gpu-operator` | Standard namespace for GPU components |
| `--node-selector` | `nodeGroup=customer-gpu` | Label on GPU nodes |
| `--accelerated-node-toleration` | `dedicated=user-workload:NoSchedule` | Taint on GPU nodes |
| `--system-node-selector` | `nodeGroup=system-pool` | Label on non-GPU nodes (from role system-cpu) |
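Before committing to these values, the selectors can be sanity-checked against the live cluster (standard kubectl; labels as discovered above):

```bash
# Confirm the system-pool label exists on the non-GPU nodes
kubectl get nodes -l nodeGroup=system-pool -o wide

# Confirm the GPU selector matches exactly the two GPU nodes
kubectl get nodes -l nodeGroup=customer-gpu
```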
The snapshot command deploys a Kubernetes Job to a GPU node that:
- Creates RBAC resources (ServiceAccount, Role, RoleBinding)
- Runs a pod on a GPU node
- Collects comprehensive system information
- Writes results to a ConfigMap
- Cleans up the Job (with the `--cleanup` flag)
./dist/cnsctl_darwin_arm64_v8.0/cnsctl snapshot \
--deploy-agent \
--namespace gpu-operator \
--image ghcr.io/mchmarny/cns:latest \
--node-selector nodeGroup=customer-gpu \
--cleanup

| Flag | Purpose |
|---|---|
| `--deploy-agent` | Deploy a Kubernetes Job instead of a local snapshot |
| `--namespace gpu-operator` | Target namespace for the Job and ConfigMap |
| `--image ghcr.io/mchmarny/cns:latest` | Container image for the agent |
| `--node-selector nodeGroup=customer-gpu` | Schedule the Job on GPU nodes only |
| `--cleanup` | Remove the Job after completion (keep the ConfigMap) |
deploying agent: namespace=gpu-operator
agent deployed successfully
waiting for Job completion: job=cns timeout=5m0s
job completed successfully
snapshot saved to ConfigMap: uri=cm://gpu-operator/cns-snapshot
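While the agent is running (before `--cleanup` removes the Job), progress can be followed with standard kubectl; the Job name `cns` comes from the log output above:

```bash
# Watch the agent Job until it completes
kubectl -n gpu-operator get job cns -w

# Tail the agent logs (pods created by the Job carry the job-name label)
kubectl -n gpu-operator logs -l job-name=cns -f
```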
The snapshot captures four categories of measurements:
| Type | Subtypes | Example Data |
|---|---|---|
| SystemD | containerd.service | CPUAccounting, MemoryAccounting, CgroupVersion |
| OS | grub, kmod, sysctl, release | Kernel parameters, loaded modules, OS version |
| K8s | server, image, policy | Kubernetes version, container images |
| GPU | smi, driver, device | CUDA version, driver version, GPU model |
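To spot-check one of these categories without waiting for later stages, the stored snapshot can be read back directly (same read command as Appendix A; the grep pattern is illustrative):

```bash
# Print just the GPU/smi measurements from the stored snapshot
kubectl -n gpu-operator get cm cns-snapshot \
  -o jsonpath='{.data.snapshot\.yaml}' | yq . | grep -A 6 'subtype: smi'
```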
The recipe command:
- Loads the snapshot from the ConfigMap
- Automatically extracts criteria (service type, GPU type, OS) from the snapshot
- Applies CLI overrides (like `--intent training`)
- Matches criteria against embedded overlay rules
- Generates optimized component recommendations
./dist/cnsctl_darwin_arm64_v8.0/cnsctl recipe \
--snapshot cm://gpu-operator/cns-snapshot \
--intent training \
--output /tmp/recipe.yaml

The recipe command automatically detected from the snapshot:
| Criterion | Detected Value | How It Was Detected |
|---|---|---|
| service | `eks` | K8s version string contains `-eks-` |
| accelerator | `h100` | GPU model contains H100 |
| os | `ubuntu` | `/etc/os-release` ID field |
| intent | `training` | Specified via `--intent` flag |
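As an illustration of the first row, the service type can be inferred from the API server version string alone (standard kubectl and jq shown here; the actual detection heuristic lives inside cnsctl):

```bash
# The EKS-built API server embeds "-eks-" in its version string
kubectl version -o json | jq -r '.serverVersion.gitVersion'
# v1.34.1-eks-3025e55  ->  criteria: service=eks
```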
building recipe from snapshot: criteria=criteria(service=eks, accelerator=h100, intent=training, os=ubuntu)
kind: recipeResult
apiVersion: cns.nvidia.com/v1alpha1
metadata:
generatedAt: 2026-01-13T13:05:58.796166Z
version: v0.17.2-next
appliedOverlays:
- h100-eks-training
criteria:
service: eks
accelerator: h100
intent: training
os: ubuntu
constraints:
- name: K8s.server.version
value: '>= 1.30'
- name: OS.release.ID
value: ubuntu
- name: OS.release.VERSION_ID
value: "24.04"
- name: OS.sysctl./proc/sys/kernel/osrelease
value: '>= 6.8'
componentRefs:
- name: cert-manager
version: v1.17.2
- name: gpu-operator
version: v25.10.1
overrides:
driver:
version: 570.133.20 # Detected from cluster!
- name: nvsentinel
version: v0.6.0
- name: skyhook
version: v0.4.0
deploymentOrder:
- cert-manager
- gpu-operator
- nvsentinel
- skyhook

Key observations:

- Overlay Applied: The `h100-eks-training` overlay was selected based on criteria matching
- Version Lock: Driver version `570.133.20` was detected from the running cluster
- Deployment Order: Components are ordered by dependencies (cert-manager first)
- Constraints: The recipe includes validation constraints for later verification
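A quick way to preview which checks Stage 3 will run is to pull the constraints block out of the recipe (field names as in the YAML above):

```bash
# List the validation constraints embedded in the recipe
yq '.constraints' /tmp/recipe.yaml
```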
The validate command:
- Loads the recipe (constraints)
- Loads the snapshot (actual measurements)
- Compares each constraint against the actual values
- Reports pass/fail/skipped status for each
./dist/cnsctl_darwin_arm64_v8.0/cnsctl validate \
--recipe /tmp/recipe.yaml \
--snapshot cm://gpu-operator/cns-snapshot \
--output /tmp/validation-results.yaml

summary:
passed: 4
failed: 0
skipped: 0
total: 4
status: pass
duration: 11.5µs
results:
- name: K8s.server.version
expected: '>= 1.30'
actual: v1.34.1-eks-3025e55
status: passed
- name: OS.release.ID
expected: ubuntu
actual: ubuntu
status: passed
- name: OS.release.VERSION_ID
expected: "24.04"
actual: "24.04"
status: passed
- name: OS.sysctl./proc/sys/kernel/osrelease
expected: '>= 6.8'
actual: 6.8.0-1043-aws
status: passed

| Operator | Meaning | Example |
|---|---|---|
| `>=` | Greater than or equal (version) | `>= 1.30` matches v1.34.1 |
| `<=` | Less than or equal (version) | `<= 1.35` |
| `==` | Exact match | `== ubuntu` |
| `!=` | Not equal | `!= rhel` |
| (none) | Exact string match | `ubuntu` |
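Because the result file carries a machine-readable summary (see the `summary` block above), a deployment pipeline can gate on it; a minimal sketch, assuming the field names shown in this section:

```bash
# Fail the pipeline unless every constraint passed
status=$(yq '.summary.status' /tmp/validation-results.yaml)
if [ "$status" != "pass" ]; then
  echo "CNS validation failed (status: $status)" >&2
  exit 1
fi
```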
The bundle command:
- Loads the recipe
- Invokes registered bundlers for each component
- Generates Helm values, manifests, and scripts
- Applies node selectors and tolerations
- Generates deployer-specific artifacts (ArgoCD, Flux, or scripts)
./dist/cnsctl_darwin_arm64_v8.0/cnsctl bundle \
--recipe /tmp/recipe.yaml \
--output /tmp/bundles \
--system-node-selector nodeGroup=system-pool \
--accelerated-node-selector nodeGroup=customer-gpu \
--accelerated-node-toleration dedicated=user-workload:NoSchedule \
--deployer argocd

| Flag | Purpose |
|---|---|
| `--recipe` | Path to the generated recipe |
| `--output` | Directory for generated artifacts |
| `--system-node-selector` | Where to run control-plane components (operator controllers) |
| `--accelerated-node-selector` | Where to run GPU workloads (device plugin, driver pods) |
| `--accelerated-node-toleration` | Tolerate taints on GPU nodes |
| `--deployer argocd` | Generate ArgoCD Application manifests |
bundles/
├── app-of-apps.yaml # Parent ArgoCD Application
├── recipe.yaml # Copy of input recipe
├── README.md # Deployment instructions
├── cert-manager/
│ ├── values.yaml # Helm values
│ ├── argocd/application.yaml # ArgoCD Application
│ ├── scripts/{install,uninstall}.sh
│ ├── checksums.txt
│ └── README.md
├── gpu-operator/
│ ├── values.yaml # Helm values with node selectors
│ ├── argocd/application.yaml # ArgoCD Application (sync-wave: 1)
│ ├── manifests/ # Additional K8s manifests
│ │ ├── clusterpolicy.yaml
│ │ └── dcgm-exporter.yaml
│ ├── scripts/{install,uninstall}.sh
│ └── README.md
├── nvsentinel/
│ └── ... (similar structure)
└── skyhook/
└── ... (similar structure)
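Each component directory ships a checksums.txt (shown in the tree above). Assuming standard sha256 checksums, the generated files can be verified before committing them to Git:

```bash
# Verify the generated cert-manager artifacts against the bundled checksums
cd /tmp/bundles/cert-manager && sha256sum -c checksums.txt
```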
The bundle command applied the node selectors I specified:
GPU Operator values.yaml (excerpt):
daemonsets:
nodeSelector:
nodeGroup: customer-gpu # My --accelerated-node-selector
tolerations:
- effect: NoSchedule
key: dedicated
operator: Equal
value: user-workload # My --accelerated-node-toleration
operator:
nodeSelector:
nodeGroup: system-pool        # My --system-node-selector

After generating the bundles, the next steps depend on your deployment method:
# 1. Push bundles to Git repository
cd /tmp/bundles
git init
git add .
git commit -m "Add CNS bundle"
git remote add origin https://github.com/your-org/your-gitops-repo.git
git push -u origin main
# 2. Update Git URLs in ArgoCD manifests
sed -i 's|<YOUR_GIT_REPO>|https://github.com/your-org/your-gitops-repo.git|g' \
app-of-apps.yaml \
*/argocd/application.yaml
# 3. Deploy the app-of-apps
kubectl apply -f app-of-apps.yaml
# 4. Monitor deployment
argocd app list
argocd app sync cns-bundle

If using --deployer script:
# Deploy each component in order
cd bundles/cert-manager && ./scripts/install.sh
cd ../gpu-operator && ./scripts/install.sh
cd ../nvsentinel && ./scripts/install.sh
cd ../skyhook && ./scripts/install.sh

# Check all pods are running
kubectl get pods -A | grep -E "(gpu-operator|nvsentinel|skyhook|cert-manager)"
# Verify GPU nodes are ready
kubectl get nodes -l nvidia.com/gpu.present=true
# Test GPU access (note: recent kubectl releases deprecate/remove the --limits flag,
# and a toleration for the dedicated=user-workload taint may be needed to schedule
# this pod on the GPU nodes)
kubectl run gpu-test --rm -it --restart=Never \
  --image=nvidia/cuda:12.0.0-base-ubuntu22.04 \
  --limits=nvidia.com/gpu=1 \
  -- nvidia-smi

After initial deployment:
- Configuration Drift Detection: Periodically run snapshot + validate to detect drift (see the sketch after this list)
- Upgrades: Generate new recipe when upgrading components, validate, then deploy new bundle
- Scaling: Add new GPU nodes; CNS ensures consistent configuration
- Troubleshooting: Use snapshots to compare working vs. broken nodes
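A minimal drift-check sketch for the first item, composed from the same commands used in Stages 1 and 3 (paths and the `cm://` URI as used throughout this report):

```bash
# Re-capture current state and re-validate it against the recipe already in use
cnsctl snapshot --deploy-agent --namespace gpu-operator \
  --image ghcr.io/mchmarny/cns:latest \
  --node-selector nodeGroup=customer-gpu --cleanup

cnsctl validate --recipe /tmp/recipe.yaml \
  --snapshot cm://gpu-operator/cns-snapshot \
  --output /tmp/drift-check.yaml

yq '.summary' /tmp/drift-check.yaml   # any "failed" > 0 indicates drift
```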
CNS uses a URI scheme for Kubernetes-native storage:
cm://namespace/name
Examples:
- `cm://gpu-operator/cns-snapshot` - Snapshot stored in a ConfigMap
- `cm://gpu-operator/cns-recipe` - Recipe stored in a ConfigMap
This enables:
- No persistent volumes required
- Agent writes directly to Kubernetes API
- CLI can read/write across clusters with kubeconfig
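For example, pointing the CLI at a different cluster is a matter of switching kubeconfig before referencing a `cm://` URI (this assumes cnsctl honors the standard KUBECONFIG/current-context resolution, as kubectl and most client-go tools do):

```bash
# Read a snapshot stored in another cluster's ConfigMap
export KUBECONFIG=~/.kube/other-cluster.yaml
cnsctl recipe --snapshot cm://gpu-operator/cns-snapshot --intent training --output recipe.yaml
```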
Recipes are built from a base + overlays:
Base Measurements (universal)
↓
+ Overlay: service=eks
↓
+ Overlay: gpu=h100
↓
+ Overlay: intent=training
↓
= Final Recipe
Three deployment methods are available:
| Deployer | Generated Artifacts | Use Case |
|---|---|---|
| `script` | Shell scripts | Manual deployment, testing |
| `argocd` | ArgoCD Applications | GitOps with ArgoCD |
| `flux` | Flux HelmReleases | GitOps with Flux |
ArgoCD uses sync-wave annotations for ordering:
# cert-manager deploys first (wave 0)
annotations:
argocd.argoproj.io/sync-wave: "0"
# gpu-operator deploys second (wave 1)
annotations:
argocd.argoproj.io/sync-wave: "1"

The snapshot ConfigMap contains the complete system state captured from the GPU node.
kubectl -n gpu-operator get cm cns-snapshot -o jsonpath='{.data.snapshot\.yaml}' | yq .

kind: Snapshot
apiVersion: cns.nvidia.com/v1alpha1
metadata:
source-node: ip-10-0-182-128.ec2.internal
timestamp: "2026-01-13T13:02:29Z"
version: 0.17.2

- type: OS
subtypes:
- subtype: release
data:
ID: ubuntu
VERSION_ID: "24.04"
VERSION_CODENAME: noble
PRETTY_NAME: Ubuntu 24.04.3 LTS
- subtype: sysctl
data:
/proc/sys/kernel/osrelease: 6.8.0-1043-aws
/proc/sys/kernel/hostname: ip-10-0-182-128.ec2.internal

- type: K8s
subtypes:
- subtype: server
data:
version: v1.34.1-eks-3025e55
platform: linux/amd64

- type: GPU
subtypes:
- subtype: smi
data:
driver-version: "570.133.20"
cuda-version: "12.8"
gpu.count: "8"
gpu.name: NVIDIA H100 80GB HBM3
gpu.memory: "81559 MiB"

# Prerequisites: Go 1.22+, goreleaser
# Clone repository
git clone https://github.com/mchmarny/cloud-native-stack.git
cd cloud-native-stack
# Build
make build
# Binary location
./dist/cnsctl_darwin_arm64_v8.0/cnsctl -v

# Set alias for convenience
alias cnsctl='./dist/cnsctl_darwin_arm64_v8.0/cnsctl'
# Stage 1: Snapshot
cnsctl snapshot \
--deploy-agent \
--namespace gpu-operator \
--image ghcr.io/mchmarny/cns:latest \
--node-selector nodeGroup=customer-gpu \
--cleanup
# View snapshot
kubectl -n gpu-operator get cm cns-snapshot -o jsonpath='{.data.snapshot\.yaml}' | yq .
# Stage 2: Recipe
cnsctl recipe \
--snapshot cm://gpu-operator/cns-snapshot \
--intent training \
--output recipe.yaml
# Stage 3: Validate
cnsctl validate \
--recipe recipe.yaml \
--snapshot cm://gpu-operator/cns-snapshot \
--output validation-results.yaml
# Stage 4: Bundle
cnsctl bundle \
--recipe recipe.yaml \
--output ./bundles \
--system-node-selector nodeGroup=system-pool \
--accelerated-node-selector nodeGroup=customer-gpu \
--accelerated-node-toleration dedicated=user-workload:NoSchedule \
--deployer argocd

# Generate recipe directly from parameters
cnsctl recipe \
--service eks \
--accelerator h100 \
--os ubuntu \
--intent training \
--output recipe.yaml

# GPU Operator Helm Values
# Generated from Cloud Native Stack Recipe
# Timestamp: 2026-01-13T08:06:32-05:00
cdi:
default: false
enabled: true
daemonsets:
nodeSelector:
nodeGroup: customer-gpu
tolerations:
- effect: NoSchedule
key: dedicated
operator: Equal
value: user-workload
dcgm:
enabled: true
dcgmExporter:
config:
create: true
name: dcgm-exporter
serviceMonitor:
enabled: true
interval: 60s
devicePlugin:
env:
- name: DP_DISABLE_HEALTHCHECKS
value: "109"
- name: DEVICE_LIST_STRATEGY
value: volume-mounts
driver:
enabled: true
kernelModuleConfig:
name: kernel-module-params
maxParallelUpgrades: 5
rdma:
enabled: true
useOpenKernelModules: true
version: 570.133.20
gdrcopy:
enabled: false
version: v2.5
gfd:
enabled: true
hostPaths:
driverInstallDir: /run/nvidia/driver
migManager:
enabled: true
node-feature-discovery:
gc:
nodeSelector:
nodeGroup: system-pool
tolerations:
- operator: Exists
master:
nodeSelector:
nodeGroup: system-pool
tolerations:
- operator: Exists
worker:
nodeSelector:
nodeGroup: customer-gpu
tolerations:
- effect: NoSchedule
key: dedicated
operator: Equal
value: user-workload
operator:
nodeSelector:
nodeGroup: system-pool
resources:
limits:
cpu: 500m
memory: 700Mi
requests:
cpu: 200m
memory: 300Mi
tolerations:
- operator: Exists
upgradeCRD: true
toolkit:
enabled: true
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: gpu-operator
namespace: argocd
annotations:
argocd.argoproj.io/sync-wave: "1"
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: default
sources:
# Helm chart source with values from bundle
- repoURL: 'https://helm.ngc.nvidia.com/nvidia'
targetRevision: 'v25.10.1'
chart: gpu-operator
helm:
releaseName: gpu-operator
valueFiles:
- $values/gpu-operator/values.yaml
# Reference to Git repository for values.yaml file
- repoURL: <YOUR_GIT_REPO>
targetRevision: main
ref: values
# Additional manifests from the component's manifests directory
- repoURL: <YOUR_GIT_REPO>
targetRevision: main
path: gpu-operator/manifests
destination:
server: https://kubernetes.default.svc
namespace: gpu-operator
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
- ServerSideApply=true

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: cns-bundle
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: default
sources:
- repoURL: <YOUR_GIT_REPO>
targetRevision: main
path: cert-manager/argocd
- repoURL: <YOUR_GIT_REPO>
targetRevision: main
path: gpu-operator/argocd
- repoURL: <YOUR_GIT_REPO>
targetRevision: main
path: nvsentinel/argocd
- repoURL: <YOUR_GIT_REPO>
targetRevision: main
path: skyhook/argocd
destination:
server: https://kubernetes.default.svc
namespace: argocd
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true

Cloud Native Stack provides a robust, automated approach to GPU infrastructure configuration. The four-stage workflow (Snapshot → Recipe → Validate → Bundle) ensures that:
- Configuration is captured accurately from running systems
- Recommendations are validated against known-good configurations
- Compatibility is verified before deployment
- Artifacts are generated ready for GitOps workflows
This approach eliminates manual configuration drift, ensures consistency across environments, and provides a foundation for reliable GPU-accelerated Kubernetes operations.
Report generated by Claude Opus 4.5 analyzing the Cloud Native Stack codebase and executing the E2E demo workflow.