
Cloud Native Stack (CNS) End-to-End Demo Report

Date: January 13, 2026
Cluster: AWS EKS (us-east-1)
CNS Version: v0.17.2-next
Author: Claude Opus 4.5 Analysis


Table of Contents

  1. Executive Summary
  2. What is Cloud Native Stack?
  3. The Four-Stage Workflow
  4. Environment Discovery and Parameter Selection
  5. Stage 1: Snapshot - Capturing System State
  6. Stage 2: Recipe - Generating Configuration Recommendations
  7. Stage 3: Validate - Checking Compatibility
  8. Stage 4: Bundle - Creating Deployment Artifacts
  9. What Happens After Stage 4?
  10. Key Architectural Concepts
  11. Appendix A: ConfigMap Contents
  12. Appendix B: Complete Command Reference
  13. Appendix C: Generated Bundle Files

1. Executive Summary

This report documents the end-to-end workflow of NVIDIA's Cloud Native Stack (CNS), a suite of tooling designed to take the complexity out of deploying GPU-accelerated Kubernetes infrastructure. We executed all four stages of the CNS workflow on a live AWS EKS cluster with H100 GPUs and successfully:

  • Stage 1: Captured a comprehensive system snapshot from GPU nodes
  • Stage 2: Generated an optimized recipe based on the detected environment
  • Stage 3: Validated the cluster configuration against recipe constraints
  • Stage 4: Generated ArgoCD deployment artifacts ready for GitOps deployment

All stages completed successfully with 4 out of 4 validation constraints passing.


2. What is Cloud Native Stack?

The Problem CNS Solves

Running NVIDIA-accelerated Kubernetes clusters reliably is challenging. Small differences in:

  • Kernel versions
  • GPU drivers
  • Container runtimes
  • Kubernetes releases
  • Operating system configurations

...can cause failures that are difficult to diagnose and expensive to reproduce.

Historically, this expertise lived in internal validation pipelines, playbooks, and tribal knowledge. Cloud Native Stack externalizes that experience, making validated configurations visible, repeatable, and reusable.

What CNS Is

CNS is a source of validated configuration knowledge for NVIDIA-accelerated Kubernetes environments. It includes:

| Component | Description |
|---|---|
| cnsctl | Command-line tool implementing all four workflow stages |
| cnsd | API server for programmatic access to recipes and bundles |
| Agent | Kubernetes Job that captures snapshots on GPU nodes |
| Recipe Data | Embedded database of validated configurations |
| Bundlers | Plugins that generate deployment artifacts (GPU Operator, NVSentinel, Skyhook, cert-manager) |
| Deployers | GitOps integration (ArgoCD, Flux, or shell scripts) |

What CNS Is NOT

  • Not a Kubernetes distribution
  • Not a cluster provisioning or lifecycle management system
  • Not a managed control plane or hosted service
  • Not a replacement for cloud provider platforms

3. The Four-Stage Workflow

CNS operates through a logical four-stage workflow that transforms raw system state into deployable packages:

┌──────────────┐      ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Snapshot   │─────▶│    Recipe    │─────▶│   Validate   │─────▶│    Bundle    │
└──────────────┘      └──────────────┘      └──────────────┘      └──────────────┘
     Stage 1              Stage 2               Stage 3              Stage 4
   (Capture)            (Optimize)             (Check)             (Package)

| Stage | Purpose | Input | Output |
|---|---|---|---|
| Snapshot | Capture system configuration | Live cluster/node | YAML snapshot (file or ConfigMap) |
| Recipe | Generate optimized configuration | Snapshot or query parameters | Recipe with component versions |
| Validate | Check cluster compatibility | Recipe + Snapshot | Validation results (pass/fail) |
| Bundle | Create deployment artifacts | Recipe | Helm values, manifests, scripts |

4. Environment Discovery and Parameter Selection

Before executing the CNS workflow, I needed to understand the cluster environment to select appropriate parameters. Here's exactly how I discovered the necessary information:

4.1 Cluster Discovery Commands

# 1. Verify cluster connectivity
kubectl cluster-info
# Result: EKS cluster in us-east-1

# 2. List all nodes
kubectl get nodes -o wide
# Found 9 nodes with various roles

# 3. Find GPU nodes specifically
kubectl get nodes -l nvidia.com/gpu.present=true -o wide
# Found 2 nodes: ip-10-0-180-238 and ip-10-0-182-128

4.2 GPU Node Analysis

I examined the GPU nodes in detail:

kubectl describe node ip-10-0-180-238.ec2.internal | grep -E "(nvidia|gpu)" -i

Key findings from node labels:

  • nvidia.com/gpu.present=true - Confirms GPU presence
  • nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3 - H100 GPUs
  • nvidia.com/gpu.count=8 - 8 GPUs per node
  • nvidia.com/cuda.driver-version.full=570.133.20 - Driver version
  • nodeGroup=customer-gpu - This became my node selector
  • dedicated=user-workload - Node taint key/value

4.3 Parameter Selection Rationale

Based on the cluster discovery, I selected these parameters:

| Parameter | Value | How I Found It |
|---|---|---|
| --namespace | gpu-operator | Standard namespace for GPU components |
| --node-selector | nodeGroup=customer-gpu | Label on the GPU nodes |
| --accelerated-node-toleration | dedicated=user-workload:NoSchedule | Taint on the GPU nodes |
| --system-node-selector | nodeGroup=system-pool | Label on non-GPU nodes (from the system-cpu role) |
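
For reference, the selector and toleration values above can be double-checked with ordinary kubectl queries (nothing CNS-specific; the node name is the one found in section 4.1):

# Show the nodeGroup label for every node (customer-gpu vs system-pool)
kubectl get nodes -L nodeGroup

# Show the taints on a GPU node (expect dedicated=user-workload:NoSchedule)
kubectl get node ip-10-0-180-238.ec2.internal -o jsonpath='{.spec.taints}'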

5. Stage 1: Snapshot - Capturing System State

5.1 What the Snapshot Command Does

The snapshot command deploys a Kubernetes Job to a GPU node that:

  1. Creates RBAC resources (ServiceAccount, Role, RoleBinding)
  2. Runs a pod on a GPU node
  3. Collects comprehensive system information
  4. Writes results to a ConfigMap
  5. Cleans up the Job (with --cleanup flag)
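
While the agent Job is running, its progress can be followed with plain kubectl. The Job and ConfigMap names below (cns, cns-snapshot) are the ones reported in the CLI output in section 5.4; note the Job itself disappears once --cleanup runs:

# Watch the agent pod scheduled by the Job (Job pods carry the job-name label)
kubectl -n gpu-operator get pods -l job-name=cns -w

# Block until the Job completes (mirrors the CLI's 5-minute wait)
kubectl -n gpu-operator wait --for=condition=complete job/cns --timeout=5m

# Confirm the resulting ConfigMap exists
kubectl -n gpu-operator get configmap cns-snapshot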

5.2 The Command I Ran

./dist/cnsctl_darwin_arm64_v8.0/cnsctl snapshot \
    --deploy-agent \
    --namespace gpu-operator \
    --image ghcr.io/mchmarny/cns:latest \
    --node-selector nodeGroup=customer-gpu \
    --cleanup

5.3 Command Breakdown

| Flag | Purpose |
|---|---|
| --deploy-agent | Deploy a Kubernetes Job instead of taking a local snapshot |
| --namespace gpu-operator | Target namespace for the Job and ConfigMap |
| --image ghcr.io/mchmarny/cns:latest | Container image for the agent |
| --node-selector nodeGroup=customer-gpu | Schedule the Job on GPU nodes only |
| --cleanup | Remove the Job after completion (keep the ConfigMap) |

5.4 Output

deploying agent: namespace=gpu-operator
agent deployed successfully
waiting for Job completion: job=cns timeout=5m0s
job completed successfully
snapshot saved to ConfigMap: uri=cm://gpu-operator/cns-snapshot

5.5 What Gets Collected

The snapshot captures four categories of measurements:

| Type | Subtypes | Example Data |
|---|---|---|
| SystemD | containerd.service | CPUAccounting, MemoryAccounting, CgroupVersion |
| OS | grub, kmod, sysctl, release | Kernel parameters, loaded modules, OS version |
| K8s | server, image, policy | Kubernetes version, container images |
| GPU | smi, driver, device | CUDA version, driver version, GPU model |
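
To spot-check what was captured, individual fields can be pulled straight out of the stored snapshot; the keys below (driver-version, cuda-version, gpu.name) are the ones shown in Appendix A.5, and yq is the same tool used elsewhere in this report:

kubectl -n gpu-operator get cm cns-snapshot -o jsonpath='{.data.snapshot\.yaml}' \
    | yq . | grep -E 'driver-version|cuda-version|gpu\.name'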

6. Stage 2: Recipe - Generating Configuration Recommendations

6.1 What the Recipe Command Does

The recipe command:

  1. Loads the snapshot from the ConfigMap
  2. Automatically extracts criteria (service type, GPU type, OS) from the snapshot
  3. Applies CLI overrides (like --intent training)
  4. Matches criteria against embedded overlay rules
  5. Generates optimized component recommendations

6.2 The Command I Ran

./dist/cnsctl_darwin_arm64_v8.0/cnsctl recipe \
    --snapshot cm://gpu-operator/cns-snapshot \
    --intent training \
    --output /tmp/recipe.yaml

6.3 Automatic Criteria Extraction

The recipe command automatically detected from the snapshot:

| Criterion | Detected Value | How It Was Detected |
|---|---|---|
| service | eks | K8s version string contains -eks- |
| accelerator | h100 | GPU model contains H100 |
| os | ubuntu | /etc/os-release ID field |
| intent | training | Specified via the --intent flag |

The CLI log confirms the combined criteria:

building recipe from snapshot: criteria=criteria(service=eks, accelerator=h100, intent=training, os=ubuntu)
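
The same signals can be reproduced by hand. The commands below are only a rough illustration of the string matching described above, not the actual cnsctl detection code:

# service: the API server version string embeds the distribution ("-eks-")
kubectl version -o json | yq '.serverVersion.gitVersion'

# accelerator: the GPU product label contains the model name ("H100")
kubectl get nodes -l nvidia.com/gpu.present=true \
    -o jsonpath="{.items[0].metadata.labels['nvidia\.com/gpu\.product']}"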

6.4 The Generated Recipe

kind: recipeResult
apiVersion: cns.nvidia.com/v1alpha1
metadata:
  generatedAt: 2026-01-13T13:05:58.796166Z
  version: v0.17.2-next
  appliedOverlays:
    - h100-eks-training
criteria:
  service: eks
  accelerator: h100
  intent: training
  os: ubuntu
constraints:
  - name: K8s.server.version
    value: '>= 1.30'
  - name: OS.release.ID
    value: ubuntu
  - name: OS.release.VERSION_ID
    value: "24.04"
  - name: OS.sysctl./proc/sys/kernel/osrelease
    value: '>= 6.8'
componentRefs:
  - name: cert-manager
    version: v1.17.2
  - name: gpu-operator
    version: v25.10.1
    overrides:
      driver:
        version: 570.133.20  # Detected from cluster!
  - name: nvsentinel
    version: v0.6.0
  - name: skyhook
    version: v0.4.0
deploymentOrder:
  - cert-manager
  - gpu-operator
  - nvsentinel
  - skyhook

6.5 Key Observations

  1. Overlay Applied: The h100-eks-training overlay was selected based on criteria matching
  2. Version Lock: Driver version 570.133.20 was detected from the running cluster
  3. Deployment Order: Components are ordered by dependencies (cert-manager first)
  4. Constraints: Recipe includes validation constraints for later verification
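
One quick sanity check on observation 2: the driver version pinned in the recipe should match the version advertised by the GPU nodes (recipe path from section 6.2, label name from section 4.2):

# Version pinned in the recipe
yq '.componentRefs[] | select(.name == "gpu-operator") | .overrides.driver.version' /tmp/recipe.yaml

# Version reported by the GPU nodes
kubectl get nodes -l nvidia.com/gpu.present=true \
    -o jsonpath="{.items[*].metadata.labels['nvidia\.com/cuda\.driver-version\.full']}"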

7. Stage 3: Validate - Checking Compatibility

7.1 What the Validate Command Does

The validate command:

  1. Loads the recipe (constraints)
  2. Loads the snapshot (actual measurements)
  3. Compares each constraint against the actual values
  4. Reports pass/fail/skipped status for each

7.2 The Command I Ran

./dist/cnsctl_darwin_arm64_v8.0/cnsctl validate \
    --recipe /tmp/recipe.yaml \
    --snapshot cm://gpu-operator/cns-snapshot \
    --output /tmp/validation-results.yaml

7.3 Validation Results

summary:
  passed: 4
  failed: 0
  skipped: 0
  total: 4
  status: pass
  duration: 11.5µs
results:
  - name: K8s.server.version
    expected: '>= 1.30'
    actual: v1.34.1-eks-3025e55
    status: passed
  - name: OS.release.ID
    expected: ubuntu
    actual: ubuntu
    status: passed
  - name: OS.release.VERSION_ID
    expected: "24.04"
    actual: "24.04"
    status: passed
  - name: OS.sysctl./proc/sys/kernel/osrelease
    expected: '>= 6.8'
    actual: 6.8.0-1043-aws
    status: passed

7.4 Constraint Types

| Operator | Meaning | Example |
|---|---|---|
| >= | Greater than or equal (version) | >= 1.30 matches v1.34.1 |
| <= | Less than or equal (version) | <= 1.35 |
| == | Exact match | == ubuntu |
| != | Not equal | != rhel |
| (none) | Exact string match | ubuntu |
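
As a rough illustration only (cnsctl performs this comparison internally), a >= version constraint can be checked by hand with sort -V (requires a sort that supports -V, e.g. GNU coreutils):

# Does v1.34.1-eks-3025e55 satisfy ">= 1.30"?
actual="v1.34.1-eks-3025e55"
required="1.30"
lowest=$(printf '%s\n' "${actual#v}" "$required" | sort -V | head -n1)
[ "$lowest" = "$required" ] && echo "constraint '>= $required' satisfied by $actual"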

8. Stage 4: Bundle - Creating Deployment Artifacts

8.1 What the Bundle Command Does

The bundle command:

  1. Loads the recipe
  2. Invokes registered bundlers for each component
  3. Generates Helm values, manifests, and scripts
  4. Applies node selectors and tolerations
  5. Generates deployer-specific artifacts (ArgoCD, Flux, or scripts)

8.2 The Command I Ran

./dist/cnsctl_darwin_arm64_v8.0/cnsctl bundle \
    --recipe /tmp/recipe.yaml \
    --output /tmp/bundles \
    --system-node-selector nodeGroup=system-pool \
    --accelerated-node-selector nodeGroup=customer-gpu \
    --accelerated-node-toleration dedicated=user-workload:NoSchedule \
    --deployer argocd

8.3 Command Breakdown

| Flag | Purpose |
|---|---|
| --recipe | Path to the generated recipe |
| --output | Directory for generated artifacts |
| --system-node-selector | Where to run control-plane components (operator controllers) |
| --accelerated-node-selector | Where to run GPU workloads (device plugin, driver pods) |
| --accelerated-node-toleration | Taints to tolerate on GPU nodes |
| --deployer argocd | Generate ArgoCD Application manifests |

8.4 Generated Structure

bundles/
├── app-of-apps.yaml              # Parent ArgoCD Application
├── recipe.yaml                   # Copy of input recipe
├── README.md                     # Deployment instructions
├── cert-manager/
│   ├── values.yaml               # Helm values
│   ├── argocd/application.yaml   # ArgoCD Application
│   ├── scripts/{install,uninstall}.sh
│   ├── checksums.txt
│   └── README.md
├── gpu-operator/
│   ├── values.yaml               # Helm values with node selectors
│   ├── argocd/application.yaml   # ArgoCD Application (sync-wave: 1)
│   ├── manifests/                # Additional K8s manifests
│   │   ├── clusterpolicy.yaml
│   │   └── dcgm-exporter.yaml
│   ├── scripts/{install,uninstall}.sh
│   └── README.md
├── nvsentinel/
│   └── ... (similar structure)
└── skyhook/
    └── ... (similar structure)

8.5 Node Selector Application

The bundle command applied the node selectors I specified:

GPU Operator values.yaml (excerpt):

daemonsets:
  nodeSelector:
    nodeGroup: customer-gpu        # My --accelerated-node-selector
  tolerations:
    - effect: NoSchedule
      key: dedicated
      operator: Equal
      value: user-workload         # My --accelerated-node-toleration

operator:
  nodeSelector:
    nodeGroup: system-pool         # My --system-node-selector

9. What Happens After Stage 4?

After generating the bundles, the next steps depend on your deployment method:

9.1 ArgoCD Deployment (What We Generated)

# 1. Push bundles to Git repository
cd /tmp/bundles
git init
git add .
git commit -m "Add CNS bundle"
git remote add origin https://github.com/your-org/your-gitops-repo.git
git push -u origin main

# 2. Update Git URLs in ArgoCD manifests
#    (GNU sed shown; on macOS/BSD sed use: sed -i '' ...)
sed -i 's|<YOUR_GIT_REPO>|https://github.com/your-org/your-gitops-repo.git|g' \
    app-of-apps.yaml \
    */argocd/application.yaml

# 3. Deploy the app-of-apps
kubectl apply -f app-of-apps.yaml

# 4. Monitor deployment
argocd app list
argocd app sync cns-bundle

9.2 Manual Script Deployment (Alternative)

If using --deployer script:

# Deploy each component in order
cd bundles/cert-manager && ./scripts/install.sh
cd ../gpu-operator && ./scripts/install.sh
cd ../nvsentinel && ./scripts/install.sh
cd ../skyhook && ./scripts/install.sh

9.3 Verify Deployment

# Check all pods are running
kubectl get pods -A | grep -E "(gpu-operator|nvsentinel|skyhook|cert-manager)"

# Verify GPU nodes are ready
kubectl get nodes -l nvidia.com/gpu.present=true

# Test GPU access (kubectl v1.24+ removed --limits from `kubectl run`, so the
# GPU request is passed via --overrides; on this cluster the pod also needs a
# toleration for the dedicated=user-workload:NoSchedule taint)
kubectl run gpu-test --rm -it --restart=Never \
    --image=nvidia/cuda:12.0.0-base-ubuntu22.04 \
    --overrides='{"apiVersion":"v1","spec":{"tolerations":[{"key":"dedicated","operator":"Equal","value":"user-workload","effect":"NoSchedule"}],"containers":[{"name":"gpu-test","image":"nvidia/cuda:12.0.0-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}},"stdin":true,"tty":true}]}}'

9.4 Ongoing Operations

After initial deployment:

  1. Configuration Drift Detection: Periodically re-run snapshot + validate to detect drift (see the sketch after this list)
  2. Upgrades: Generate new recipe when upgrading components, validate, then deploy new bundle
  3. Scaling: Add new GPU nodes; CNS ensures consistent configuration
  4. Troubleshooting: Use snapshots to compare working vs. broken nodes
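
A minimal drift-check sketch, reusing the commands and file locations from earlier in this report (and the cnsctl alias from Appendix B); only the /tmp/drift-check.yaml output path is new here:

# Re-capture the current state of the GPU nodes
cnsctl snapshot --deploy-agent --namespace gpu-operator \
    --image ghcr.io/mchmarny/cns:latest \
    --node-selector nodeGroup=customer-gpu --cleanup

# Re-validate the fresh snapshot against the recipe that was deployed
cnsctl validate --recipe /tmp/recipe.yaml \
    --snapshot cm://gpu-operator/cns-snapshot \
    --output /tmp/drift-check.yaml

# Alert on anything other than an overall "pass"
if [ "$(yq '.summary.status' /tmp/drift-check.yaml)" != "pass" ]; then
    echo "configuration drift detected:"
    yq '.results[] | select(.status != "passed")' /tmp/drift-check.yaml
fi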

10. Key Architectural Concepts

10.1 ConfigMap URI Scheme

CNS uses a URI scheme for Kubernetes-native storage:

cm://namespace/name

Examples:

  • cm://gpu-operator/cns-snapshot - Snapshot stored in ConfigMap
  • cm://gpu-operator/cns-recipe - Recipe stored in ConfigMap

This enables:

  • No persistent volumes required
  • Agent writes directly to Kubernetes API
  • CLI can read/write across clusters with kubeconfig
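
In practice a cm:// URI is just a namespace/name pair, so it maps directly onto a kubectl read. The parsing below is a shell sketch; the snapshot.yaml data key is the one used by the snapshot ConfigMap (see Appendix A), and other ConfigMaps may use a different key:

uri="cm://gpu-operator/cns-snapshot"
path="${uri#cm://}"          # -> gpu-operator/cns-snapshot
namespace="${path%%/*}"      # -> gpu-operator
name="${path#*/}"            # -> cns-snapshot
kubectl -n "$namespace" get configmap "$name" -o jsonpath='{.data.snapshot\.yaml}'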

10.2 Overlay System

Recipes are built from a base + overlays:

Base Measurements (universal)
    ↓
+ Overlay: service=eks
    ↓
+ Overlay: gpu=h100
    ↓
+ Overlay: intent=training
    ↓
= Final Recipe

10.3 Deployer Pattern

Three deployment methods are available:

| Deployer | Generated Artifacts | Use Case |
|---|---|---|
| script | Shell scripts | Manual deployment, testing |
| argocd | ArgoCD Applications | GitOps with ArgoCD |
| flux | Flux HelmReleases | GitOps with Flux |

10.4 Sync Waves (ArgoCD)

ArgoCD uses sync-wave annotations for ordering:

# cert-manager deploys first (wave 0)
annotations:
  argocd.argoproj.io/sync-wave: "0"

# gpu-operator deploys second (wave 1)
annotations:
  argocd.argoproj.io/sync-wave: "1"
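
To confirm the ordering the bundler produced, the annotations can be listed straight from the generated bundle (paths as shown in sections 8.2 and 8.4):

grep "sync-wave" /tmp/bundles/*/argocd/application.yaml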

Appendix A: ConfigMap Contents

The snapshot ConfigMap contains the complete system state captured from the GPU node.

Viewing the ConfigMap

kubectl -n gpu-operator get cm cns-snapshot -o jsonpath='{.data.snapshot\.yaml}' | yq .

Key Sections of the Snapshot

A.1 Metadata

kind: Snapshot
apiVersion: cns.nvidia.com/v1alpha1
metadata:
    source-node: ip-10-0-182-128.ec2.internal
    timestamp: "2026-01-13T13:02:29Z"
    version: 0.17.2

A.2 OS Release Information

- type: OS
  subtypes:
    - subtype: release
      data:
        ID: ubuntu
        VERSION_ID: "24.04"
        VERSION_CODENAME: noble
        PRETTY_NAME: Ubuntu 24.04.3 LTS

A.3 Kernel Information

    - subtype: sysctl
      data:
        /proc/sys/kernel/osrelease: 6.8.0-1043-aws
        /proc/sys/kernel/hostname: ip-10-0-182-128.ec2.internal

A.4 Kubernetes Server Version

- type: K8s
  subtypes:
    - subtype: server
      data:
        version: v1.34.1-eks-3025e55
        platform: linux/amd64

A.5 GPU Information

- type: GPU
  subtypes:
    - subtype: smi
      data:
        driver-version: "570.133.20"
        cuda-version: "12.8"
        gpu.count: "8"
        gpu.name: NVIDIA H100 80GB HBM3
        gpu.memory: "81559 MiB"

Appendix B: Complete Command Reference

B.1 Building cnsctl from Source

# Prerequisites: Go 1.22+, goreleaser

# Clone repository
git clone https://github.com/mchmarny/cloud-native-stack.git
cd cloud-native-stack

# Build
make build

# Binary location
./dist/cnsctl_darwin_arm64_v8.0/cnsctl -v

B.2 Full E2E Command Sequence

# Set alias for convenience
alias cnsctl='./dist/cnsctl_darwin_arm64_v8.0/cnsctl'

# Stage 1: Snapshot
cnsctl snapshot \
    --deploy-agent \
    --namespace gpu-operator \
    --image ghcr.io/mchmarny/cns:latest \
    --node-selector nodeGroup=customer-gpu \
    --cleanup

# View snapshot
kubectl -n gpu-operator get cm cns-snapshot -o jsonpath='{.data.snapshot\.yaml}' | yq .

# Stage 2: Recipe
cnsctl recipe \
    --snapshot cm://gpu-operator/cns-snapshot \
    --intent training \
    --output recipe.yaml

# Stage 3: Validate
cnsctl validate \
    --recipe recipe.yaml \
    --snapshot cm://gpu-operator/cns-snapshot \
    --output validation-results.yaml

# Stage 4: Bundle
cnsctl bundle \
    --recipe recipe.yaml \
    --output ./bundles \
    --system-node-selector nodeGroup=system-pool \
    --accelerated-node-selector nodeGroup=customer-gpu \
    --accelerated-node-toleration dedicated=user-workload:NoSchedule \
    --deployer argocd

B.3 Alternative: Recipe from Parameters (No Snapshot)

# Generate recipe directly from parameters
cnsctl recipe \
    --service eks \
    --accelerator h100 \
    --os ubuntu \
    --intent training \
    --output recipe.yaml

Appendix C: Generated Bundle Files

C.1 GPU Operator values.yaml (Complete)

# GPU Operator Helm Values
# Generated from Cloud Native Stack Recipe
# Timestamp: 2026-01-13T08:06:32-05:00

cdi:
  default: false
  enabled: true
daemonsets:
  nodeSelector:
    nodeGroup: customer-gpu
  tolerations:
    - effect: NoSchedule
      key: dedicated
      operator: Equal
      value: user-workload
dcgm:
  enabled: true
dcgmExporter:
  config:
    create: true
    name: dcgm-exporter
  serviceMonitor:
    enabled: true
    interval: 60s
devicePlugin:
  env:
    - name: DP_DISABLE_HEALTHCHECKS
      value: "109"
    - name: DEVICE_LIST_STRATEGY
      value: volume-mounts
driver:
  enabled: true
  kernelModuleConfig:
    name: kernel-module-params
  maxParallelUpgrades: 5
  rdma:
    enabled: true
  useOpenKernelModules: true
  version: 570.133.20
gdrcopy:
  enabled: false
  version: v2.5
gfd:
  enabled: true
hostPaths:
  driverInstallDir: /run/nvidia/driver
migManager:
  enabled: true
node-feature-discovery:
  gc:
    nodeSelector:
      nodeGroup: system-pool
    tolerations:
      - operator: Exists
  master:
    nodeSelector:
      nodeGroup: system-pool
    tolerations:
      - operator: Exists
  worker:
    nodeSelector:
      nodeGroup: customer-gpu
    tolerations:
      - effect: NoSchedule
        key: dedicated
        operator: Equal
        value: user-workload
operator:
  nodeSelector:
    nodeGroup: system-pool
  resources:
    limits:
      cpu: 500m
      memory: 700Mi
    requests:
      cpu: 200m
      memory: 300Mi
  tolerations:
    - operator: Exists
  upgradeCRD: true
toolkit:
  enabled: true
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"

C.2 ArgoCD Application (GPU Operator)

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  sources:
    # Helm chart source with values from bundle
    - repoURL: 'https://helm.ngc.nvidia.com/nvidia'
      targetRevision: 'v25.10.1'
      chart: gpu-operator
      helm:
        releaseName: gpu-operator
        valueFiles:
          - $values/gpu-operator/values.yaml
    # Reference to Git repository for values.yaml file
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      ref: values
    # Additional manifests from the component's manifests directory
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: gpu-operator/manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true

C.3 App-of-Apps (Parent Application)

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cns-bundle
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  sources:
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: cert-manager/argocd
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: gpu-operator/argocd
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: nvsentinel/argocd
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: skyhook/argocd
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Summary

Cloud Native Stack provides a robust, automated approach to GPU infrastructure configuration. The four-stage workflow (Snapshot → Recipe → Validate → Bundle) ensures that:

  1. Configuration is captured accurately from running systems
  2. Recommendations are drawn from known-good, validated configurations
  3. Compatibility is verified before deployment
  4. Artifacts are generated ready for GitOps workflows

This approach eliminates manual configuration drift, ensures consistency across environments, and provides a foundation for reliable GPU-accelerated Kubernetes operations.


Report generated by Claude Opus 4.5 analyzing the Cloud Native Stack codebase and executing the E2E demo workflow.
