@w1ndy
Last active July 11, 2025 10:02

Debugging NVIDIA Graphics Cards on Talos Linux

This guide describes how to debug NVIDIA graphics cards on a Talos Linux node. Because Talos Linux is immutable and locked down, you cannot install drivers or run the usual debugging tools directly on the host. Instead, we use a Kubernetes debug pod together with NVIDIA's containerized driver image.

Procedure:

  1. Run the Debug Pod:

    Run the following kubectl command to launch a privileged debug pod on the target Talos Linux node. The pod's image ships the NVIDIA driver installer used in the next step.

    kubectl debug \
      -n kube-system \
      -it \
      node/your-node \
      --profile=sysadmin \
      --image nvcr.io/nvidia/driver:DRIVER_VERSION \
      -- /bin/bash

    Notes:

    • Replace DRIVER_VERSION with the image tag for the NVIDIA driver version that matches your Talos Linux NVIDIA extension. For Talos Linux 1.10.5, the tag is 570.148.08-ubuntu24.04 (driver 570.148.08).
    • Replace your-node with the actual name of the Talos Linux node you wish to debug.
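
    As a convenience, the command above can be wrapped in small shell functions so that the node name and image tag are passed as arguments. This is only a sketch: debug_image and node_debug_cmd are hypothetical helper names, not part of kubectl or Talos, and the command is printed rather than executed so you can review it before running it.

    ```shell
    # Hypothetical helpers (not part of kubectl or Talos).

    # Build the driver image reference from an image tag,
    # e.g. 570.148.08-ubuntu24.04.
    debug_image() {
      printf 'nvcr.io/nvidia/driver:%s\n' "$1"
    }

    # Print the kubectl debug command for a node name and image tag.
    # Printing instead of executing lets you check it first.
    node_debug_cmd() {
      node="$1"
      tag="$2"
      printf 'kubectl debug -n kube-system -it node/%s --profile=sysadmin --image %s -- /bin/bash\n' \
        "$node" "$(debug_image "$tag")"
    }

    # Example: node_debug_cmd your-node 570.148.08-ubuntu24.04
    ```

    Copy the printed command into your terminal once the node name and tag look right.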
  2. Install Driver (without Kernel Module):

    Once inside the debug pod's shell, run the installer with the --no-kernel-module flag. This installs the user-space utilities and libraries without attempting to build or install a kernel module, because the kernel module should already be provided by the Talos Linux NVIDIA extension.

    ./NVIDIA-Linux-x86_64-DRIVER_VERSION.run --no-kernel-module

    Notes:

    • Replace DRIVER_VERSION with the driver version from the image tag, without the distribution suffix (e.g., 570.148.08 for the tag 570.148.08-ubuntu24.04).
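
    The installer file name can also be derived from the image tag instead of being typed by hand. The sketch below assumes the tag has the form <driver-version>-<distro>, as in 570.148.08-ubuntu24.04; installer_name is a hypothetical helper, not something shipped in the image.

    ```shell
    # Hypothetical helper: derive the installer file name from the image tag.
    # Assumes the tag looks like <driver-version>-<distro>,
    # e.g. 570.148.08-ubuntu24.04.
    installer_name() {
      tag="$1"
      # Drop everything from the first "-" onward, leaving the driver version.
      version="${tag%%-*}"
      printf 'NVIDIA-Linux-x86_64-%s.run\n' "$version"
    }

    # Example (inside the debug pod):
    #   ./"$(installer_name 570.148.08-ubuntu24.04)" --no-kernel-module
    ```

    If the tag for your release follows a different pattern, check the file name in the pod's working directory rather than relying on this helper.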
  3. Debug with NVIDIA Utilities:

    After the installation completes, use NVIDIA utilities such as nvidia-smi inside the pod to inspect the GPU status and troubleshoot.

    nvidia-smi -q
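
    Beyond the full -q dump, a few narrower queries are often easier to read. The function below is a sketch that bundles some of them; gpu_report is a hypothetical name, and it assumes the host's NVIDIA kernel module is loaded so that /proc/driver/nvidia/version exists. Comparing that kernel module version with the driver_version column is a quick way to spot a mismatch between the Talos extension and the user-space libraries installed in the pod.

    ```shell
    # Hypothetical wrapper around a few standard nvidia-smi queries.
    gpu_report() {
      if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; run the installer from step 2 first" >&2
        return 1
      fi

      # Kernel module version loaded on the host
      # (assumption: this file is present when the module is loaded).
      cat /proc/driver/nvidia/version

      # One line per GPU, including its UUID.
      nvidia-smi -L

      # Compact CSV summary of commonly checked fields.
      nvidia-smi \
        --query-gpu=name,driver_version,temperature.gpu,utilization.gpu,memory.used,memory.total \
        --format=csv
    }

    # Example (inside the debug pod): gpu_report
    ```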