w1ndy/talos-nvidia-debug.md

Debugging NVIDIA Graphics Cards on Talos Linux

This guide outlines a procedure for debugging NVIDIA graphics cards on a Talos Linux node. Because Talos Linux is immutable and locked down, you cannot install a driver or run the usual debugging tools directly on the node. Instead, the procedure uses a privileged Kubernetes debug pod and NVIDIA's containerized driver image.
Procedure:


Run the Debug Pod:
Execute the following kubectl command to launch a privileged debug pod on the target Talos Linux node. This pod will contain the NVIDIA driver binaries.
kubectl debug \
  -n kube-system \
  -it \
  node/your-node \
  --profile=sysadmin \
  --image nvcr.io/nvidia/driver:DRIVER_VERSION \
  -- /bin/bash
Notes:

Replace DRIVER_VERSION with the image tag whose driver version matches the one shipped by your Talos Linux NVIDIA extension. For Talos Linux 1.10.5, the tag is 570.148.08-ubuntu24.04 (driver version 570.148.08).
Replace your-node with the actual name of the Talos Linux node you wish to debug.
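The two substitutions above can be kept in one place with a small helper. This is a minimal sketch: `debug_cmd` is a hypothetical function name, and it only prints the command from the step above so it can be reviewed before being run.

```shell
#!/bin/sh
# Sketch: assemble the `kubectl debug` invocation from a node name and an
# image tag. `debug_cmd` is a hypothetical helper, not part of kubectl.

# debug_cmd NODE IMAGE_TAG
# Prints the command rather than executing it, so nothing touches the cluster.
debug_cmd() {
  node="$1"
  tag="$2"
  printf 'kubectl debug -n kube-system -it node/%s --profile=sysadmin --image nvcr.io/nvidia/driver:%s -- /bin/bash\n' \
    "$node" "$tag"
}

# Placeholder values; substitute your own node name and image tag.
debug_cmd your-node 570.148.08-ubuntu24.04
```

Printing instead of executing is a deliberate choice here: the debug pod is privileged, so it is worth checking the node name and image tag before launching it.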


Install Driver (without Kernel Module):
Once inside the debug pod's shell, run the installer with the --no-kernel-module flag. This installs the user-space utilities and libraries but skips the kernel module, which the Talos Linux NVIDIA extension should already provide.
./NVIDIA-Linux-x86_64-DRIVER_VERSION.run --no-kernel-module

Replace DRIVER_VERSION with the exact version from the image name (e.g., 570.148.08).
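The installer file name can be derived from the image tag rather than typed by hand. The sketch below assumes the tag has the form `<driver version>-<os>`, as in the 570.148.08-ubuntu24.04 example; `installer_name` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: derive the installer file name from the image tag.
# Assumes the tag looks like "<driver version>-<os>"; tags in another
# format would need different handling.

# installer_name IMAGE_TAG
installer_name() {
  version="${1%%-*}"   # drop everything from the first "-" onward
  printf 'NVIDIA-Linux-x86_64-%s.run\n' "$version"
}

installer_name 570.148.08-ubuntu24.04
# prints: NVIDIA-Linux-x86_64-570.148.08.run
```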


Debug with NVIDIA Utilities:
Once the installation completes, use NVIDIA utilities such as nvidia-smi inside the pod to inspect the GPU status and troubleshoot.
nvidia-smi -q
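The full `-q` report is long. For a quick check, a narrower query can be easier to read. The following is a sketch only: `gpu_summary` and the `NVIDIA_SMI` override variable are hypothetical conveniences, while `--query-gpu` and `--format=csv` are standard nvidia-smi options.

```shell
#!/bin/sh
# Sketch: print a compact per-GPU summary instead of the full -q report.
# `gpu_summary` and the NVIDIA_SMI variable are hypothetical; NVIDIA_SMI
# only lets you point at a different binary path if needed.

gpu_summary() {
  smi="${NVIDIA_SMI:-nvidia-smi}"
  # Fail early with a hint if the user-space tools are not installed yet.
  if ! command -v "$smi" >/dev/null 2>&1; then
    echo "$smi not found; run the installer step first" >&2
    return 1
  fi
  "$smi" --query-gpu=name,driver_version,temperature.gpu,memory.used,memory.total --format=csv
}
```

Run `gpu_summary` inside the debug pod after the installer has finished. If it reports that the binary is missing, the user-space installation step has not completed.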