This guide outlines the procedure for debugging NVIDIA graphics cards on a Talos Linux node. Due to Talos Linux's immutable and secure nature, direct driver installation and typical debugging steps are not possible. Instead, we leverage Kubernetes features and NVIDIA's containerized drivers.
Procedure:
- Run the Debug Pod:
  Execute the following `kubectl` command to launch a privileged debug pod on the target Talos Linux node. This pod will contain the NVIDIA driver binaries.

      kubectl debug \
        -n kube-system \
        -it \
        node/your-node \
        --profile=sysadmin \
        --image nvcr.io/nvidia/driver:DRIVER_VERSION \
        -- /bin/bash

  Notes:
  - Replace `DRIVER_VERSION` with the specific NVIDIA driver version compatible with your Talos Linux NVIDIA extension. For Talos Linux 1.10.5, the driver version is `570.148.08-ubuntu24.04`.
  - Replace `your-node` with the actual name of the Talos Linux node you wish to debug.
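Because the node name and the driver tag are the only two values that change between runs, it can help to assemble the command from variables. The following is a minimal sketch, not part of the original procedure: `build_debug_cmd` is a hypothetical helper name, and it only prints the command so that it can be reviewed before it is run.

```shell
# Hypothetical helper: prints the kubectl debug command for a node and driver tag.
# $1 = node name, $2 = driver image tag (both are placeholders you supply).
build_debug_cmd() {
  node="$1"
  driver_tag="$2"
  printf 'kubectl debug -n kube-system -it node/%s --profile=sysadmin --image nvcr.io/nvidia/driver:%s -- /bin/bash\n' \
    "$node" "$driver_tag"
}

# Example values taken from the notes above (Talos Linux 1.10.5).
build_debug_cmd "your-node" "570.148.08-ubuntu24.04"
```

Printing rather than executing keeps the sketch safe to try on any machine; copy the printed line into a terminal once the values look right.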
- Install Driver (without Kernel Module):
  Once inside the debug pod's shell, run the installer with the `--no-kernel-module` flag. This step installs the user-space utilities and libraries without attempting to load a kernel module, as the kernel module should already be provided by the Talos Linux NVIDIA extension.

      ./NVIDIA-Linux-x86_64-DRIVER_VERSION.run --no-kernel-module

  - Replace `DRIVER_VERSION` with the exact version from the image name (e.g., `570.148.08`).
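The installer file name uses the bare driver version, while the image tag carries an OS suffix. As a small sketch, assuming the tag always has the form `<driver-version>-<os-suffix>` (as in `570.148.08-ubuntu24.04`), the version can be derived with shell parameter expansion. The function name `driver_version_from_tag` is hypothetical.

```shell
# Assumption: the image tag looks like <driver-version>-<os-suffix>.
driver_version_from_tag() {
  # Remove everything from the first '-' onward, leaving only the version.
  printf '%s\n' "${1%%-*}"
}

DRIVER_TAG="570.148.08-ubuntu24.04"
DRIVER_VERSION="$(driver_version_from_tag "$DRIVER_TAG")"

# Print the installer command for review instead of running it directly.
echo "./NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run --no-kernel-module"
```

If a tag does not follow this pattern, set `DRIVER_VERSION` by hand instead of relying on the expansion.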
- Debug with NVIDIA Utilities:
  After the driver installation completes within the pod, you can use NVIDIA utilities like `nvidia-smi` to inspect the GPU status and troubleshoot.

      nvidia-smi -q
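The full `-q` report is long, so a shorter first pass is sometimes convenient. The sketch below wraps two other `nvidia-smi` invocations in a hypothetical function, `gpu_summary`, which first checks that the binary is on the `PATH`. The choice of queried fields is only an illustration, not something this guide prescribes.

```shell
# Hypothetical wrapper: a quick GPU overview, with a guard for a missing binary.
gpu_summary() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found; has the installer step completed?" >&2
    return 1
  fi
  # List the detected GPUs with their UUIDs.
  nvidia-smi -L
  # Compact CSV view of a few commonly checked fields.
  nvidia-smi --query-gpu=name,driver_version,temperature.gpu,memory.used,memory.total --format=csv
}
```

Call `gpu_summary` inside the debug pod. If it reports that `nvidia-smi` is missing, the user-space installation from the previous step likely did not finish.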