@sin3point14
Created June 25, 2025 07:22
Eigen cuda inverse test
#include <Eigen/Dense>
#include <iostream>

using mat = Eigen::Matrix<float, 5, 5>;

// Print a matrix row by row from device code, since operator<< is host-only.
template <typename MatrixType>
__device__ void printCudaMatrix(const MatrixType& matrix) {
    for (int row = 0; row < matrix.rows(); ++row) {
        printf("  Row %3d: [", row);
        for (int col = 0; col < matrix.cols(); ++col) {
            printf("%15.8e ", (double)matrix(row, col));
        }
        printf("]\n");
    }
}

// Invert *A in place on the device, printing the matrix at each step.
__global__ void matrixInv(mat* A) {
    printf("kernel A:\n");
    printCudaMatrix(*A);
    mat B = (*A).inverse();
    printf("kernel B:\n");
    printCudaMatrix(B);
    *A = B;
    printf("kernel Ainv:\n");
    printCudaMatrix(*A);
}

int main() {
    srand(0);
    mat* d_A;
    mat h_A = mat::Random();
    std::cout << "CPU:\n" << h_A << std::endl;

    // Copy the matrix to the device, invert it there, and copy it back.
    cudaMalloc((void**)&d_A, sizeof(mat));
    cudaMemcpy(d_A, &h_A, sizeof(mat), cudaMemcpyHostToDevice);
    matrixInv<<<1, 1>>>(d_A);
    cudaDeviceSynchronize();
    mat h_Ainv;
    cudaMemcpy(&h_Ainv, d_A, sizeof(mat), cudaMemcpyDeviceToHost);

    // Compare the CPU inverse against what the GPU produced.
    std::cout << "CPU final inv:\n" << h_A.inverse() << std::endl;
    std::cout << "GPU final inv:\n" << h_Ainv << std::endl;
    cudaFree(d_A);
    return 0;
}
sin3point14 (Author) commented Jun 25, 2025

using nvcc:

$ nvcc -I/usr/include/eigen3 main.cu
...
$ ./a.out 
CPU:
  0.680375  -0.604897 -0.0452059    0.83239  -0.967399
 -0.211234  -0.329554   0.257742   0.271423  -0.514226
  0.566198   0.536459  -0.270431   0.434594  -0.725537
   0.59688  -0.444451  0.0268018  -0.716795   0.608353
  0.823295    0.10794   0.904459   0.213938  -0.686642
kernel A:
  Row   0: [ 6.80375457e-01 -6.04897261e-01 -4.52058911e-02  8.32390189e-01 -9.67398882e-01 ]
  Row   1: [-2.11234152e-01 -3.29554498e-01  2.57741809e-01  2.71423459e-01 -5.14226437e-01 ]
  Row   2: [ 5.66198468e-01  5.36459208e-01 -2.70431042e-01  4.34593916e-01 -7.25536823e-01 ]
  Row   3: [ 5.96880078e-01 -4.44450557e-01  2.68018246e-02 -7.16794848e-01  6.08353496e-01 ]
  Row   4: [ 8.23294759e-01  1.07939959e-01  9.04459476e-01  2.13937759e-01 -6.86641812e-01 ]
kernel B:
  Row   0: [-6.86641693e-01 -6.86641693e-01 -6.86641693e-01 -6.86641693e-01 -6.86641693e-01 ]
  Row   1: [-6.86641693e-01 -6.86641693e-01 -6.86641693e-01 -6.86641693e-01 -6.86641693e-01 ]
  Row   2: [-6.86641693e-01 -6.86641693e-01 -6.86641693e-01 -6.86641693e-01 -6.86641693e-01 ]
  Row   3: [-6.86641693e-01 -6.86641693e-01 -6.86641693e-01 -6.86641693e-01 -6.86641693e-01 ]
  Row   4: [-6.86641693e-01 -6.86641693e-01 -6.86641693e-01 -6.86641693e-01 -6.86641693e-01 ]
kernel Ainv:
  Row   0: [ 6.80375457e-01 -6.04897261e-01 -4.52058911e-02  8.32390189e-01 -9.67398882e-01 ]
  Row   1: [-2.11234152e-01 -3.29554498e-01  2.57741809e-01  2.71423459e-01 -5.14226437e-01 ]
  Row   2: [ 5.66198468e-01  5.36459208e-01 -2.70431042e-01  4.34593916e-01 -7.25536823e-01 ]
  Row   3: [ 5.96880078e-01 -4.44450557e-01  2.68018246e-02 -7.16794848e-01  6.08353496e-01 ]
  Row   4: [ 8.23294759e-01  1.07939959e-01  9.04459476e-01  2.13937759e-01 -6.86641812e-01 ]
CPU final inv:
   0.47532   -1.06826  0.0343827   0.266967   0.330546
 -0.514907  -0.623726   0.448218  -0.472822   0.300033
-0.0461697  -0.459988  -0.962537  -0.524217   0.962145
   1.46381   -2.99707   -1.72187   -1.73564   0.463828
  0.884239   -2.91862   -1.69268  -0.985516   0.399005
GPU final inv:
  0.680375  -0.604897 -0.0452059    0.83239  -0.967399
 -0.211234  -0.329554   0.257742   0.271423  -0.514226
  0.566198   0.536459  -0.270431   0.434594  -0.725537
   0.59688  -0.444451  0.0268018  -0.716795   0.608353
  0.823295    0.10794   0.904459   0.213938  -0.686642

Eigen's inverse() on the GPU gives back the same matrix, and there are no compilation or runtime errors.
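
Since the kernel fails silently, a host-side sanity check would catch this. Here is a minimal sketch (looksLikeInverse is a hypothetical helper, not part of the gist): a correct inverse leaves the residual ||A * Ainv - I|| near zero, and the unmodified matrix that comes back from the 5x5 kernel fails that check loudly.

#include <Eigen/Dense>
#include <iostream>

using mat = Eigen::Matrix<float, 5, 5>;

// Returns true if Ainv behaves like an inverse of A, i.e. the residual
// ||A * Ainv - I|| is small. The GPU result above would return false here.
bool looksLikeInverse(const mat& A, const mat& Ainv, float tol = 1e-4f) {
    float residual = (A * Ainv - mat::Identity()).norm();
    std::cout << "||A * Ainv - I|| = " << residual << std::endl;
    return residual < tol;
}

Calling looksLikeInverse(h_A, h_Ainv) right after the device-to-host memcpy would turn the silent failure into a visible one.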

sin3point14 (Author) commented Jun 26, 2025

I tried seeing what happens to the generated PTX for various matrix inverses at https://godbolt.org/z/hP6zeEh5E, for 1x1 through 4x4 matrices, using the NVCC 12.5.1 compiler with -arch=sm_86 -O2. For those sizes, matrixInv in the PTX view contains a body that grows with the matrix dimension, so I assume the inverse is actually being computed. However, when I switch to 5x5, the function body is empty and simply returns:

.visible .entry matrixInv(Eigen::Matrix<float, 5, 5, 0, 5, 5>*)(
	.param .u64 matrixInv(Eigen::Matrix<float, 5, 5, 0, 5, 5>*)_param_0
)
{
	ret;
}
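
This is consistent with Eigen's dispatch in InverseImpl.h: sizes up to 4x4 get analytic, device-compatible specializations, while anything larger falls back to partialPivLu(), which is host-only (see the clang error in the next comment). For a side-by-side comparison, one could instantiate the same kernel template at both sizes in one translation unit and diff the PTX. A sketch of that setup (invKernel and the explicit instantiations are mine, not from the gist; assumes Eigen 3.4 on the include path), which under the flags above should show a non-empty body for 4x4 and an empty one for 5x5:

#include <Eigen/Dense>

// Same kernel, two sizes: the 4x4 instantiation gets Eigen's analytic
// cofactor-based inverse, while the 5x5 one hits the host-only PartialPivLU
// fallback, for which nvcc silently emits an empty body.
template <int N>
__global__ void invKernel(Eigen::Matrix<float, N, N>* A) {
    *A = A->inverse();
}

// Explicit instantiations so both kernels appear in the PTX.
template __global__ void invKernel<4>(Eigen::Matrix<float, 4, 4>*);
template __global__ void invKernel<5>(Eigen::Matrix<float, 5, 5>*);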

sin3point14 (Author) commented:
Compiling with clang++, however, throws an error as expected. I used the trunk compiler targeting sm_100a with CUDA 12.8.1 and -O2. Somehow adding --cuda-gpu-arch=sm_86 throws some error, so I just skipped it.

In file included from <source>:1:
In file included from /opt/compiler-explorer/libs/eigen/v3.4.0/Eigen/Dense:2:
In file included from /opt/compiler-explorer/libs/eigen/v3.4.0/Eigen/LU:39:
/opt/compiler-explorer/libs/eigen/v3.4.0/Eigen/src/LU/InverseImpl.h:28:21: error: reference to __host__ function 'partialPivLu' in __host__ __device__ function
   28 |     result = matrix.partialPivLu().inverse();
      |                     ^
...

As expected, clang++ refuses to compile the host-only partialPivLu() path on the GPU, instead of silently emitting an empty kernel the way nvcc does.
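
If a device-side inverse above 4x4 is actually needed, one possible workaround (a hand-rolled sketch, not Eigen's API; gaussJordanInverse is my name) is Gauss-Jordan elimination with partial pivoting written directly as a __device__ function, using only per-coefficient access, which is device-safe:

#include <Eigen/Dense>

// Gauss-Jordan elimination with partial pivoting, written so that every
// operation is per-coefficient and therefore safe in device code.
// Returns false if a zero pivot is hit (singular matrix).
template <int N>
__device__ bool gaussJordanInverse(const Eigen::Matrix<float, N, N>& in,
                                   Eigen::Matrix<float, N, N>& out) {
    Eigen::Matrix<float, N, N> a = in;  // working copy, reduced to I
    for (int r = 0; r < N; ++r)         // out starts as the identity and
        for (int c = 0; c < N; ++c)     // accumulates the inverse
            out(r, c) = (r == c) ? 1.0f : 0.0f;
    for (int col = 0; col < N; ++col) {
        // Partial pivoting: pick the row with the largest |entry| in this column.
        int pivot = col;
        for (int row = col + 1; row < N; ++row)
            if (fabsf(a(row, col)) > fabsf(a(pivot, col))) pivot = row;
        if (a(pivot, col) == 0.0f) return false;  // singular
        for (int k = 0; k < N; ++k) {   // swap rows col <-> pivot
            float t = a(col, k); a(col, k) = a(pivot, k); a(pivot, k) = t;
            t = out(col, k); out(col, k) = out(pivot, k); out(pivot, k) = t;
        }
        const float inv = 1.0f / a(col, col);  // normalize the pivot row
        for (int k = 0; k < N; ++k) { a(col, k) *= inv; out(col, k) *= inv; }
        // Eliminate this column from every other row.
        for (int row = 0; row < N; ++row) {
            if (row == col) continue;
            const float f = a(row, col);
            for (int k = 0; k < N; ++k) {
                a(row, k) -= f * a(col, k);
                out(row, k) -= f * out(col, k);
            }
        }
    }
    return true;
}

In the kernel above, mat B = (*A).inverse(); would become mat B; bool ok = gaussJordanInverse(*A, B); (N is deduced from the matrix type), sidestepping the host-only path entirely.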
