How to install CUDA Toolkit in Debian 12 Bookworm

After installing Nvidia Drivers our next goal is to install CUDA Toolkit.

Step 1: Download the Toolkit

Go to the CUDA downloads page (cuda-downloads) and select your operating system, architecture, and distribution. This tutorial covers only the local installer. In my case that is Debian 12 on x86_64. The download page then shows a link of the form,

wget https://developer.download.nvidia.com/compute/cuda/12.X.Y/local_installers/cuda-repo-debian12-12-X-local_12.X.Y-Z-1_amd64.deb

Paste it into the terminal. In my case it is,

wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda-repo-debian12-12-8-local_12.8.1-570.124.06-1_amd64.deb

This downloads a Debian package file (.deb) to the current directory.

Step 2: Install the Toolkit

Install the package with,

sudo dpkg -i cuda-repo-debian12-12-8-local_12.8.1-570.124.06-1_amd64.deb

After installation, copy the repository signing key into place with,

sudo cp /var/cuda-repo-debian12-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/

Update the package index and complete the installation,

sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
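
The toolkit packages install under /usr/local/cuda-12.8, whose bin directory is not on PATH by default. A minimal sketch of the environment setup (the version directory depends on the release you installed):

```shell
# Add the CUDA toolkit to the current shell's environment.
# /usr/local/cuda-12.8 is the install prefix for toolkit 12.8;
# adjust the version if you installed a different release.
export PATH=/usr/local/cuda-12.8/bin:${PATH}
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:${LD_LIBRARY_PATH:-}

# Confirm the compiler is reachable (prints the release once installed).
if command -v nvcc >/dev/null; then
    nvcc --version
fi
```

To make this persistent, append the two export lines to ~/.bashrc or an equivalent shell profile.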

Step 3: Install CUDA Drivers

The final step is to install the NVIDIA drivers. NVIDIA provides two flavors of drivers: the open kernel module flavor, installable via,

sudo apt-get install -y nvidia-open

and the proprietary kernel module flavor, installable via,

sudo apt-get install -y cuda-drivers

My preference is the proprietary kernel module flavor. Reboot after installing the driver so the new kernel module is loaded.
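
Either way, once the drivers are installed (and the machine rebooted), they can be sanity-checked with nvidia-smi. The guard below is only so the snippet degrades gracefully when no driver is present yet:

```shell
# nvidia-smi ships with the driver and reports the driver version,
# the highest CUDA version it supports, and all visible GPUs.
if command -v nvidia-smi >/dev/null; then
    nvidia-smi
else
    echo "nvidia-smi not found - drivers are not installed (or a reboot is pending)"
fi
```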

Step 4: Verifying Installation

We can verify the installation by building and running the sample programs from the NVIDIA/cuda-samples repository. First clone the repository,

git clone https://github.com/NVIDIA/cuda-samples.git

then change into it and build the examples with CMake,

cd cuda-samples
cmake -Bbuild
cmake --build build

This builds all the examples, which can take a while. Run one of them, for example the matrix multiplication sample,

./build/Samples/0_Introduction/matrixMul/matrixMul

If it runs successfully, the output will look similar to this:

[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Ada" with compute capability 8.9

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 580.52 GFlop/s, Time= 0.226 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

A final line of Result = PASS means CUDA has been installed successfully; otherwise, revisit the installation steps or consult the official documentation.
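
As a lighter-weight check than building the whole samples tree, a minimal CUDA program can be compiled with nvcc directly. This is a sketch, assuming nvcc is on PATH and using a hypothetical file name hello.cu:

```shell
# Write a tiny CUDA program: a kernel in which each thread
# records its own index; the host copies the result back and prints it.
cat > hello.cu <<'EOF'
#include <cstdio>

__global__ void fill(int *out) {
    out[threadIdx.x] = threadIdx.x;
}

int main() {
    int host[4] = {0, 0, 0, 0};
    int *dev = nullptr;
    cudaMalloc(&dev, sizeof(host));
    fill<<<1, 4>>>(dev);
    cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    printf("%d %d %d %d\n", host[0], host[1], host[2], host[3]);
    return 0;
}
EOF

# Compile and run it (guarded so the snippet is a no-op without nvcc).
if command -v nvcc >/dev/null; then
    nvcc hello.cu -o hello && ./hello
fi
```

On a working installation the program should print 0 1 2 3; all zeros or a launch error usually points at a driver/toolkit mismatch.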