I’ve been wanting to try GPU programming for a while.  My non-work laptop, which is now installed in a Windows-10 + Linux Fedora 34 dual boot configuration, has a GPU that I can play with.  lspci -v shows that it is:

GeForce GTX 1660 Ti Mobile
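
Since lspci -v prints a lot more than the GPU, a small filter helps pick out the relevant line.  This is only a sketch: filter_gpu_lines is a made-up helper name, and the class strings it matches are the ones lspci commonly prints for display adapters, which may differ on other hardware.

```shell
#!/bin/sh
# Keep only the lspci lines that describe a display adapter.
# (filter_gpu_lines is a hypothetical helper name, not part of lspci.)
filter_gpu_lines() {
    grep -i -E 'VGA compatible controller|3D controller|Display controller'
}

# Only query the hardware when lspci is actually installed.
if command -v lspci >/dev/null 2>&1; then
    lspci | filter_gpu_lines || echo "no display adapter lines found"
fi
```

On a laptop with hybrid graphics this would probably list both the integrated GPU and the discrete NVIDIA one.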

I’m sure this is an underpowered GPU compared to what you’d find in a desktop gaming machine (my stepson has an RTX 3XXX series GPU in his machine, which I’m sure could be made to do much more interesting things — although he thinks it’s for games.)

Setting up the nvidia driver and the cuda SDK on Linux turned out to be a bit more trouble than I figured.  This required:

  1. Installing the cuda SDK.
  2. Building a downlevel gcc version (gcc-10) so that I could run the cuda SDK samples, as Fedora 34 ships with gcc-11, and the SDK doesn’t like that.
  3. Disabling the default nouveau driver.
  4. Building and installing a Linux kernel from source, bypassing the default Fedora kernel, which has a debug configuration that enforces GPL symbol purity.
  5. Manually installing the nvidia driver.

Step 1.  Installing the cuda SDK.

The nvidia site has an options dialogue for selecting the packages for your operating system.  The closest I was able to select was Fedora 33, for which the installation instructions were:

wget https://developer.download.nvidia.com/compute/cuda/11.3.1/local_installers/cuda-repo-fedora33-11-3-local-11.3.1_465.19.01-1.x86_64.rpm
sudo rpm -i cuda-repo-fedora33-11-3-local-11.3.1_465.19.01-1.x86_64.rpm
sudo dnf clean all
sudo dnf -y module install nvidia-driver:latest-dkms
sudo dnf -y install cuda

Needless to say, this didn’t work, at least not completely.  The SDK itself did install: after installation (and reboot) I was able to create a working copy of the SDK samples using:

/usr/local/cuda-11.3/bin/cuda-install-samples-11.3.sh

Trying to build one of those samples bombs right away, with errors like:

139 | #error -- unsupported GNU version! gcc versions later than 10 are not supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
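
The same check can be made before attempting a build.  The sketch below compares the major version reported by gcc -dumpversion against a limit.  The limit of 10 is taken from the error message above and applies to CUDA 11.3; host_gcc_supported is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed when the major version in $1 (e.g. "10.3.0") is <= the limit in $2.
# host_gcc_supported is a hypothetical helper; the limit of 10 comes from the
# nvcc error message and applies to CUDA 11.3.
host_gcc_supported() {
    major=${1%%.*}
    [ "$major" -le "$2" ]
}

# Report on whatever gcc is first in PATH, if there is one.
if command -v gcc >/dev/null 2>&1; then
    version=$(gcc -dumpversion)
    if host_gcc_supported "$version" 10; then
        echo "gcc $version: should pass the nvcc version check"
    else
        echo "gcc $version: too new, nvcc will refuse it"
    fi
fi
```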

Step 2.  Build a downlevel gcc:

git clone git://gcc.gnu.org/git/gcc.git
cd gcc
git checkout releases/gcc-10.3.0
contrib/download_prerequisites
mkdir ../build-gcc
cd ../build-gcc
../gcc/configure --prefix=$HOME/gcc-10 --disable-multilib
make -j12
make install
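
For reference, one way to point a build at the downlevel compiler is nvcc’s -ccbin option.  As far as I know, the sample Makefiles also expose this through a HOST_COMPILER variable, but treat that as an assumption and check the Makefile you have.  The sketch below only prints the commands.  The helper names and hello.cu are hypothetical, and the path assumes the $HOME/gcc-10 install prefix from the configure step.

```shell
#!/bin/sh
# Assumed install prefix: matches the --prefix passed to configure.
GCC10_BIN="$HOME/gcc-10/bin"

# Print the make command that builds a sample with the downlevel compiler.
# (sample_build_command is a hypothetical helper name; HOST_COMPILER is
# assumed to be honoured by the sample Makefiles.)
sample_build_command() {
    printf 'make HOST_COMPILER=%s/g++\n' "$1"
}

# Print the equivalent direct nvcc invocation for one source file.
# (nvcc_command is a hypothetical helper name.)
nvcc_command() {
    printf 'nvcc -ccbin %s/g++ -o %s %s\n' "$1" "${2%.cu}" "$2"
}

sample_build_command "$GCC10_BIN"
nvcc_command "$GCC10_BIN" hello.cu
```

Printing the commands rather than running them keeps the sketch harmless on a machine without the SDK installed.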

With that done, I’m able to compile CUDA samples, but they all fail with cudaGetDeviceCount errors, like so:

matrixMul> ./matrixMul
[Matrix Multiply Using CUDA] - Starting…
CUDA error at …/…/common/inc/helper_cuda.h:779 code=100(cudaErrorNoDevice) “cudaGetDeviceCount(&device_count)”

A bit of googling shows that those errors all mean that the nvidia driver isn’t running or installed properly. This was confirmed by trying to run nvidia-smi which gave me:

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
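
A quick way to see which GPU kernel module (if any) is actually loaded is to look at lsmod.  This is a minimal sketch: loaded_gpu_module is a hypothetical helper name, and it only distinguishes the nvidia and nouveau modules.

```shell
#!/bin/sh
# Read lsmod-style output on stdin and report which GPU kernel module is loaded.
# Prints "nvidia", "nouveau", or "none".
# (loaded_gpu_module is a hypothetical helper name.)
loaded_gpu_module() {
    awk '$1 == "nvidia" { n = 1 } $1 == "nouveau" { o = 1 }
         END { if (n) print "nvidia"; else if (o) print "nouveau"; else print "none" }'
}

# Only run against the live system when lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    echo "GPU kernel module: $(lsmod | loaded_gpu_module)"
fi
```

If this reports nouveau or none, a CUDA program will not find a device, which matches the cudaErrorNoDevice failures above.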

Step 3.  Disabling the nouveau driver.

Googling suggests that the following might help:

sudo su -
echo "blacklist nouveau" > /etc/modprobe.d/blacklist.conf
dracut -f
reboot

(but it didn’t.) I’m not sure if this step was required or not, but I haven’t undone it.
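
To confirm the blacklist entry is really in place, the file can be checked for an active (uncommented) line.  Note that the > redirect above replaces any existing blacklist.conf rather than appending to it.  In this sketch, is_blacklisted is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if file $1 contains an active (uncommented) "blacklist $2" line.
# (is_blacklisted is a hypothetical helper name.)
is_blacklisted() {
    grep -q -E "^[[:space:]]*blacklist[[:space:]]+$2([[:space:]]|\$)" "$1" 2>/dev/null
}

conf=/etc/modprobe.d/blacklist.conf
if is_blacklisted "$conf" nouveau; then
    echo "nouveau is blacklisted in $conf"
else
    echo "no nouveau blacklist entry in $conf"
fi
```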

Step 4. New kernel build and install.

I tried installing the nvidia driver following instructions from JR:

The basic steps are:

  • download the driver .run file.
  • telinit 3 to switch to console mode.
  • try running the NVIDIA-Linux-x86_64-465.31.run installer.
  • Look at /var/log/nvidia-installer.log and see what went wrong.
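
The installer log is long, so it helps to pull out just the lines that look like failures.  A rough sketch, assuming the interesting lines contain "error" or "fatal" (installer_errors is a hypothetical helper name):

```shell
#!/bin/sh
# Print numbered lines from log file $1 that mention an error.
# (installer_errors is a hypothetical helper name.)
installer_errors() {
    grep -n -i -E 'error|fatal' "$1"
}

log=/var/log/nvidia-installer.log
# Only scan the log if an installer run has actually produced one.
if [ -r "$log" ]; then
    installer_errors "$log" || echo "no error lines in $log"
fi
```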

The first error I found was that the symlink /lib/modules/5.12.8-300.fc34.x86_64/build was a dead link.  It seemed that I didn’t have matching kernel sources and modules.  I upgraded:

sudo yum clean all
sudo yum -y upgrade

to grab matching kernel+sources (figuring there was an update available.) That also didn’t work, because /lib/modules//build pointed to a -debug location that wasn’t available. Correcting that link gave me different errors, namely GPL symbol errors like:

FATAL: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol ‘mutex_destroy’

(there was a whole pile of similar errors.)
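
The dead build symlink from earlier can be detected before running the installer at all.  The sketch below classifies the build link for the running kernel.  build_link_status is a hypothetical helper name.

```shell
#!/bin/sh
# Classify a kernel build link: "ok", "dangling" (symlink to nowhere), or "missing".
# (build_link_status is a hypothetical helper name.)
build_link_status() {
    if [ -e "$1" ]; then
        # -e follows symlinks, so the target really exists.
        echo ok
    elif [ -L "$1" ]; then
        # The link itself exists, but its target does not.
        echo dangling
    else
        echo missing
    fi
}

link="/lib/modules/$(uname -r)/build"
echo "$link: $(build_link_status "$link")"
```

Anything other than ok means the driver installer has no kernel build tree to compile against.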

To build a kernel that didn’t have the GPL issues (which apparently come with the default Fedora kernel due to some sort of debug configuration), I ran:

sudo dnf group install "Development Tools"
sudo dnf install ncurses-devel bison flex elfutils-libelf-devel openssl-devel

git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
git checkout v5.12.9
cp /boot/config-5.12.9-300.fc34.x86_64 .config
make oldconfig
make -j12
sudo make modules_install
sudo make install
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo grubby --set-default /boot/vmlinuz-5.12.9
reboot
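
It may be worth checking which debug option is involved before and after the rebuild.  As far as I know, mutex_destroy only exists as an exported GPL-only symbol when CONFIG_DEBUG_MUTEXES is enabled, but that is an assumption on my part, not something established above.  The sketch checks the running kernel’s config and a .config in the current directory.  config_enabled is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if option $2 is set to y or m in kernel config file $1.
# (config_enabled is a hypothetical helper name.)
config_enabled() {
    grep -q -E "^$2=(y|m)\$" "$1" 2>/dev/null
}

# Assumption: CONFIG_DEBUG_MUTEXES is the option behind the mutex_destroy error.
for cfg in "/boot/config-$(uname -r)" .config; do
    [ -r "$cfg" ] || continue
    if config_enabled "$cfg" CONFIG_DEBUG_MUTEXES; then
        echo "$cfg: CONFIG_DEBUG_MUTEXES is enabled"
    else
        echo "$cfg: CONFIG_DEBUG_MUTEXES is not enabled"
    fi
done
```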

Step 5. Install the nvidia driver manually, last try:

After reboot:

telinit 3
login
/path/to/NVIDIA-Linux-x86_64-465.31.run
reboot

With all this done, nvidia-smi runs successfully:

Thu Jun 10 00:30:35 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31       Driver Version: 465.31       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   44C    P8     5W /  N/A |      5MiB /  5944MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5867      G   /usr/libexec/Xorg                   4MiB |
+-----------------------------------------------------------------------------+
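
For scripting, nvidia-smi can also emit the same information as CSV through its --query-gpu and --format options, which is easier to parse than the table.  A small sketch (summarize_gpu is a hypothetical helper name, and the choice of fields is just an example):

```shell
#!/bin/sh
# Turn one "name, driver_version, memory.total" CSV line into a short summary.
# (summarize_gpu is a hypothetical helper name.)
summarize_gpu() {
    awk -F', *' '{ printf "%s (driver %s, %s)\n", $1, $2, $3 }'
}

# Only query the driver when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader | summarize_gpu
fi
```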

Now I should be set up to try some CUDA apps (the samples, GPU crypto miners, parallel numerical code, password crackers, or whatever else might be interesting to fool around with.) I’ve got a couple of CUDA books on order from the Toronto public library, and will start fooling around in more depth once I get those.