Troubleshooting & FAQ

No BAR space

When borrowing a device, the kernel might complain that there is not enough BAR space:

[  950.330178] pci 0000:01:07.3: BAR 0 [mem size 0x00004000 64bit]: can't assign; no space
[  950.330180] pci 0000:01:07.3: BAR 0 [mem size 0x00004000 64bit]: failed to assign

These messages are cosmetic and can be safely ignored.

Removing device with non-zero usage count

When making an NVIDIA GPU available with smartio_tool available, or returning it with smartio_tool return, smartio_tool might hang and the operating system might log the following message:

NVRM: Attempting to remove device 0000:0f:07.5 with non-zero usage count!

This means that the NVIDIA GPU is still in use by one or more processes. To complete the operation, all of these processes must be terminated. You might need to do one or more of the following:

Kill processes using the device files

Find the device minor number of the GPU, substituting the PCI address of your GPU in the path:

$ cat /proc/driver/nvidia/gpus/0000\:07\:00.0/information | grep -i "Device Minor"
Device Minor:    0

Grab the minor number (0 in this case), and list all processes using the device file /dev/nvidia0. You can list them with fuser -v /dev/nvidia0 and kill them with fuser -k /dev/nvidia0.
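As a minimal sketch, the lookup above can be scripted. The helper name nvidia_device_minor is hypothetical, and the PCI address is the one from the example above. The sketch only lists processes with fuser -v; it does not kill anything.

```shell
# Hypothetical helper: print the "Device Minor" value from the NVIDIA
# GPU information file given as $1.
nvidia_device_minor() {
    awk '/Device Minor/ { print $NF; exit }' "$1"
}

# Example address from above; substitute the PCI address of your GPU.
info='/proc/driver/nvidia/gpus/0000:07:00.0/information'
if [ -r "$info" ]; then
    minor=$(nvidia_device_minor "$info")
    # List (do not kill) the processes that hold the device file open.
    fuser -v "/dev/nvidia${minor}"
fi
```

Once you have reviewed the listed processes, fuser -k on the same device file terminates them.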

Disable NVIDIA modesetting

If modesetting is enabled:

$ cat /etc/modprobe.d/nvidia-graphics-drivers-kms.conf 
# This file was generated by nvidia-driver-550
# Set value to 0 to disable modesetting
options nvidia-drm modeset=1

… you can try to reload nvidia-drm with modesetting disabled:

$ sudo modprobe -r nvidia_drm 
$ sudo modprobe nvidia_drm modeset=0

This might release a hanging smartio_tool. You might also need to reboot.
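To check whether modesetting is enabled in the configuration without reading the file by eye, a small parser such as the following might help. This is a sketch only: the function name configured_modeset is hypothetical, and it assumes the option is written on a single "options nvidia-drm ..." line as in the file shown above.

```shell
# Hypothetical helper: print the modeset value (0 or 1) configured for
# nvidia-drm in the modprobe configuration file given as $1.
# Comment lines are ignored because they do not start with "options".
configured_modeset() {
    sed -n 's/^options[[:space:]]\{1,\}nvidia[-_]drm[[:space:]].*modeset=\([01]\).*/\1/p' "$1"
}
```

For example, configured_modeset /etc/modprobe.d/nvidia-graphics-drivers-kms.conf would print 1 for the file shown above. The function prints nothing if the file contains no such line.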

Stop NVIDIA persistence daemon

If nvidia-persistenced is running, it might be what keeps the GPU in use:

$ ps -aux | grep -i "nvidia"
...
nvidia-+    5758  0.0  0.0   5688  2420 ?        Ss   16:53   0:00 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
...

If that’s the case, try to stop and disable the daemon:

$ sudo systemctl disable nvidia-persistenced.service
$ sudo systemctl stop nvidia-persistenced.service

Stop NVIDIA DCGM

If nvidia-dcgm is running, it might be what keeps the GPU in use. Try stopping and disabling the service:

$ sudo systemctl disable nvidia-dcgm.service
$ sudo systemctl stop nvidia-dcgm.service
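To see which of the two services is actually running before stopping anything, a check like the one below could be used. It is a sketch that assumes the unit names shown above; the function name active_units is hypothetical.

```shell
# Hypothetical helper: print each of the given systemd units that is
# currently active, one per line.
active_units() {
    for unit in "$@"; do
        # "is-active --quiet" exits with 0 only when the unit is active.
        if systemctl is-active --quiet "$unit"; then
            printf '%s\n' "$unit"
        fi
    done
}
```

Calling active_units nvidia-persistenced.service nvidia-dcgm.service prints only the units that are running, so an empty result suggests that neither service is the cause.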

No irq handler for vector in console

Possible cause 1: irqbalance

If you see any of the following messages in the kernel log buffer, irqbalance might be the issue:

$ sudo dmesg
RmInitAdapter: osVerifySystemEnvironment failed, bailing!
GPU 0000:21:08.0: RmInitAdapter failed! (0x11:0x45:2718)
...
__common_interrupt: 0.37 No irq handler for vector

Disable irqbalance, then reboot.

$ sudo systemctl disable irqbalance.service

Possible cause 2: AMD EPYC lender, Intel borrower

We have observed an issue when using AMD EPYC CPUs on the lender, caused by an architectural design choice on the lender CPU. If the CPU on the borrowing side supports the intremap kernel command-line parameter, try booting the borrower with intremap=nosid to work around this issue.
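After rebooting, you may want to confirm that the parameter actually reached the kernel. The sketch below reads /proc/cmdline, which holds the command line of the running kernel; the function name has_kernel_param is hypothetical, and how you add the parameter in the first place depends on your boot loader.

```shell
# Hypothetical helper: succeed if parameter $1 appears as a complete
# word on the kernel command line. $2 is the file to read and defaults
# to /proc/cmdline.
has_kernel_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

For example, has_kernel_param intremap=nosid && echo "parameter is active" prints the message only when the borrower was booted with the parameter.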

Possible cause 3: AMD EPYC lender, AMD borrower

We have observed an issue when using AMD EPYC CPUs on the lender, caused by an architectural design choice on the lender CPU combined with a Linux kernel bug on the borrower. This leads to an interrupt failure on the borrower. On certain AMD EPYC CPUs, starting from Linux kernel v6.4.12, the kernel command-line parameter amd_iommu=irtcachedis works around this issue. To see whether your CPU supports the parameter, boot with it and then check the kernel log buffer:

$ sudo dmesg | grep -i iommu
...
AMD-Vi: iommu0 (0x0002) : IRT cache is disabled
...

If the log says IRT cache is disabled, your CPU supports the parameter. If it says IRT cache is enabled even though you booted with amd_iommu=irtcachedis, your CPU does not support it.
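The check can be sketched as a small filter over the kernel log. The function name irt_cache_state is hypothetical, and it assumes the message wording shown in the log excerpt above.

```shell
# Hypothetical helper: read kernel log text on stdin and print
# "disabled", "enabled" or "unknown" depending on the IRT cache message.
irt_cache_state() {
    log=$(cat)
    case $log in
        *"IRT cache is disabled"*) echo disabled ;;
        *"IRT cache is enabled"*)  echo enabled ;;
        *)                         echo unknown ;;
    esac
}
```

Running sudo dmesg | irt_cache_state after booting with the parameter should print disabled on a CPU that supports it. A result of unknown means that no such message was found, for example because the log buffer has already wrapped.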

Disabling the IRT cache will affect interrupt performance.

We are working on resolving this issue in future Linux kernel releases. Keep an eye out for future eXpressWare release notes for updates on this issue, or contact us for a kernel patch.

Nothing is displayed on the monitor attached to the borrowed NVIDIA GPU

To start using the graphical output of a borrowed GPU, reload the nvidia_drm module and restart the display manager. If you use a display manager other than gdm, substitute its name in the commands below.

$ sudo systemctl stop gdm
$ sudo modprobe -r nvidia_drm
$ sudo modprobe nvidia_drm
$ sudo systemctl start gdm

When returning the GPU, you must also stop the display manager and unload the nvidia_drm module before the GPU can be returned. Afterwards, load nvidia_drm and start the display manager again to use only the local GPU.