Troubleshooting & FAQ

No BAR space

When borrowing a device, the kernel might report that there is not enough BAR space:

[  950.330178] pci 0000:01:07.3: BAR 0 [mem size 0x00004000 64bit]: can't assign; no space
[  950.330180] pci 0000:01:07.3: BAR 0 [mem size 0x00004000 64bit]: failed to assign

These messages are cosmetic and can be safely ignored.


NVIDIA GPUs in use

When making an NVIDIA GPU available with smartio_tool available, or returning it with smartio_tool return, smartio_tool might hang and the operating system might log this:

NVRM: Attempting to remove device 0000:0f:07.5 with non-zero usage count!

This means that the NVIDIA GPU is still in use by one or more processes. To complete the return, all of these processes must be terminated. You might need to take one or more of the following actions:

Kill processes using the device files

$ cat /proc/driver/nvidia/gpus/0000\:07\:00.0/information | grep -i "Device Minor"
Device Minor:    0

Grab the minor number (0 in this case), and list all processes using the device file /dev/nvidia0. You can list them with fuser -v /dev/nvidia0 and kill them with fuser -k /dev/nvidia0.
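
The lookup above can be wrapped in a small helper. This is a minimal sketch: the function name nvidia_minor is hypothetical, and it assumes the information file contains a "Device Minor:" line in the format shown above.

```shell
# Print the minor number found in an NVIDIA "information" file.
# Usage: nvidia_minor /proc/driver/nvidia/gpus/<BDF>/information
nvidia_minor() {
    # Split on ":" and print the value of the "Device Minor" line,
    # with surrounding spaces and tabs removed.
    awk -F: '/^Device Minor/ { gsub(/[ \t]/, "", $2); print $2; exit }' "$1"
}
```

The result can then be passed to fuser:

$ minor=$(nvidia_minor /proc/driver/nvidia/gpus/0000\:07\:00.0/information)
$ fuser -v "/dev/nvidia$minor"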

Disable NVIDIA modesetting

If modesetting is enabled:

$ cat /etc/modprobe.d/nvidia-graphics-drivers-kms.conf 
# This file was generated by nvidia-driver-550
# Set value to 0 to disable modesetting
options nvidia-drm modeset=1

… you can try to reload nvidia-drm with modesetting disabled:

$ sudo modprobe -r nvidia_drm 
$ sudo modprobe nvidia_drm modeset=0

This might release a hanging smartio_tool. You might also need to reboot.
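
To confirm whether modesetting is actually active on the running system, rather than only configured in modprobe.d, the module parameter can be read back. The sketch below assumes the driver exposes it at /sys/module/nvidia_drm/parameters/modeset; the function name nvidia_modeset_state is hypothetical.

```shell
# Report whether nvidia-drm modesetting is active on the running system.
# The optional argument overrides the sysfs path (assumed default below).
nvidia_modeset_state() {
    param="${1:-/sys/module/nvidia_drm/parameters/modeset}"
    if [ ! -r "$param" ]; then
        # The parameter file is missing, so nvidia_drm is probably not loaded.
        echo "not loaded"
        return 0
    fi
    case "$(cat "$param")" in
        Y|1) echo "enabled" ;;
        *)   echo "disabled" ;;
    esac
}
```

After reloading nvidia_drm with modeset=0, nvidia_modeset_state should print disabled.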

Stop NVIDIA persistence daemon

If you have nvidia-persistenced running, that might be your issue:

$ ps -aux | grep -i "nvidia"
...
nvidia-+    5758  0.0  0.0   5688  2420 ?        Ss   16:53   0:00 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
...

If that’s the case, try to stop and disable the daemon:

$ sudo systemctl disable nvidia-persistenced.service
$ sudo systemctl stop nvidia-persistenced.service

Stop NVIDIA DCGM

If you have nvidia-dcgm running, that might be your issue. Try stopping and disabling the service:

$ sudo systemctl disable nvidia-dcgm.service
$ sudo systemctl stop nvidia-dcgm.service

Interrupt Delivery Issues

Issues with interrupt delivery on borrowed devices show different symptoms depending on the borrowed device, the system configuration and the platforms used. There are multiple known causes for these interrupt issues. Keep reading to find a suitable mitigation for your situation.

On borrowed NVMe drives with interrupt delivery issues you may see messages like the following in dmesg:

nvme nvme1: I/O tag 24 (0018) QID 0 timeout, disable controller
nvme nvme1: Device not ready; aborting shutdown, CSTS=0x1
nvme nvme1: Identify Controller failed (-4)
nvme: probe of 0000:01:07.2 failed with error -5

With borrowed NVIDIA GPUs, interrupt delivery issues usually result in the following messages in dmesg:

RmInitAdapter: osVerifySystemEnvironment failed, bailing!
GPU 0000:21:08.0: RmInitAdapter failed! (0x11:0x45:2718)

Sometimes, if the interrupt was forwarded, but not to the correct vector, you may see messages such as:

__common_interrupt: 0.37 No irq handler for vector
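
The symptoms above can be searched for in one pass. This is a minimal sketch, not an exhaustive detector: the function name classify_irq_symptoms and the labels it prints are hypothetical, and the patterns are taken only from the example messages above.

```shell
# Read kernel log text on stdin and print one label per known symptom found.
classify_irq_symptoms() {
    log=$(cat)
    printf '%s\n' "$log" | grep -q 'nvme.*timeout, disable controller' \
        && echo "nvme-timeout"
    printf '%s\n' "$log" | grep -q 'RmInitAdapter failed' \
        && echo "nvidia-init-failed"
    printf '%s\n' "$log" | grep -q 'No irq handler for vector' \
        && echo "wrong-vector"
    # Always succeed, even when no symptom was found.
    return 0
}
```

Usage:

$ sudo dmesg | classify_irq_symptoms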

Intel-based borrower: Source Id Verification

If the borrower is an Intel-based system, you may run into issues with interrupt source-id verification. This will result in the following error messages in dmesg:

DMAR: DRHD: handling fault status reg 2
DMAR: [INTR-REMAP] Request device [01:07.0] fault index 0x2c [fault reason 0x26] Blocked an interrupt request due to source-id verification failure

This can happen when the interrupt is forwarded by the NTB with a different BDF than the local BDF of the borrowed device. This is known to occur when the lender is an AMD EPYC or Threadripper, but may occur on other platforms or in certain NTB topologies.

This can be resolved by disabling source-id verification with the kernel parameter intremap=nosid. Note that disabling interrupt source-id verification may have security implications: for example, a malicious device could trigger an interrupt vector belonging to a different device.
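
How the parameter is added depends on the boot loader. On GRUB-based distributions it is typically appended to GRUB_CMDLINE_LINUX in /etc/default/grub before regenerating the GRUB configuration. After rebooting, you can verify that the running kernel received it. The helper below is a minimal sketch with a hypothetical name, has_kernel_param.

```shell
# Succeed if the given kernel parameter appears as a whole word in the
# command line file (defaults to /proc/cmdline of the running kernel).
has_kernel_param() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # One parameter per line, then match the full line literally.
    tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}
```

Usage:

$ has_kernel_param intremap=nosid && echo "source-id verification disabled"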

AMD-based borrower: IRT Cache Incoherency

If the borrower is an AMD EPYC or AMD Threadripper system, the interrupt remapper can end up with stale cache entries which can lead to interrupt delivery issues. This issue was fixed in kernel 7.1 and newer.

Upgrading your Linux kernel

The AMD IRT cache issue is fixed in kernel 7.1 and newer. The fix may also have been backported to older kernels. Look for the commit titled "iommu/amd: Invalidate IRT cache for DMA aliases".

Disabling IRT Cache

On certain AMD EPYC CPUs, starting from Linux kernel v6.4.12, the kernel command-line parameter amd_iommu=irtcachedis works around this issue. You can see whether your CPU supports the parameter by checking the kernel log buffer after booting with the parameter set:

$ sudo dmesg | grep -i iommu
(...)
AMD-Vi: iommu0 (0x0002) : IRT cache is disabled
(...)

If it says IRT cache is disabled, your CPU supports it. If it says IRT cache is enabled even though you booted with amd_iommu=irtcachedis, your CPU does not support disabling the IRT cache.

Disabling the IRT cache will affect interrupt performance in certain workloads.
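
The check can be scripted as follows. This is a minimal sketch that assumes the log messages have exactly the wording shown above; the function name irt_cache_status is hypothetical.

```shell
# Read kernel log text on stdin and report the IRT cache state it mentions.
irt_cache_status() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'IRT cache is disabled'; then
        echo "disabled"
    elif printf '%s\n' "$log" | grep -q 'IRT cache is enabled'; then
        echo "enabled"
    else
        # No matching message, for example because the log has rotated.
        echo "unknown"
    fi
}
```

Usage:

$ sudo dmesg | irt_cache_status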

Enabling nvidia-persistenced

If you are borrowing an NVIDIA GPU, enabling the NVIDIA persistence daemon may mitigate the interrupt issue by keeping the nvidia driver loaded, so that it keeps using the same interrupt vector.

$ sudo systemctl enable --now nvidia-persistenced

Please refer to NVIDIA documentation for more information.

Disabling irqbalance

The IRT cache bug is triggered when the interrupt vector of a borrowed device is changed. Because of this, services like irqbalance can provoke the bug, and disabling them may help. This is only a partial mitigation.

$ sudo systemctl disable irqbalance.service
$ sudo reboot

Nothing is displayed on the monitor attached to the borrowed NVIDIA GPU

To start using the graphical output of a borrowed GPU, you need to reload the nvidia_drm module and restart the display manager. If you use a display manager other than gdm, replace gdm in the commands below with its service name.

$ sudo systemctl stop gdm
$ sudo modprobe -r nvidia_drm
$ sudo modprobe nvidia_drm
$ sudo systemctl start gdm

When returning the GPU, you also need to stop the display manager and unload the nvidia_drm module before the GPU can be returned. Afterwards, reload nvidia_drm and start the display manager again to use only the local GPU.
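
The order of operations for returning the GPU can be summarized as a command sequence. The helper below is a sketch that only prints the steps instead of running them. The function name gpu_return_steps is hypothetical, and the smartio_tool return arguments are left out because they depend on your setup.

```shell
# Print (do not run) the command sequence for returning a borrowed GPU.
# $1 is the display manager service name; gdm is used when omitted.
gpu_return_steps() {
    dm="${1:-gdm}"
    echo "sudo systemctl stop $dm"
    echo "sudo modprobe -r nvidia_drm"
    echo "# return the GPU here with smartio_tool return"
    echo "sudo modprobe nvidia_drm"
    echo "sudo systemctl start $dm"
}
```

Running gpu_return_steps sddm, for example, prints the same sequence for the sddm service.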