Troubleshooting & FAQ
No BAR space
When borrowing a device, the kernel might complain about not enough BAR space:
[ 950.330178] pci 0000:01:07.3: BAR 0 [mem size 0x00004000 64bit]: can't assign; no space
[ 950.330180] pci 0000:01:07.3: BAR 0 [mem size 0x00004000 64bit]: failed to assign
These messages are cosmetic and can be safely ignored.
NVIDIA GPUs in use
When making an NVIDIA GPU available with smartio_tool available or returning it
with smartio_tool return, smartio_tool might hang and the operating system
might log this:
NVRM: Attempting to remove device 0000:0f:07.5 with non-zero usage count!
This means that the NVIDIA GPU is still in use by one or more processes. To complete the return, all of these processes must be terminated. You might need to do one or more of the following actions:
Kill processes using the device files
$ cat /proc/driver/nvidia/gpus/0000\:07\:00.0/information | grep -i "Device Minor"
Device Minor: 0
Grab the minor number (0 in this case), and list all processes using the device file /dev/nvidia0. You can list them with fuser -v /dev/nvidia0 and kill them with fuser -k /dev/nvidia0.
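The lookup above can be scripted. The following is a minimal sketch, not part of smartio_tool: nvidia_device_minor is a hypothetical helper name, and it assumes the information file contains a "Device Minor: N" line as in the output shown above.

```shell
# Hypothetical helper: print the "Device Minor" value found in the
# NVIDIA "information" text supplied on stdin.
nvidia_device_minor() {
    awk -F': *' 'tolower($1) ~ /device minor/ {
        gsub(/[ \t]/, "", $2)   # strip padding around the number
        print $2
        exit
    }'
}

# Usage (the PCI address is the example from above; adjust it to your GPU):
#   minor=$(nvidia_device_minor < /proc/driver/nvidia/gpus/0000:07:00.0/information)
#   fuser -v "/dev/nvidia${minor}"   # list processes using the device file
#   fuser -k "/dev/nvidia${minor}"   # kill them
```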
Disable NVIDIA modesetting
If modesetting is enabled:
$ cat /etc/modprobe.d/nvidia-graphics-drivers-kms.conf
# This file was generated by nvidia-driver-550
# Set value to 0 to disable modesetting
options nvidia-drm modeset=1
… you can try to reload nvidia-drm with modesetting disabled:
$ sudo modprobe -r nvidia_drm
$ sudo modprobe nvidia_drm modeset=0
This might release a hanging smartio_tool. You might also need to reboot.
Stop NVIDIA persistence daemon
If you have nvidia-persistenced running, that might be your issue:
$ ps -aux | grep -i "nvidia"
...
nvidia-+ 5758 0.0 0.0 5688 2420 ? Ss 16:53 0:00 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
...
If that’s the case, try to stop and disable the daemon:
$ sudo systemctl disable nvidia-persistenced.service
$ sudo systemctl stop nvidia-persistenced.service
Stop NVIDIA DCGM
If you have nvidia-dcgm running, that might be your issue. Try stopping and disabling the service:
$ sudo systemctl disable nvidia-dcgm.service
$ sudo systemctl stop nvidia-dcgm.service
Interrupt Delivery Issues
Issues with interrupt delivery on borrowed devices have various symptoms depending on the borrowed device, the system configuration and the platforms used. There are multiple known causes for these interrupt issues. Keep reading to find a suitable mitigation for your situation.
On borrowed NVMe drives with interrupt delivery issues you may see messages
like the following in dmesg:
nvme nvme1: I/O tag 24 (0018) QID 0 timeout, disable controller
nvme nvme1: Device not ready; aborting shutdown, CSTS=0x1
nvme nvme1: Identify Controller failed (-4)
nvme: probe of 0000:01:07.2 failed with error -5
With borrowed NVIDIA GPUs interrupt delivery issues usually result in the
following messages in dmesg:
RmInitAdapter: osVerifySystemEnvironment failed, bailing!
GPU 0000:21:08.0: RmInitAdapter failed! (0x11:0x45:2718)
Sometimes, if the interrupt was forwarded, but not to the correct vector, you may see messages such as:
__common_interrupt: 0.37 No irq handler for vector
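To check a kernel log for the symptoms listed above in one pass, a filter like the following can help. This is a rough sketch: interrupt_symptoms is a hypothetical helper name, and the patterns are taken only from the example messages above, so other drivers or kernel versions may word their messages differently.

```shell
# Hypothetical helper: keep only the kernel log lines on stdin that match
# the NVMe, NVIDIA and vector symptoms shown above.
interrupt_symptoms() {
    grep -E 'QID [0-9]+ timeout|RmInitAdapter failed|No irq handler for vector'
}

# Usage:
#   sudo dmesg | interrupt_symptoms
```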
Intel-based borrower: Source Id Verification
If the borrower is an Intel-based system, you may run into issues with
interrupt source-id verification. This will result in the following error
messages in dmesg:
DMAR: DRHD: handling fault status reg 2
DMAR: [INTR-REMAP] Request device [01:07.0] fault index 0x2c [fault reason 0x26] Blocked an interrupt request due to source-id verification failure
This can happen when the interrupt is forwarded by the NTB with a different BDF than the local BDF of the borrowed device. This is known to occur when the lender is an AMD EPYC or Threadripper, but may occur on other platforms or in certain NTB topologies.
This can be resolved by disabling source-id verification with the kernel
parameter intremap=nosid. Note that disabling interrupt source-id
verification has security implications: for example, a malicious device
may trigger an interrupt vector belonging to a different device.
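After rebooting, it is worth confirming that the parameter actually reached the kernel. The sketch below uses a hypothetical helper, has_kernel_param, which checks a command line string (by default the running kernel's /proc/cmdline) for an exact parameter. How the parameter is added depends on your boot loader; the GRUB steps in the comments are an assumption about a Debian/Ubuntu-style setup and are not taken from this guide.

```shell
# Hypothetical helper: succeed if kernel parameter $1 appears as a whole
# word in command line $2 (defaults to the running kernel's /proc/cmdline).
has_kernel_param() {
    param=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *)            return 1 ;;
    esac
}

# Assumed GRUB workflow (Debian/Ubuntu-style; other distributions differ):
#   1. Append intremap=nosid to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
#   2. sudo update-grub && sudo reboot
#   3. has_kernel_param intremap=nosid && echo "source-id verification disabled"
```

The same check works for other parameters mentioned in this guide.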
AMD-based borrower: IRT Cache Incoherency
If the borrower is an AMD EPYC or AMD Threadripper system, the interrupt remapper can end up with stale cache entries, which can lead to interrupt delivery issues. This issue is fixed in kernel 7.1 and newer.
Upgrading your Linux kernel
The AMD IRT cache issue is fixed in kernel 7.1 and newer. The fix may also have been backported to older kernels; look for the commit titled iommu/amd: Invalidate IRT cache for DMA aliases.
Disabling IRT Cache
On certain AMD EPYC CPUs, starting from Linux kernel v6.4.12, a kernel
command-line parameter amd_iommu=irtcachedis works around this issue. You can
see if your CPU supports the parameter by checking the kernel log buffer when
booting with the parameter provided.
$ sudo dmesg | grep -i iommu
(...)
AMD-Vi: iommu0 (0x0002) : IRT cache is disabled
(...)
If it says IRT cache is disabled, your CPU supports it. If it says IRT cache is enabled despite booting with the parameter, your CPU does not support
disabling the IRT cache.
Disabling the IRT cache will affect interrupt performance in certain workloads.
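The check described above can be automated with a small filter over the kernel log. This is a minimal sketch: irt_cache_state is a hypothetical helper name, and it relies only on the two message texts quoted above, treating anything else as unknown.

```shell
# Hypothetical helper: read kernel log text on stdin and print
# "disabled", "enabled" or "unknown" for the AMD IRT cache.
irt_cache_state() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'IRT cache is disabled'; then
        echo disabled
    elif printf '%s\n' "$log" | grep -q 'IRT cache is enabled'; then
        echo enabled
    else
        echo unknown   # message absent, e.g. not an AMD IOMMU or log rotated
    fi
}

# Usage:
#   sudo dmesg | irt_cache_state
```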
Enabling nvidia-persistenced
If borrowing an NVIDIA GPU, enabling the NVIDIA persistence daemon may mitigate the interrupt issue by keeping the nvidia driver loaded and using the same interrupt vector.
$ sudo systemctl enable --now nvidia-persistenced
Please refer to NVIDIA documentation for more information.
Disabling irqbalance
The IRT cache bug is triggered when the interrupt vector of a borrowed device
is changed. Because of this, services like irqbalance can provoke the issue
and disabling them may help mitigate the issue. This is only a partial
mitigation.
$ sudo systemctl disable irqbalance.service
$ sudo reboot
Nothing is displayed on the monitor attached to the borrowed NVIDIA GPU
To start using the graphical output of a borrowed GPU, you need to reload the
nvidia_drm module and restart the display manager. Replace gdm in the commands
below if you’re using a different display manager.
$ sudo systemctl stop gdm
$ sudo modprobe -r nvidia_drm
$ sudo modprobe nvidia_drm
$ sudo systemctl start gdm
When returning the GPU, you will also need to stop the display manager and
unload the nvidia_drm module for the GPU to be returned, then restart the
display manager and nvidia_drm to use only the local GPU.
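The return sequence described above could look roughly like the following. This is a sketch under assumptions: return_gpu_steps is a hypothetical helper that only prints the commands instead of running them, gdm is assumed as the default display manager, and the smartio_tool return invocation is left as a comment because its arguments depend on your setup.

```shell
# Hypothetical helper: print the command sequence for returning a borrowed
# GPU. $1 is the display manager service name (gdm assumed by default).
return_gpu_steps() {
    dm=${1:-gdm}
    cat <<EOF
sudo systemctl stop $dm
sudo modprobe -r nvidia_drm
# return the GPU here with smartio_tool return
sudo modprobe nvidia_drm
sudo systemctl start $dm
EOF
}

# Usage: review the printed steps, then run them one at a time:
#   return_gpu_steps sddm
```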