Multi-host sharing of an Intel Arc Pro GPU with SR-IOV
SR-IOV can be used to split a physical device into multiple virtual devices, called virtual functions (VFs). Device Lending allows these virtual functions to be lent out and borrowed individually. This allows you, for example, to share a single GPU among all nodes in a cluster. This page shows how to share an Intel Arc Pro GPU using SR-IOV.
System Requirements
You must have a supported Intel Arc Pro B-series GPU.
The nodes must run Linux kernel 6.17 or newer to enable Intel Arc Pro SR-IOV.
You must have a Dolphin cluster with a topology supported by SmartIO.
The cluster nodes must follow the platform requirements and the lender must support peer-to-peer.
The cluster nodes must run a supported operating system.
The nodes must have a large enough NTB prefetchable size.
Enabling IOMMU is recommended.
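The kernel version requirement above can be checked with a small helper. This is a minimal sketch, not part of eXpressWare: the function name kernel_at_least is hypothetical, and it assumes GNU sort with -V (version sort) is available. Distribution suffixes such as -generic are stripped before comparing, so release-candidate kernels are treated like the final release.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if kernel release $1 is at least version $2.
# Assumes GNU sort, which provides -V (version sort).
kernel_at_least() {
    current=${1%%-*}    # strip suffix: "6.17.0-5-generic" -> "6.17.0"
    required=$2
    # Version-sort both values; the required version must sort first (or be equal).
    lowest=$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n 1)
    [ "$lowest" = "$required" ]
}
```

On a node, this could be used as: kernel_at_least "$(uname -r)" 6.17 || echo "kernel too old for Intel Arc Pro SR-IOV".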
Ensure your Intel Arc GPU is running SR-IOV-enabled firmware by looking for the SR-IOV capability in the lspci output. If the capability is missing, the GPU likely needs a firmware upgrade.
# lspci -s 4a: -v
4a:00.0 VGA compatible controller: Intel Corporation Device e212 (prog-if 00 [VGA controller])
Subsystem: Intel Corporation Device 1114
Flags: bus master, fast devsel, latency 0, IRQ 264, IOMMU group 55
Memory at 20a0c000000 (64-bit, prefetchable) [size=16M]
Memory at 20000000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at 84a00000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
Capabilities: [d0] Power Management version 3
Capabilities: [100] Alternative Routing-ID Interpretation (ARI)
Capabilities: [110] Null
Capabilities: [200] Address Translation Service (ATS)
Capabilities: [420] Physical Resizable BAR
Capabilities: [220] Virtual Resizable BAR
Capabilities: [320] Single Root I/O Virtualization (SR-IOV)
Capabilities: [400] Latency Tolerance Reporting
Kernel driver in use: xe
Kernel modules: xe
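The capability check can also be scripted. The following is a minimal sketch that searches lspci output for the SR-IOV capability line shown above; the function name has_sriov_capability is hypothetical. Note that lspci typically needs root privileges to list capabilities.

```shell
#!/bin/sh
# Hypothetical helper: reads `lspci -v` output on stdin and succeeds
# if the SR-IOV capability line is present.
has_sriov_capability() {
    # -F matches the text literally, -q suppresses output
    grep -qF 'Single Root I/O Virtualization (SR-IOV)'
}
```

For example: lspci -s 4a:00.0 -v | has_sriov_capability || echo "no SR-IOV capability found, check the GPU firmware".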
eXpressWare Installation
When installing eXpressWare, make sure to request installation of SmartIO,
either interactively or by passing the --enable-smartio argument. Please refer
to the installation guide for more details.
Creating the Virtual Functions
The virtual functions are created at runtime by writing the desired number of VFs to
the sriov_numvfs file of the physical function (PF) in sysfs. Once created, the VFs show
up in lspci as additional functions on the same bus as the PF.
# lspci -s 4a:
4a:00.0 VGA compatible controller: Intel Corporation Device e212
# echo 5 > /sys/bus/pci/devices/0000\:4a\:00.0/sriov_numvfs
# lspci -s 4a:
4a:00.0 VGA compatible controller: Intel Corporation Device e212
4a:00.1 VGA compatible controller: Intel Corporation Device e212
4a:00.2 VGA compatible controller: Intel Corporation Device e212
4a:00.3 VGA compatible controller: Intel Corporation Device e212
4a:00.4 VGA compatible controller: Intel Corporation Device e212
4a:00.5 VGA compatible controller: Intel Corporation Device e212
Hint
Prefixing the echo command with sudo is not enough to write to this file, because the redirection is performed by your unprivileged shell. Either use sudo -i to get a full root shell,
or run: echo 5 | sudo tee /sys/bus/pci/devices/0000\:4a\:00.0/sriov_numvfs
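The VF creation step can be wrapped in a small function that validates the request first. This is a minimal sketch under stated assumptions: the function name set_numvfs is hypothetical, and it relies on the standard Linux sysfs attributes sriov_totalvfs (the maximum VF count the device supports) and sriov_numvfs. The kernel rejects a change from one non-zero VF count directly to another, so the sketch resets the count to 0 first.

```shell
#!/bin/sh
# Hypothetical helper: set the number of VFs for a PF.
#   $1: sysfs directory of the PF, e.g. /sys/bus/pci/devices/0000:4a:00.0
#   $2: desired number of VFs
# Must run as root when used on a real device.
set_numvfs() {
    dev_dir=$1
    wanted=$2
    total=$(cat "$dev_dir/sriov_totalvfs") || return 1
    if [ "$wanted" -gt "$total" ]; then
        echo "requested $wanted VFs, but the device supports at most $total" >&2
        return 1
    fi
    current=$(cat "$dev_dir/sriov_numvfs") || return 1
    # Nothing to do if the requested count is already active.
    [ "$current" -eq "$wanted" ] && return 0
    # A non-zero count cannot be changed directly; disable the VFs first.
    if [ "$current" -ne 0 ]; then
        echo 0 > "$dev_dir/sriov_numvfs" || return 1
    fi
    echo "$wanted" > "$dev_dir/sriov_numvfs"
}
```

From a root shell, the example on this page would correspond to: set_numvfs /sys/bus/pci/devices/0000:4a:00.0 5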
Lending the Virtual Functions to the Pool
The virtual functions can be made available to other nodes in the cluster like
any other PCIe device. Note that you cannot lend the physical function (PF)
while SR-IOV is enabled. Each VF that is going to be shared must be added
with smartio_tool add and made available with smartio_tool available.
The lender must also be connected to all the borrowers with
smartio_tool connect. See Lending Local Devices for more details.
# smartio_tool add 4a:00.1
# smartio_tool add 4a:00.2
# smartio_tool add 4a:00.3
# smartio_tool add 4a:00.4
# smartio_tool add 4a:00.5
# smartio_tool available --unbind 4a:00.1
# smartio_tool available --unbind 4a:00.2
# smartio_tool available --unbind 4a:00.3
# smartio_tool available --unbind 4a:00.4
# smartio_tool available --unbind 4a:00.5
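The repeated commands above can be generated with a loop. This is a minimal sketch using only the smartio_tool add and smartio_tool available --unbind invocations shown on this page; the function name lend_vfs and the SMARTIO_TOOL variable are hypothetical. It assumes the VFs appear as functions 1 to N on the same bus and device as the PF, as in the lspci output above, which limits it to at most 7 VFs.

```shell
#!/bin/sh
# Command to run; set SMARTIO_TOOL=echo for a dry run that only prints
# the arguments that would be passed to smartio_tool.
SMARTIO_TOOL=${SMARTIO_TOOL:-smartio_tool}

# Hypothetical helper: add VFs 1..N of a PF and make them available.
#   $1: bus and device of the PF without the function number, e.g. 4a:00
#   $2: number of VFs (1 to 7, since they are assumed to be functions 1..N)
lend_vfs() {
    pf_slot=$1
    num_vfs=$2
    if [ "$num_vfs" -lt 1 ] || [ "$num_vfs" -gt 7 ]; then
        echo "this sketch only handles 1 to 7 VFs" >&2
        return 1
    fi
    # First add every VF, then make each one available, as in the manual steps.
    vf=1
    while [ "$vf" -le "$num_vfs" ]; do
        "$SMARTIO_TOOL" add "$pf_slot.$vf" || return 1
        vf=$((vf + 1))
    done
    vf=1
    while [ "$vf" -le "$num_vfs" ]; do
        "$SMARTIO_TOOL" available --unbind "$pf_slot.$vf" || return 1
        vf=$((vf + 1))
    done
}
```

Running lend_vfs 4a:00 5 as root would issue the same ten commands as listed above.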
Borrowing Devices from the Pool
Nodes can borrow devices from the pool and use them like local devices.
You can list the available devices with smartio_tool list and then
borrow a device with smartio_tool borrow:
# smartio_tool list
44a01: VGA compatible controller Intel Corporation Device e212 [available]
44a02: VGA compatible controller Intel Corporation Device e212 [available]
44a03: VGA compatible controller Intel Corporation Device e212 [available]
44a04: VGA compatible controller Intel Corporation Device e212 [available]
44a05: VGA compatible controller Intel Corporation Device e212 [available]
# smartio_tool borrow 44a01
Name: VGA compatible controller Intel Corporation Device e212
Available: in use
Location: remote
Adapter: 0
NodeId: 4
Remote BDF: 0000:4a:00.1
Physical slot: N/A
Serial Number: 00-00-00-00-00-00-00-00
UUID: 0710779e-9564-46c0-9954-3677413dee79
Vendor ID: 8086
Device ID: e212
Subsystem Vendor ID: 8086
Subsystem Device ID: 1114
Local users: 1
Local virtual device: 0000:05:02.3
Bound to driver: xe
See Using Native Device Drivers for more details.
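To script the borrowing step, the IDs of available devices can be extracted from the smartio_tool list output. This is a minimal sketch that assumes the output format matches the example above (a hexadecimal ID, a colon, a description, and a trailing [available] marker); the actual format may differ between eXpressWare versions. The function name available_device_ids is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartio_tool list` output on stdin and
# prints the ID of every device marked [available], one per line.
available_device_ids() {
    # -n suppresses default output; the p flag prints only matching lines,
    # reduced to the leading hexadecimal ID.
    sed -n 's/^\([0-9A-Fa-f][0-9A-Fa-f]*\): .*\[available\][[:space:]]*$/\1/p'
}
```

For example, a node could borrow the first available device with: id=$(smartio_tool list | available_device_ids | head -n 1) followed by smartio_tool borrow "$id".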