Using Native Device Drivers¶
Note
Borrowing devices using native and unmodified device drivers is only supported on Linux. See System Requirements or consider alternative ways to use the remote device.
SmartIO enables Composable Disaggregated Infrastructure (CDI) by allowing cluster nodes to borrow PCIe devices from other nodes. Device lending allows unmodified kernel device drivers to access remote devices. For example, using device lending, an NVMe drive installed in one node can be borrowed and mounted on another.
Because PCIe drivers expect exclusive control over a device, SmartIO does not allow the same device to be borrowed by multiple cluster nodes concurrently. The exception to this rule is devices with SR-IOV support. These devices spawn Virtual Functions that are designed to be directly assigned to Virtual Machines. SmartIO enables these Virtual Functions to be shared with other cluster nodes.
Device lending builds on top of SmartIO and allows devices to be borrowed and inserted into the local kernel device tree. This will load the native device driver for the device and signal a hot-add event.
This section describes how to use device lending to borrow remote devices from other nodes. Borrowed devices appear to the borrowing system as local, hot-plugged devices. A device must be unused (shown as ‘available’ by smartio_tool list) for the borrow to succeed. A device that is borrowed for device lending may not be borrowed by any other node or SISCI application at the same time.
System Requirements¶
The borrower must be a supported platform.
The borrower must run a supported operating system.
The borrower must have a large enough NTB prefetchable size to map all the BARs of the devices to be borrowed.
Additional System Considerations¶
Enabling IOMMU on the borrower is strongly recommended.
The lender must have a large enough NTB prefetchable size to map the configured DMA window. See DMA Window for details.
Borrowing Devices¶
Borrowing remote devices is performed using smartio_tool borrow. This
command needs the ID of the remote device. This ID can be found using
smartio_tool list (see previous section). The borrow command also
takes a second optional parameter that specifies the DMA window size. The DMA
window size controls the amount of memory a device driver can expose to a
borrowed device at any given time. If the window size is not specified, the
default value will be used. The default size is based on the type of device,
number of devices in the pool and the available mapping space on the lender.
See DMA Window for more details.
$ smartio_tool borrow 80000 512
Name: Non-Volatile memory controller Intel Corporation Device f1a5
Local users: 1
Local virtual device: 0000:04:05.0
Bound to driver: nvme
NVMe namespace: nvme0
The command returns once the node has been granted temporary ownership of the device, but depending on the driver, there may be some additional time before the device is ready for use. If the command is successful, it prints information about the newly borrowed device, as shown above, such as its corresponding local virtual device and the driver that has taken ownership of it. The device can then be used as if it were a local device:
$ ls /dev/nvme0*
/dev/nvme0 /dev/nvme0n1 /dev/nvme0n1p1
$ mount /dev/nvme0n1p1 /mnt
Returning Remote Devices¶
Before returning a device, it is recommended that any local use of the device is stopped in a clean manner. For disk drives, you should unmount any mounted partitions on the drive to be returned. This mirrors the preparation that must be made before a device is set as available:
$ umount /mnt
$ smartio_tool return 80000
PCIe Peer-to-Peer¶
PCIe devices can issue DMA operations directly to other PCIe devices, so-called peer-to-peer (P2P). SmartIO fully supports P2P, except with Fabric Attached Devices. When using SmartIO, P2P must be explicitly enabled using smartio_tool enable-p2p. This command sets up P2P from a source device to a receiver device. P2P is often bidirectional; in that case, you need to run smartio_tool enable-p2p twice, once in each direction. Enabling P2P also requires that the lenders of the devices are connected bidirectionally using smartio_tool connect.
Peer-to-peer (P2P) allows direct memory transfers between the devices (zero-copy).¶
Without P2P, transfers between devices must be “bounced” via RAM. This incurs a latency penalty. In certain topologies where the link to the CPU/RAM is shared, this will also incur a bandwidth reduction.¶
To enable P2P between two borrowed devices:
# On lender with nodeid 4
smartio_tool connect 8 # connect to other lender
smartio_tool connect 12 # connect to borrower
# On lender with nodeid 8
smartio_tool connect 4 # connect to other lender
smartio_tool connect 12 # connect to borrower
# On borrower
smartio_tool borrow A
smartio_tool borrow B
smartio_tool enable-p2p A B # Enable P2P from A to B
smartio_tool enable-p2p B A # Enable P2P from B to A
SmartIO also allows P2P between a local device and a borrowed device, and this must also be enabled. Before we can enable P2P, however, we must add the local device so that we get an fdid for it.
# On borrower
# Make sure the borrower is connected to the lender
smartio_tool connect <lender nodeid>
smartio_tool add <local_bdf>
# The local device will be given an fdid, `<local_fdid>`
smartio_tool list
# Enable P2P from the remote device to the local device
smartio_tool enable-p2p <remote_fdid> <local_fdid>
# Enable P2P from the local device to the remote device
smartio_tool enable-p2p <local_fdid> <remote_fdid>
DMA Window¶
Device drivers and PCIe devices communicate using DMA, allowing the device direct access to memory buffers in system memory. SmartIO allows the same direct access to system memory over the NTB. This allows the device drivers and devices to remain unmodified and ensures optimal performance by avoiding additional copy operations. Borrowed devices access DMA buffers through an NTB DMA window. The DMA window is the mapping that allows the device to DMA through the NTB adapter directly into the RAM of the borrower, giving the device zero-copy access to buffers allocated by the device driver.
The DMA window consumes prefetchable space on the lender's NTB adapter. The size of the window is set automatically by SmartIO when a device is borrowed. Because of the DMA window, we recommend making the NTB prefetchable size on the lender as large as possible. When the IOMMU is enabled on the borrower, the DMA window size can be adjusted depending on the workload.
IOMMU Enabled on the Borrower¶
The DMA window mechanism works differently depending on whether the IOMMU is enabled on the borrower. Enabling the IOMMU allows SmartIO to use a smaller DMA window by using the IOMMU to map DMA buffers on the fly as the driver requests them. In this case, the DMA window size limits the maximum amount of buffers that can be mapped at any given time. The suitable DMA window size depends on the device, the target device driver, and the workload. For example, an idle NVMe drive can work with very little DMA space (e.g. 16 MB), but under heavy load it can need much more (e.g. multiple GB). GPUs will also need a DMA window of multiple GB.
With IOMMU enabled on the borrower, the DMA window can be smaller than the RAM size.¶
Warning
If the DMA window runs out of space at any point, a warning message will be printed to the kernel logs: No room in IOMMU range. Some target device drivers will handle running out of mapping resources gracefully, while others may simply crash or behave incorrectly.
Automatic DMA Window Size¶
The automatic DMA window size selected by SmartIO uses a heuristic that tries to maximize the DMA window size given the devices and the prefetchable space on the lender. It looks at the free prefetchable space on the lender of the device to be borrowed and then considers any other devices on the lender that have been added but are not currently borrowed. The device to be borrowed is then assigned a fair share of the remaining prefetchable space. Devices known to need a large DMA window, like GPUs, are automatically given a larger chunk than other devices. Currently, the calculation does not take into account that the user may want to enable P2P between the devices after borrowing, which can lead to out-of-space errors. For more complex workflows, or if P2P is enabled, it may be wise to calculate and set the DMA window size manually.
Manual DMA Window Size¶
If the automatic DMA window size is unsuitable, you can set the DMA window size
manually, as long as the IOMMU is enabled on the borrower. Generally you want
to make the DMA window as large as possible without running out of space for
DMA windows or other mappings. The size may be set manually with
smartio_tool borrow when the device is borrowed. The command will
fail if insufficient space is available on the lender.
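For instance, reusing the fdid from the earlier example (the second argument is the DMA window size, as in that example; the value here is purely illustrative):
$ smartio_tool borrow 80000 2048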
This step-by-step guide will help you estimate how much of the prefetchable space is available for DMA windows.
1) Find the amount of free prefetchable space on the lender
When a session to a remote node is established, eXpressWare will reserve a configurable chunk of the prefetchable space for mapping remote memory segments on that node. The amount of space reserved can be configured by setting ntb_lut2_switch_max_entries. This setting controls the number of mapping table entries used for each remote node, where each entry is \(\frac{1}{128}\) of the total prefetchable size. It defaults to 12 entries per remote node, which is also the maximum value. As long as the IOMMU is enabled on the borrower, the reserved range will cover any type of exported segment, including SmartIO. Exported segments that fit within the reserved range will not consume additional mapping table entries. The unreserved mapping table entries will be used on demand to map any segment not covered by the reserved space on any node. For larger NTB prefetch sizes and SmartIO use cases, which typically need large mappings for BARs and DMA windows, ntb_lut2_switch_max_entries can be decreased from the default value to free up mapping table entries for on-demand mappings. When calculating the amount of free prefetchable space, we can consider the reserved space “used” since it most likely cannot be used for a large DMA window. Thus, we can calculate the free prefetchable space as follows:
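A sketch of this calculation, using illustrative symbol names (\(\mathrm{prefetch\_size}\), \(\mathrm{num\_sessions}\) and \(\mathrm{free\_prefetch}\) are not terms used by the tools themselves):
\[\mathrm{mapping\_entry\_size} = \frac{\mathrm{prefetch\_size}}{128}\]
\[\mathrm{free\_prefetch} = \mathrm{prefetch\_size} - \mathrm{num\_sessions} \times \mathrm{ntb\_lut2\_switch\_max\_entries} \times \mathrm{mapping\_entry\_size}\]
Here \(\mathrm{num\_sessions}\) is the number of remote nodes the lender has established sessions to.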
2) Calculate the space needed for MSI. Interrupts require a mapping to a fixed physical address and will require one mapping entry per borrower.
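As a sketch, with \(\mathrm{msi\_space}\) and \(\mathrm{num\_borrowers}\) as illustrative names:
\[\mathrm{msi\_space} = \mathrm{num\_borrowers} \times \mathrm{mapping\_entry\_size}\]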
3) Calculate the space needed for P2P and BARs. If you're going to enable P2P from a device on the lender to a device on any other node, we need to account for the required prefetch space. To account for alignment restrictions, we should round up the size of each BAR to \(\mathrm{mapping\_entry\_size}\). If the lender is also going to borrow any devices, subtract the space needed to map the BARs of the devices to be borrowed in the same way as with P2P mappings. A sketch of the arithmetic follows the table below.
| Source device | Destination device | Prefetch space consumption |
|---|---|---|
| device in borrower | device in borrower | No prefetch space used |
| device in lender A | device in lender A | No prefetch space used |
| device in lender A | device in borrower | Uses prefetch space on lender A |
| device in lender A | device in lender B | Uses prefetch space on lender A |
In the same way as the DMA window, P2P mapping to devices not on the same lender uses NTB prefetchable space to map the BARs of the other device.¶
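A sketch of the BAR-mapping cost on one lender, with \(\mathrm{p2p\_space}\) as an illustrative name; the outer sum runs over every device on another node whose BARs this lender must map (P2P destinations on other nodes, plus any devices this lender itself borrows):
\[\mathrm{p2p\_space} = \sum_{\text{devices to map}} \ \sum_{\text{BARs}} \left\lceil \frac{\mathrm{BAR\_size}}{\mathrm{mapping\_entry\_size}} \right\rceil \times \mathrm{mapping\_entry\_size}\]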
4) Split the remaining space into DMA windows. We can now calculate the free space that can be used for DMA windows. This space can be freely distributed to the DMA windows of devices borrowed from this lender in the granularity of \(\mathrm{mapping\_entry\_size}\).
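Putting the previous steps together, again with the illustrative names from above:
\[\mathrm{dma\_window\_space} = \mathrm{free\_prefetch} - \mathrm{msi\_space} - \mathrm{p2p\_space}\]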
Example¶
For example, consider a 3-node cluster where each node has 2 GPUs and 128 GB of prefetchable space, and where one node borrows two GPUs from another node and sets up P2P between the four GPUs:
1) Calculate the free prefetchable space after space is reserved for the session to the other node:
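As an illustration, assume the lender of the GPUs has a session only to the borrowing node and keeps the default of 12 reserved entries (with sessions to both other nodes, the reserved space doubles to 24 GB):
\[\mathrm{mapping\_entry\_size} = \frac{128\,\mathrm{GB}}{128} = 1\,\mathrm{GB}\]
\[\mathrm{free\_prefetch} = 128\,\mathrm{GB} - 1 \times 12 \times 1\,\mathrm{GB} = 116\,\mathrm{GB}\]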
2) Calculate the space needed for MSI.
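With a single borrower, this is one mapping entry:
\[\mathrm{msi\_space} = 1 \times 1\,\mathrm{GB} = 1\,\mathrm{GB}\]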
3) Calculate the space needed for P2P and BARs.
Using lspci we see that the GPUs have 3 memory BARs of size 64M, 32G and 32M:
$ lspci -s 21:00.0 -v
21:00.0 VGA compatible controller: NVIDIA Corporation Device 2c31 (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 2051
Physical Slot: 2
Flags: bus master, fast devsel, latency 0, IRQ 223, NUMA node 0, IOMMU group 73
Memory at bc000000 (32-bit, non-prefetchable) [size=64M]
Memory at 6f800000000 (64-bit, prefetchable) [size=32G]
Memory at 70012000000 (64-bit, prefetchable) [size=32M]
I/O ports at 4000 [size=128]
Expansion ROM at c0000000 [virtual] [disabled] [size=512K]
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
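Rounding each BAR up to the 1 GB mapping entry size gives \(1 + 32 + 1 = 34\,\mathrm{GB}\) per GPU. Assuming, for illustration, that P2P is enabled from the lender's GPUs to the borrower's two local GPUs, the lender must map the BARs of both of those GPUs:
\[\mathrm{p2p\_space} = 2 \times 34\,\mathrm{GB} = 68\,\mathrm{GB}\]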
4) Split the remaining space into DMA windows.
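Continuing with the illustrative numbers above:
\[\mathrm{dma\_window\_space} = 116\,\mathrm{GB} - 1\,\mathrm{GB} - 68\,\mathrm{GB} = 47\,\mathrm{GB}\]
This space can then be distributed between the two borrowed GPUs in 1 GB granularity, for example 23 GB and 24 GB.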
Hint
The DMA window never needs to be larger than the RAM size of the borrower.
IOMMU Disabled on the Borrower¶
If IOMMU is not enabled on the borrower, the lender must map the entire RAM of the borrower to ensure that the device can access all possible DMA buffers.¶
When the IOMMU is disabled, SmartIO is not able to map buffers on the fly, so the DMA window must be large enough to cover the entire RAM of the borrower. Borrowing multiple devices from the same lender consumes no additional prefetchable memory on the lender. On the other hand, when multiple borrowers are borrowing devices from a given lender, the lender must be able to map the entire RAM of all the borrowers. Because of this, we recommend enabling the IOMMU on the borrower, see IOMMU / VT-d.
Warning
When IOMMU is disabled we strongly recommend using the automatic DMA window size. SmartIO will then be able to size the window to cover all of RAM. If the size is picked manually, you must be sure that the device driver never allocates DMA buffers above the DMA window or at least handles mapping errors gracefully.
Hint
When a node has the IOMMU disabled, the ntb_lut2_switch_max_entries reserved range will only cover the preallocated memory range and can only be used for memory segments, not SmartIO. The reserved space will be limited to the size of the preallocated memory (ntb_memory_preallocation_size_mb) on the remote node.