Using Native Device Drivers¶
Note
Borrowing devices using native and unmodified device drivers is only supported on Linux. See System Requirements or consider alternative ways to use the remote device.
SmartIO enables Composable Disaggregated Infrastructure (CDI) by allowing cluster nodes to borrow PCIe devices from other nodes. Device lending allows unmodified kernel device drivers to access remote devices. For example, using device lending, an NVMe drive installed in one node can be borrowed and mounted on another.
Because PCIe drivers expect exclusive control over a device, SmartIO does not allow the same device to be borrowed by multiple cluster nodes concurrently. Devices with SR-IOV support are the exception to this rule. These devices spawn Virtual Functions that are designed to be directly assigned to Virtual Machines. SmartIO enables these Virtual Functions to be shared with other cluster nodes.
Device lending builds on top of SmartIO and allows devices to be borrowed and inserted into the local kernel device tree. This will load the native device driver for the device and signal a hot-add event.
This section describes how to use device lending to borrow remote devices from other nodes. Borrowed devices appear to the borrowing system as local, hot-plugged devices. A device must be unused for the borrow to succeed (shown as ‘available’ in the output of smartio_tool list). A device that is borrowed for device lending may not be borrowed by any other node or SISCI application at the same time.
System Requirements¶
The borrower must be a supported platform.
The borrower must run a supported operating system.
The borrower must have a large enough NTB prefetchable size to map all BARs of the devices to be borrowed.
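To get a rough idea of how much BAR space a device needs, its BAR sizes can be summed from the Linux sysfs resource file on the node where the device is physically installed. The sketch below is an illustration only: the function name bar_total_bytes is hypothetical, it counts only the six standard BARs, and it ignores alignment, so the result should be read as a lower bound rather than the exact prefetchable size SmartIO requires.

```shell
# Sum the sizes of the standard BARs (first six entries) listed in a PCI
# sysfs "resource" file, e.g. /sys/bus/pci/devices/<bdf>/resource.
# Each line of that file holds: <start> <end> <flags>, all in hexadecimal.
# bar_total_bytes is a hypothetical helper, not part of SmartIO.
bar_total_bytes() {
    total=0
    n=0
    while read -r start end flags && [ "$n" -lt 6 ]; do
        n=$((n + 1))
        # Unused BARs are reported as all zeros and contribute nothing.
        if [ $(($end)) -ne 0 ]; then
            total=$((total + $end - $start + 1))
        fi
    done < "$1"
    echo "$total"
}

# Example use on the node that physically hosts the device:
#   bar_total_bytes /sys/bus/pci/devices/<bdf>/resource
```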
Additional System Considerations¶
Enabling IOMMU on the borrower is strongly recommended.
The lender must have a large enough NTB prefetchable size to map the configured DMA window. See DMA window size for details.
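Whether the IOMMU is active on a Linux borrower can be checked before borrowing. The sketch below relies on the kernel populating /sys/kernel/iommu_groups when an IOMMU is in use; the function name iommu_active is hypothetical, and the optional directory argument exists only so the check can be tried against a test directory.

```shell
# Report whether the running kernel has an active IOMMU.
# iommu_active is a hypothetical helper, not part of SmartIO.
iommu_active() {
    groups_dir="${1:-/sys/kernel/iommu_groups}"
    # With an active IOMMU the kernel creates one numbered entry per group.
    [ -d "$groups_dir" ] && [ -n "$(ls -A "$groups_dir" 2>/dev/null)" ]
}

if iommu_active; then
    echo "IOMMU appears to be active"
else
    echo "IOMMU appears to be inactive; check firmware settings and kernel parameters"
fi
```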
Borrowing Devices¶
Borrowing remote devices is performed using smartio_tool borrow. This
command needs the ID of the remote device. This ID can be found using
smartio_tool list (see previous section). The borrow command also
takes a second optional parameter that specifies the DMA window size. The DMA
window size controls the amount of memory a device driver can expose to a
borrowed device at any given time. If the window size is not specified, the
default value will be used. The default size is based on the type of device,
number of devices in the pool and the available mapping space on the lender.
See DMA window size for more details.
$ smartio_tool borrow 80000 512
Name: Non-Volatile memory controller Intel Corporation Device f1a5
Local users: 1
Local virtual device: 0000:04:05.0
Bound to driver: nvme
NVMe namespace: nvme0
The command returns once the node has been granted temporary ownership of the device, but depending on the driver, there may be some additional time before the device is ready for use. If the command is successful, it prints information about the newly borrowed device, as in the output above, for instance its corresponding local virtual device and the driver that has taken ownership of the device. Once the driver is ready, the device can be used like a local one:
$ ls /dev/nvme0*
/dev/nvme0 /dev/nvme0n1 /dev/nvme0n1p1
$ mount /dev/nvme0n1p1 /mnt
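Because the driver may need some time after smartio_tool borrow returns, a script can wait for the expected device node before using it. The following is a minimal sketch, assuming that polling for the device node is an acceptable readiness check for the driver in question; wait_for_path is a hypothetical helper and the 30 second default timeout is arbitrary.

```shell
# Wait until a path exists, polling once per second.
# Usage: wait_for_path PATH [TIMEOUT_SECONDS]
# wait_for_path is a hypothetical helper, not part of SmartIO.
wait_for_path() {
    path="$1"
    timeout="${2:-30}"
    elapsed=0
    while [ ! -e "$path" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out waiting for $path" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Example use with the device ID and paths from the example above:
#   smartio_tool borrow 80000 && wait_for_path /dev/nvme0n1p1 60 && mount /dev/nvme0n1p1 /mnt
```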
Returning Remote Devices¶
Before returning a device it’s recommended that any local use of the device is stopped in a clean manner. For disk drives, you should unmount any mounted partition on the drive to be returned. This mirrors the preparation that must be made before a device is set as available:
$ umount /mnt
$ smartio_tool return 80000
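The two steps can be combined so that a device is never returned while it is still mounted. This is a sketch under stated assumptions: return_device and the SMARTIO_TOOL variable are hypothetical names introduced here, and the mountpoint utility (from util-linux) is assumed to be installed. Setting SMARTIO_TOOL=echo gives a dry run that only prints the command.

```shell
# Command used to talk to SmartIO; overridable so the sketch can be dry-run.
SMARTIO_TOOL="${SMARTIO_TOOL:-smartio_tool}"

# Unmount MOUNTPOINT if something is mounted there, then return device ID.
# Usage: return_device ID MOUNTPOINT
# return_device is a hypothetical helper, not part of SmartIO.
return_device() {
    id="$1"
    mnt="$2"
    if mountpoint -q "$mnt" 2>/dev/null; then
        # Do not return a device that could not be unmounted cleanly.
        umount "$mnt" || return 1
    fi
    "$SMARTIO_TOOL" return "$id"
}

# Example use with the values from the example above:
#   return_device 80000 /mnt
```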
DMA window size¶
Device drivers and PCIe devices communicate using DMA, allowing the device
direct access to memory buffers in system memory. SmartIO allows the same
direct access to system memory over the NTB. This allows the device drivers and
devices to be unmodified and ensures optimal performance by not needing any
additional copy operations. The DMA window is the mapping that lets the device
issue DMA through the NTB adapter directly into the RAM of the borrower, giving
it zero-copy access to buffers allocated by the device driver. The DMA window
consumes the lender’s mapping resources, which are limited by the lender’s
prefetchable space (BAR2). SmartIO sets the size of the window automatically
when a device is borrowed, but the size can also be set manually with
smartio_tool borrow to suit user or driver needs. In general, it’s
recommended that the prefetchable size be configured as large as practically
possible; see Host NTB Adapter Prefetchable Size.
Note
A suitable DMA window size depends on the device, the target device driver, and how the device is used. For example, an idle NVMe drive can work with very little DMA space (e.g. 16 MB), but under heavy load it can need much more (e.g. multiple GB).
Some target device drivers handle running out of mapping resources gracefully, while others may simply crash or behave incorrectly.
The DMA window mechanism works differently depending on whether the IOMMU is
enabled on the borrower.
Enabling the IOMMU allows SmartIO to use a smaller DMA
window by using the IOMMU to map DMA buffers on the fly as the driver requests
them. In this case the DMA window size limits the amount of buffer memory that
can be mapped at any given time. The automatic DMA window size selected by
SmartIO is a heuristic based on the number of devices and BAR size of the
lender. In some cases the DMA size may need to be adjusted depending on the
device. For example, GPUs may require a large DMA window size for some
workloads. If the DMA window runs out of space at any
point, a warning message will be printed to the kernel logs: No room in IOMMU range.
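A quick way to see whether the DMA window has been exhausted is to search the kernel log for that warning. The sketch below counts occurrences in log text read from standard input; count_window_warnings is a hypothetical helper. The text above does not say which node logs the message, so checking the borrower first is an assumption.

```shell
# Count DMA window exhaustion warnings in kernel log text read from stdin.
# count_window_warnings is a hypothetical helper, not part of SmartIO.
count_window_warnings() {
    # -F matches the message as a fixed string; -c prints the number of
    # matching lines (and exits non-zero when that number is 0).
    grep -c -F "No room in IOMMU range"
}

# Example use (reading the kernel log may require root):
#   dmesg | count_window_warnings
```

If the count keeps growing under load, borrowing the device again with a larger DMA window is the natural next step.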
In contrast, when the IOMMU is disabled, SmartIO cannot map buffers on the fly, so the DMA window must be large enough to cover the entire RAM of the borrower. Borrowing multiple devices from the same lender consumes no additional prefetchable memory on the lender. On the other hand, when multiple borrowers borrow devices from a given lender, the lender must be able to map the entire RAM of all borrowers. In general, we recommend enabling the IOMMU on the borrower; see IOMMU / VT-d.
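For the case without IOMMU, the mapping space the lender needs can be estimated by adding up the RAM of every borrower. The sketch below is an estimate only: mem_total_kib and required_window_gib are hypothetical helpers, and MemTotal in /proc/meminfo is slightly smaller than the installed RAM, so the result should be treated as a lower bound.

```shell
# Print total RAM in KiB from a meminfo-style file (default: /proc/meminfo).
# Both functions are hypothetical helpers, not part of SmartIO.
mem_total_kib() {
    awk '/^MemTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Sum RAM sizes given in KiB (one argument per borrower) and print the
# total in GiB, rounded up.
required_window_gib() {
    total=0
    for kib in "$@"; do
        total=$((total + kib))
    done
    echo $(( (total + 1048575) / 1048576 ))
}

# Example: run mem_total_kib on each borrower, then on any machine:
#   required_window_gib <borrower1_kib> <borrower2_kib>
```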
PCIe peer-to-peer¶
PCIe devices can issue DMA operations directly to other PCIe devices, so-called
peer-to-peer (P2P). SmartIO supports this in some topologies and
configurations. When using SmartIO, P2P must be explicitly enabled using
smartio_tool enable-p2p. This command sets up P2P from a source device
to a receiver device. P2P is often bidirectional, in which case you need to
run smartio_tool enable-p2p twice, once in each direction. Enabling P2P also
requires that the lenders of the devices are connected bidirectionally using
smartio_tool connect.
# On lender with nodeid 4
smartio_tool connect 8 # connect to other lender
smartio_tool connect 12 # connect to borrower
# On lender with nodeid 8
smartio_tool connect 4 # connect to other lender
smartio_tool connect 12 # connect to borrower
# On borrower
smartio_tool borrow A
smartio_tool borrow B
smartio_tool enable-p2p A B # Enable P2P from A to B
smartio_tool enable-p2p B A # Enable P2P from B to A
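Since bidirectional P2P always takes two enable-p2p calls, the pair can be wrapped in a small helper. This is a sketch only: enable_p2p_both and the SMARTIO_TOOL variable are hypothetical names introduced here, and the helper uses nothing beyond the enable-p2p invocation shown above. Setting SMARTIO_TOOL=echo gives a dry run that prints the commands instead of executing them.

```shell
# Command used to talk to SmartIO; overridable so the sketch can be dry-run.
SMARTIO_TOOL="${SMARTIO_TOOL:-smartio_tool}"

# Enable P2P in both directions between two devices.
# Usage: enable_p2p_both FDID_A FDID_B
# enable_p2p_both is a hypothetical helper, not part of SmartIO.
enable_p2p_both() {
    # Stop after the first direction if it fails.
    "$SMARTIO_TOOL" enable-p2p "$1" "$2" &&
    "$SMARTIO_TOOL" enable-p2p "$2" "$1"
}

# Example use on the borrower, after both devices have been borrowed:
#   enable_p2p_both A B
```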
Enabling peer-to-peer between one local and one borrowed device¶
SmartIO also allows P2P between a local device and a borrowed device; this must also be enabled explicitly. Before P2P can be enabled, however, the local device must be added so that it is assigned an fdid.
# On borrower
# Make sure the borrower is connected to the lender
smartio_tool connect <lender nodeid>
smartio_tool add <local_bdf>
# The local device will be given an fdid, `<local_fdid>`
smartio_tool list
# Enable P2P from the remote device to the local device
smartio_tool enable-p2p <remote_fdid> <local_fdid>
# Enable P2P from the local device to the remote device
smartio_tool enable-p2p <local_fdid> <remote_fdid>