Using Native Device Drivers

Note

Borrowing devices using native and unmodified device drivers is only supported on Linux. See System Requirements or consider alternative ways to use the remote device.

SmartIO enables Composable Disaggregated Infrastructure (CDI) by allowing cluster nodes to borrow PCIe devices from other nodes. Device lending allows unmodified kernel device drivers to access remote devices. For example, using device lending, an NVMe drive installed in one node can be borrowed and mounted on another.

Because PCIe drivers expect exclusive control over a device, SmartIO does not allow the same device to be borrowed by multiple cluster nodes concurrently. The exception to this rule is devices with SR-IOV support: these devices spawn Virtual Functions that are designed to be directly assigned to Virtual Machines, and SmartIO enables these Virtual Functions to be shared with other cluster nodes.

Device lending builds on top of SmartIO and allows devices to be borrowed and inserted into the local kernel device tree. This will load the native device driver for the device and signal a hot-add event.

This section describes how to use device lending to borrow remote devices from other nodes. Borrowed devices appear to the borrowing system as local, hot-plugged devices. The device must be unused for the borrow to succeed (shown as ‘available’ in the list). A device that is borrowed for device lending may not be borrowed by any other node or SISCI application at the same time.

System Requirements

Additional System Considerations

Borrowing Devices

Borrowing remote devices is performed using smartio_tool borrow. This command needs the ID of the remote device, which can be found using smartio_tool list (see the previous section). The borrow command also takes an optional second parameter that specifies the DMA window size. The DMA window size controls the amount of memory a device driver can expose to a borrowed device at any given time. If the window size is not specified, a default value is used, based on the type of device, the number of devices in the pool, and the available mapping space on the lender. See DMA Window for more details.

$ smartio_tool borrow 80000 512
Name: Non-Volatile memory controller Intel Corporation Device f1a5
Local users: 1
Local virtual device: 0000:04:05.0
Bound to driver: nvme
NVMe namespace: nvme0

The command returns once the node has been granted temporary ownership of the device, but depending on the driver, there may be some additional delay before the device is ready for use. If the command is successful, it prints information about the newly borrowed device, such as its corresponding local virtual device and the driver that has taken ownership of the device. The device can then be used as if it were local:

$ ls /dev/nvme0*
/dev/nvme0  /dev/nvme0n1  /dev/nvme0n1p1
$ mount /dev/nvme0n1p1 /mnt

Returning Remote Devices

Before returning a device, it is recommended to stop any local use of the device in a clean manner. For disk drives, unmount any mounted partitions on the drive to be returned. This mirrors the preparation that must be made before a device is set as available:

$ umount /mnt
$ smartio_tool return 80000

PCIe Peer-to-Peer

PCIe devices can issue DMA operations directly to other PCIe devices, so-called peer-to-peer (P2P). SmartIO fully supports P2P, except with Fabric Attached Devices. When using SmartIO, P2P must be explicitly enabled using smartio_tool enable-p2p. This command sets up P2P from a source device to a receiver device. P2P is often needed in both directions; in that case, run smartio_tool enable-p2p twice, once for each direction. Enabling P2P also requires that the lenders of the devices are connected bidirectionally using smartio_tool connect.

../../_images/smartio-p2p.svg

Peer-to-peer (P2P) allows direct memory transfers between the devices (zero-copy).

../../_images/smartio-p2p1.svg

Without P2P, transfers between devices must be “bounced” via RAM. This incurs a latency penalty. In certain topologies where the link to the CPU/RAM is shared, it also reduces bandwidth.

To enable P2P between two borrowed devices:

# On lender with nodeid 4
smartio_tool connect 8  # connect to other lender
smartio_tool connect 12 # connect to borrower

# On lender with nodeid 8
smartio_tool connect 4  # connect to other lender
smartio_tool connect 12 # connect to borrower

# On borrower
smartio_tool borrow A
smartio_tool borrow B
smartio_tool enable-p2p A B # Enable P2P from A to B
smartio_tool enable-p2p B A # Enable P2P from B to A

SmartIO also allows P2P between a local device and a borrowed device; this must also be enabled explicitly. Before P2P can be enabled, however, the local device must be added so that it is assigned an fdid.

# On borrower
# Make sure the borrower is connected to the lender
smartio_tool connect <lender nodeid>

smartio_tool add <local_bdf>

# The local device will be given an fdid, `<local_fdid>`
smartio_tool list

# Enable P2P from the remote device to the local device
smartio_tool enable-p2p <remote_fdid> <local_fdid>

# Enable P2P from the local device to the remote device
smartio_tool enable-p2p <local_fdid> <remote_fdid>

DMA Window

Device drivers and PCIe devices communicate using DMA, giving the device direct access to memory buffers in system memory. SmartIO provides the same direct access to system memory over the NTB, so device drivers and devices can remain unmodified, and performance stays optimal because no additional copy operations are needed. Borrowed devices access DMA buffers through an NTB DMA window: the mapping that lets the device DMA through the NTB adapter directly into the RAM of the borrower. This gives the device direct, zero-copy access to buffers allocated by the device driver.

The DMA window consumes prefetchable space on the lender's NTB adapter. The size of the window is set automatically by SmartIO when a device is borrowed. Because of the DMA window, we recommend that the NTB prefetchable size on the lender is as large as possible. When the IOMMU is enabled on the borrower, the DMA window size can be adjusted depending on the workload.

IOMMU Enabled on the Borrower

The DMA window mechanism works differently depending on whether the IOMMU is enabled on the borrower. Enabling the IOMMU allows SmartIO to use a smaller DMA window by using the IOMMU to map DMA buffers on the fly as the driver requests them. In this case, the DMA window size limits the maximum amount of buffers that can be mapped at any given time. A suitable DMA window size depends on the device, the target device driver, and the workload. For example, an idle NVMe drive can work with very little DMA space (e.g. 16 MB), but under heavy load it can need much more (e.g. multiple GBs). GPUs also need multiple GBs of DMA window.

../../_images/smartio-dma-window.svg

With IOMMU enabled on the borrower the DMA window can be smaller than the RAM size.

Warning

If the DMA window runs out of space at any point, a warning message will be printed to the kernel logs: No room in IOMMU range. Some target device drivers handle running out of mapping resources gracefully, while others may crash or behave incorrectly.

Automatic DMA Window Size

The automatic DMA window size selected by SmartIO uses a heuristic that tries to maximize the DMA window size given the devices and the prefetchable space on the lender. It looks at the free prefetchable space on the lender of the device to be borrowed, and then considers any other devices on the lender that have been added but are not currently borrowed. The device to be borrowed is then assigned a fair share of the remaining prefetchable space. Devices known to need a large DMA window, such as GPUs, are automatically given a larger chunk than other devices. Currently, the calculation does not take into account that the user may want to enable P2P between the devices after borrowing, which can lead to out-of-space errors. For more complex workflows, or if P2P is enabled, it may be wise to calculate and set the DMA window size manually.
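As an illustration only, a fair-share split along these lines might look like the following shell sketch. The weighting (GPUs counted twice) and all input values are assumptions chosen for illustration, not SmartIO's actual heuristic:

```shell
# Illustrative fair-share split of remaining prefetchable space (GiB).
# NOT SmartIO's actual heuristic; the weight and input values are assumed.
FREE_PREFETCH=34   # space left for DMA windows on the lender (assumed)
GPUS=1             # devices assumed to need a large window (weight 2)
OTHERS=1           # other added-but-unborrowed devices (weight 1)

SHARES=$((GPUS * 2 + OTHERS))
PER_SHARE=$((FREE_PREFETCH / SHARES))
echo "GPU window: $((PER_SHARE * 2)) GiB, other device: ${PER_SHARE} GiB"
# prints: GPU window: 22 GiB, other device: 11 GiB
```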

Manual DMA Window Size

If the automatic DMA window size is unsuitable, you can set the DMA window size manually, as long as the IOMMU is enabled on the borrower. Generally, you want to make the DMA window as large as possible without running out of space for other DMA windows or mappings. The size may be set manually with smartio_tool borrow when the device is borrowed. The command will fail if insufficient space is available on the lender.

This step-by-step guide will help you estimate how much of the prefetchable space is available for DMA windows.

1) Find the amount of free prefetchable space on the lender

When a session to a remote node is established, eXpressWare reserves a configurable chunk of the prefetchable space for mapping remote memory segments on that node. The amount of space reserved can be configured by setting ntb_lut2_switch_max_entries. This setting controls the number of mapping table entries used for each remote node, where each entry is \(\frac{1}{128}\) of the total prefetchable size. It defaults to 12 entries per remote node, which is also the maximum value.

As long as the IOMMU is enabled on the borrower, the reserved range will cover any type of exported segment, including SmartIO. Exported segments that fit within the reserved range will not consume additional mapping table entries. The unreserved mapping table entries are used on demand to map any segment, on any node, that is not covered by the reserved space. For larger NTB prefetch sizes and SmartIO use-cases, which typically need large mappings for BARs and DMA windows, ntb_lut2_switch_max_entries can be decreased from the default value to free up mapping table entries for on-demand mappings.

When calculating the amount of free prefetchable space, we can consider the reserved space “used”, since it most likely cannot be used for a large DMA window. Thus, we can calculate the free prefetchable space as follows:

\[\begin{split}\mathrm{mapping\_entry\_size} &= \frac{\mathrm{prefetch_{total}}}{128} \\ \mathrm{reserved} &= \mathrm{mapping\_entry\_size} \cdot \mathrm{ntb\_lut2\_switch\_max\_entries} \\ \mathrm{prefetch_{ondemand}} &= \mathrm{prefetch_{total}} - \mathrm{reserved} \cdot (\mathrm{number\_of\_nodes} - 1)\end{split}\]

2) Calculate the space needed for MSI. Interrupts require a mapping to a fixed physical address and will require one mapping entry per borrower.

\[\mathrm{prefetch_{msi}} = \mathrm{mapping\_entry\_size} \cdot \mathrm{number\_of\_borrowers}\]

3) Calculate the space needed for P2P and BARs. If you are going to enable P2P from a device on the lender to a device on any other node, you need to account for the required prefetch space. To account for alignment restrictions, round the size of each BAR up to \(\mathrm{mapping\_entry\_size}\). If the lender is also going to borrow any devices, subtract the space needed to map the BARs of the devices to be borrowed, in the same way as with P2P mappings.

\[\mathrm{prefetch}_{BARs} = \sum_{d=1}^{N} \sum_{b=0}^{5} \max( \mathrm{mapping\_entry\_size} , \mathrm{BAR}_{d,b} )\]

Source device        Destination device   Prefetch space consumption
-------------------  -------------------  -------------------------------
device in borrower   device in borrower   No prefetch space used
device in lender A   device in lender A   No prefetch space used
device in lender A   device in borrower   Uses prefetch space on lender A
device in lender A   device in lender B   Uses prefetch space on lender A

../../_images/smartio-dma-window1.svg

In the same way as the DMA window, P2P mapping to devices not on the same lender uses NTB prefetchable space to map the BARs of the other device.
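As an illustration of the round-up rule in step 3, the per-device BAR space can be sketched in shell. The values assume a 1 GiB mapping entry size and one device with the BAR sizes from the lspci example later in this section (64M, 32G and 32M):

```shell
# Sketch: per-device prefetch space for P2P BAR mappings.
# Assumes mapping_entry_size = 1 GiB and memory BARs of 64M, 32G and 32M.
ENTRY_MIB=1024
total_mib=0
for bar_mib in 64 32768 32; do
    # Round each BAR up to a multiple of the mapping entry size.
    rounded=$(( (bar_mib + ENTRY_MIB - 1) / ENTRY_MIB * ENTRY_MIB ))
    total_mib=$((total_mib + rounded))
done
echo "prefetch_BARs per device: $((total_mib / 1024)) GiB"
# prints: prefetch_BARs per device: 34 GiB
```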

4) Split the remaining space into DMA windows. We can now calculate the free space that can be used for DMA windows. This space can be freely distributed to the DMA windows of devices borrowed from this lender in the granularity of \(\mathrm{mapping\_entry\_size}\).

\[\mathrm{prefetch}_{free} = \mathrm{prefetch}_{ondemand} - \mathrm{prefetch}_{msi} - \mathrm{prefetch}_{BARs}\]

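The four steps above can be combined into a quick shell estimate. This is only a sketch; all input values are assumptions matching the worked example that follows (128 GiB prefetchable space, 3 nodes, 2 borrowers with MSI mappings, 68 GiB of BAR mappings):

```shell
# Sketch: estimate the prefetchable space left for DMA windows (sizes in GiB).
# All input values are assumptions taken from the worked example below.
PREFETCH_TOTAL=128   # total NTB prefetchable space on the lender
MAX_ENTRIES=12       # ntb_lut2_switch_max_entries
NODES=3              # nodes in the cluster
BORROWERS=2          # borrowers needing an MSI mapping entry
PREFETCH_BARS=68     # BAR mappings for P2P, each BAR rounded up

ENTRY=$((PREFETCH_TOTAL / 128))                        # mapping_entry_size
RESERVED=$((ENTRY * MAX_ENTRIES))                      # reserved per remote node
ONDEMAND=$((PREFETCH_TOTAL - RESERVED * (NODES - 1)))  # prefetch_ondemand
MSI=$((ENTRY * BORROWERS))                             # prefetch_msi
FREE=$((ONDEMAND - MSI - PREFETCH_BARS))               # prefetch_free
echo "prefetch_free: ${FREE} GiB"
# prints: prefetch_free: 34 GiB
```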
Example

For example, with a 3-node cluster where each node has 2 GPUs and 128 GiB of prefetchable space, and one node borrows two GPUs from another node and sets up P2P between the four GPUs:

1) Calculate the free prefetchable space after space is reserved for the sessions to the other nodes:

\[\begin{split}\mathrm{mapping\_entry\_size} &= 1\mathrm{GiB} \\ \\ \mathrm{reserved} &= 1\mathrm{GiB} \cdot 12 \\ \mathrm{reserved} &= 12\mathrm{GiB} \\ \\ \mathrm{prefetch_{ondemand}} &= 128\mathrm{GiB} - 2 \cdot 12\mathrm{GiB} \\ \mathrm{prefetch_{ondemand}} &= 104\mathrm{GiB}\end{split}\]

2) Calculate the space needed for MSI.

\[\begin{split}\mathrm{prefetch_{msi}} &= 1\mathrm{GiB} \cdot 2 \\ \mathrm{prefetch_{msi}} &= 2\mathrm{GiB}\end{split}\]

3) Calculate the space needed for P2P and BARs. Using lspci we see that the GPUs have 3 memory BARs of size 64M, 32G and 32M:

$ lspci -s 21:00.0 -v
21:00.0 VGA compatible controller: NVIDIA Corporation Device 2c31 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device 2051
        Physical Slot: 2
        Flags: bus master, fast devsel, latency 0, IRQ 223, NUMA node 0, IOMMU group 73
        Memory at bc000000 (32-bit, non-prefetchable) [size=64M]
        Memory at 6f800000000 (64-bit, prefetchable) [size=32G]
        Memory at 70012000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 4000 [size=128]
        Expansion ROM at c0000000 [virtual] [disabled] [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
\[\begin{split}\mathrm{prefetch}_{BARs} &= 2 \cdot (1\mathrm{GiB} + 32\mathrm{GiB} + 1\mathrm{GiB}) \\ \mathrm{prefetch}_{BARs} &= 68\mathrm{GiB}\end{split}\]

4) Split the remaining space into DMA windows.

\[\begin{split}\mathrm{prefetch_{free}} &= \mathrm{prefetch_{ondemand}} - \mathrm{prefetch_{msi}} - \mathrm{prefetch_{BARs}} \\ \mathrm{prefetch_{free}} &= 104\mathrm{GiB} - 2\mathrm{GiB} - 68\mathrm{GiB} \\ \mathrm{prefetch_{free}} &= 34\mathrm{GiB} \\ \\ \mathrm{dma\_window\_size} &= 17\mathrm{GiB} \\\end{split}\]

Hint

The DMA window never needs to be larger than the RAM size of the borrower.

IOMMU Disabled on the Borrower

../../_images/smartio-dma-window2.svg

If IOMMU is not enabled on the borrower, the lender must map the entire RAM of the borrower to ensure that the device can access all possible DMA buffers.

When the IOMMU is disabled, SmartIO is not able to map buffers on the fly, so the DMA window must be large enough to cover the entire RAM of the borrower. Borrowing multiple devices from the same lender consumes no additional prefetchable memory on the lender. On the other hand, when multiple borrowers borrow devices from a given lender, the lender must be able to map the entire RAM of all borrowers. Because of this, we recommend enabling the IOMMU on the borrower, see IOMMU / VT-d.

Warning

When the IOMMU is disabled, we strongly recommend using the automatic DMA window size; SmartIO will then size the window to cover all of RAM. If the size is picked manually, you must be sure that the device driver never allocates DMA buffers above the DMA window, or at least that it handles mapping errors gracefully.

Hint

When a node has the IOMMU disabled, the ntb_lut2_switch_max_entries reserved range will only cover the preallocated memory range and can only be used for memory segments, not SmartIO. The reserved space will be limited to the size of the preallocated memory (ntb_memory_preallocation_size_mb) on the remote node.