Using Native Device Drivers

Note

Borrowing devices using native and unmodified device drivers is only supported on Linux. See System Requirements or consider alternative ways to use the remote device.

SmartIO enables Composable Disaggregated Infrastructure (CDI) by allowing cluster nodes to borrow PCIe devices from other nodes. Device lending allows unmodified kernel device drivers to access remote devices. For example, using device lending, an NVMe drive installed in one node can be borrowed and mounted on another.

Because PCIe drivers expect exclusive control over a device, SmartIO does not allow the same device to be borrowed by multiple cluster nodes concurrently. The exception to this rule is devices with SR-IOV support: these devices spawn Virtual Functions that are designed to be directly assigned to Virtual Machines, and SmartIO enables these Virtual Functions to be shared with other cluster nodes.

Device lending builds on top of SmartIO and allows devices to be borrowed and inserted into the local kernel device tree. This will load the native device driver for the device and signal a hot-add event.

This section describes how to use device lending to borrow remote devices from other nodes. Borrowed devices appear to the borrowing system as local, hot-plugged devices. The device must be unused for the borrow to succeed (shown as ‘available’ in the list). A device that is borrowed for device lending may not be borrowed by any other node or SISCI application at the same time.

System Requirements

Additional System Considerations

Borrowing Devices

Borrowing remote devices is performed using smartio_tool borrow. This command needs the ID of the remote device, which can be found using smartio_tool list (see the previous section). The borrow command also takes an optional second parameter that specifies the DMA window size. The DMA window size controls the amount of memory a device driver can expose to a borrowed device at any given time. If the window size is not specified, a default value is used, based on the type of device, the number of devices in the pool, and the available mapping space on the lender. See DMA Window for more details.

$ smartio_tool borrow 80000 512
Name: Non-Volatile memory controller Intel Corporation Device f1a5
Local users: 1
Local virtual device: 0000:04:05.0
Bound to driver: nvme
NVMe namespace: nvme0

The command returns once the node has been granted temporary ownership of the device, but depending on the driver, there may be some additional delay before the device is ready for use. If the command is successful, it prints information about the newly borrowed device, such as its corresponding local virtual device and the driver that has taken ownership of the device. The device can then be used as if it were local:

$ ls /dev/nvme0*
/dev/nvme0  /dev/nvme0n1  /dev/nvme0n1p1
$ mount /dev/nvme0n1p1 /mnt

Returning Remote Devices

Before returning a device, it is recommended to stop any local use of the device in a clean manner. For disk drives, unmount any mounted partitions on the drive to be returned. This mirrors the preparation that must be made before a device is set as available:

$ umount /mnt
$ smartio_tool return 80000

PCIe Peer-to-Peer

PCIe devices can issue DMA operations directly to other PCIe devices, so-called peer-to-peer (P2P). SmartIO fully supports P2P, except with Fabric Attached Devices. When using SmartIO, P2P must be explicitly enabled using smartio_tool enable-p2p. This command sets up P2P from a source device to a receiver device. P2P is often needed in both directions; in that case, run smartio_tool enable-p2p twice, once for each direction. Enabling P2P also requires that the lenders of the devices are connected bidirectionally using smartio_tool connect.

../../_images/smartio-p2p.svg

Peer-to-peer (P2P) allows direct memory transfers between the devices (zero-copy).

../../_images/smartio-p2p1.svg

Without P2P, transfers between devices must be “bounced” via RAM. This incurs a latency penalty. In certain topologies where the link to the CPU/RAM is shared, it also reduces bandwidth.

To enable P2P between two borrowed devices:

# On lender with nodeid 4
smartio_tool connect 8  # connect to other lender
smartio_tool connect 12 # connect to borrower

# On lender with nodeid 8
smartio_tool connect 4  # connect to other lender
smartio_tool connect 12 # connect to borrower

# On borrower
smartio_tool borrow A
smartio_tool borrow B
smartio_tool enable-p2p A B # Enable P2P from A to B
smartio_tool enable-p2p B A # Enable P2P from B to A

SmartIO also allows P2P between a local device and a borrowed device; this must also be enabled explicitly. Before P2P can be enabled, however, the local device must be added so that it is assigned an fdid.

# On borrower
# Make sure the borrower is connected to the lender
smartio_tool connect <lender nodeid>

smartio_tool add <local_bdf>

# The local device will be given an fdid, `<local_fdid>`
smartio_tool list

# Enable P2P from the remote device to the local device
smartio_tool enable-p2p <remote_fdid> <local_fdid>

# Enable P2P from the local device to the remote device
smartio_tool enable-p2p <local_fdid> <remote_fdid>

DMA Window

Device drivers and PCIe devices communicate using DMA, giving the device direct access to memory buffers in system memory. SmartIO provides the same direct access to system memory over the NTB, so device drivers and devices can remain unmodified, and performance stays optimal because no additional copy operations are needed. Borrowed devices access DMA buffers through an NTB DMA window: the mapping that lets the device DMA through the NTB adapter directly into the RAM of the borrower. This gives the device direct, zero-copy access to buffers allocated by the device driver.

The DMA window consumes prefetchable space on the lender's NTB adapter. The size of the window is set automatically by SmartIO when a device is borrowed. Because of the DMA window, we recommend that the NTB prefetchable size on the lender is as large as possible. When the IOMMU is enabled on the borrower, the DMA window size can be adjusted depending on the workload.

IOMMU Enabled on the Borrower

The DMA window mechanism works differently depending on whether the IOMMU is enabled on the borrower. Enabling the IOMMU allows SmartIO to use a smaller DMA window by using the IOMMU to map DMA buffers on the fly as the driver requests them. In this case, the DMA window size limits the maximum amount of buffers that can be mapped at any given time. A suitable DMA window size depends on the device, the target device driver, and the workload. For example, an idle NVMe drive can work with very little DMA space (e.g. 16 MB), but under heavy load it can need much more (e.g. multiple GBs). GPUs also need multiple GBs of DMA window.

../../_images/smartio-dma-window.svg

With IOMMU enabled on the borrower the DMA window can be smaller than the RAM size.

Warning

If the DMA window runs out of space at any point, a warning message will be printed to the kernel logs: No room in IOMMU range. Some target device drivers handle running out of mapping resources gracefully, while others may crash or behave incorrectly.

Automatic DMA Window Size

The automatic DMA window size selected by SmartIO uses a heuristic that tries to maximize the DMA window size given the devices and the prefetchable space on the lender. It looks at the free prefetchable space on the lender of the device to be borrowed, and then considers any other devices on the lender that have been added but are not currently borrowed. The device to be borrowed is then assigned a fair share of the remaining prefetchable space. Devices known to need a large DMA window, such as GPUs, are automatically given a larger chunk than other devices. Currently, the calculation does not take into account that the user may want to enable P2P between the devices after borrowing, which can lead to out-of-space errors. For more complex workflows, or if P2P is enabled, it may be wise to calculate and set the DMA window size manually.
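As an illustration only, a fair-share split along these lines might look like the following shell sketch. The weighting (GPUs counted twice) and all input values are assumptions chosen for illustration, not SmartIO's actual heuristic:

```shell
# Illustrative fair-share split of remaining prefetchable space (GiB).
# NOT SmartIO's actual heuristic; the weight and input values are assumed.
FREE_PREFETCH=34   # space left for DMA windows on the lender (assumed)
GPUS=1             # devices assumed to need a large window (weight 2)
OTHERS=1           # other added-but-unborrowed devices (weight 1)

SHARES=$((GPUS * 2 + OTHERS))
PER_SHARE=$((FREE_PREFETCH / SHARES))
echo "GPU window: $((PER_SHARE * 2)) GiB, other device: ${PER_SHARE} GiB"
# prints: GPU window: 22 GiB, other device: 11 GiB
```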

Manual DMA Window Size

If the automatic DMA window size is unsuitable, you can set the DMA window size manually, as long as the IOMMU is enabled on the borrower. Generally, you want to make the DMA window as large as possible without running out of space for other DMA windows or mappings. The size may be set manually with smartio_tool borrow when the device is borrowed. The command will fail if insufficient space is available on the lender.

This step-by-step guide will help you estimate how much of the prefetchable space is available for DMA windows.

1) Find the amount of free prefetchable space on the lender

When a session to a remote node is established, eXpressWare reserves a configurable chunk of the prefetchable space for mapping remote memory segments on that node. The amount of space reserved can be configured by setting ntb_lut2_switch_max_entries. This setting controls the number of mapping table entries used for each remote node, where each entry is \(\frac{1}{128}\) of the total prefetchable size. It defaults to 12 entries per remote node, which is also the maximum value.

As long as the IOMMU is enabled on the borrower, the reserved range will cover any type of exported segment, including SmartIO. Exported segments that fit within the reserved range will not consume additional mapping table entries. The unreserved mapping table entries are used on demand to map any segment, on any node, that is not covered by the reserved space. For larger NTB prefetch sizes and SmartIO use-cases, which typically need large mappings for BARs and DMA windows, ntb_lut2_switch_max_entries can be decreased from the default value to free up mapping table entries for on-demand mappings.

When calculating the amount of free prefetchable space, we can consider the reserved space “used”, since it most likely cannot be used for a large DMA window. Thus, we can calculate the free prefetchable space as follows:

\[\begin{split}\mathrm{mapping\_entry\_size} &= \frac{\mathrm{prefetch_{total}}}{128} \\ \mathrm{reserved} &= \mathrm{mapping\_entry\_size} \cdot \mathrm{ntb\_lut2\_switch\_max\_entries} \\ \mathrm{prefetch_{ondemand}} &= \mathrm{prefetch_{total}} - \mathrm{reserved} \cdot (\mathrm{number\_of\_nodes} - 1)\end{split}\]

2) Calculate the space needed for MSI. Interrupts require a mapping to a fixed physical address and will require one mapping entry per borrower.

\[\mathrm{prefetch_{msi}} = \mathrm{mapping\_entry\_size} \cdot \mathrm{number\_of\_borrowers}\]

3) Calculate the space needed for P2P and BARs. If you are going to enable P2P from a device on the lender to a device on any other node, you need to account for the required prefetch space. To account for alignment restrictions, round the size of each BAR up to \(\mathrm{mapping\_entry\_size}\). If the lender is also going to borrow any devices, subtract the space needed to map the BARs of the devices to be borrowed, in the same way as with P2P mappings.

\[\mathrm{prefetch}_{BARs} = \sum_{d=1}^{N} \sum_{b=0}^{5} \max( \mathrm{mapping\_entry\_size} , \mathrm{BAR}_{d,b} )\]

Source device        Destination device   Prefetch space consumption
-------------------  -------------------  -------------------------------
device in borrower   device in borrower   No prefetch space used
device in lender A   device in lender A   No prefetch space used
device in lender A   device in borrower   Uses prefetch space on lender A
device in lender A   device in lender B   Uses prefetch space on lender A

../../_images/smartio-dma-window1.svg

In the same way as the DMA window, P2P mapping to devices not on the same lender uses NTB prefetchable space to map the BARs of the other device.
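As an illustration of the round-up rule in step 3, the per-device BAR space can be sketched in shell. The values assume a 1 GiB mapping entry size and one device with the BAR sizes from the lspci example later in this section (64M, 32G and 32M):

```shell
# Sketch: per-device prefetch space for P2P BAR mappings.
# Assumes mapping_entry_size = 1 GiB and memory BARs of 64M, 32G and 32M.
ENTRY_MIB=1024
total_mib=0
for bar_mib in 64 32768 32; do
    # Round each BAR up to a multiple of the mapping entry size.
    rounded=$(( (bar_mib + ENTRY_MIB - 1) / ENTRY_MIB * ENTRY_MIB ))
    total_mib=$((total_mib + rounded))
done
echo "prefetch_BARs per device: $((total_mib / 1024)) GiB"
# prints: prefetch_BARs per device: 34 GiB
```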

4) Split the remaining space into DMA windows. We can now calculate the free space that can be used for DMA windows. This space can be freely distributed to the DMA windows of devices borrowed from this lender in the granularity of \(\mathrm{mapping\_entry\_size}\).

\[\mathrm{prefetch}_{free} = \mathrm{prefetch}_{ondemand} - \mathrm{prefetch}_{msi} - \mathrm{prefetch}_{BARs}\]

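The four steps above can be combined into a quick shell estimate. This is only a sketch; all input values are assumptions matching the worked example that follows (128 GiB prefetchable space, 3 nodes, 2 borrowers with MSI mappings, 68 GiB of BAR mappings):

```shell
# Sketch: estimate the prefetchable space left for DMA windows (sizes in GiB).
# All input values are assumptions taken from the worked example below.
PREFETCH_TOTAL=128   # total NTB prefetchable space on the lender
MAX_ENTRIES=12       # ntb_lut2_switch_max_entries
NODES=3              # nodes in the cluster
BORROWERS=2          # borrowers needing an MSI mapping entry
PREFETCH_BARS=68     # BAR mappings for P2P, each BAR rounded up

ENTRY=$((PREFETCH_TOTAL / 128))                        # mapping_entry_size
RESERVED=$((ENTRY * MAX_ENTRIES))                      # reserved per remote node
ONDEMAND=$((PREFETCH_TOTAL - RESERVED * (NODES - 1)))  # prefetch_ondemand
MSI=$((ENTRY * BORROWERS))                             # prefetch_msi
FREE=$((ONDEMAND - MSI - PREFETCH_BARS))               # prefetch_free
echo "prefetch_free: ${FREE} GiB"
# prints: prefetch_free: 34 GiB
```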
Example

For example, with a 3-node cluster where each node has 2 GPUs and 128 GiB of prefetchable space, and one node borrows two GPUs from another node and sets up P2P between the four GPUs:

1) Calculate the free prefetchable space after space is reserved for the sessions to the other nodes:

\[\begin{split}\mathrm{mapping\_entry\_size} &= 1\mathrm{GiB} \\ \\ \mathrm{reserved} &= 1\mathrm{GiB} \cdot 12 \\ \mathrm{reserved} &= 12\mathrm{GiB} \\ \\ \mathrm{prefetch_{ondemand}} &= 128\mathrm{GiB} - 2 \cdot 12\mathrm{GiB} \\ \mathrm{prefetch_{ondemand}} &= 104\mathrm{GiB}\end{split}\]

2) Calculate the space needed for MSI.

\[\begin{split}\mathrm{prefetch_{msi}} &= 1\mathrm{GiB} \cdot 2 \\ \mathrm{prefetch_{msi}} &= 2\mathrm{GiB}\end{split}\]

3) Calculate the space needed for P2P and BARs. Using lspci we see that the GPUs have 3 memory BARs of size 64M, 32G and 32M:

$ lspci -s 21:00.0 -v
21:00.0 VGA compatible controller: NVIDIA Corporation Device 2c31 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device 2051
        Physical Slot: 2
        Flags: bus master, fast devsel, latency 0, IRQ 223, NUMA node 0, IOMMU group 73
        Memory at bc000000 (32-bit, non-prefetchable) [size=64M]
        Memory at 6f800000000 (64-bit, prefetchable) [size=32G]
        Memory at 70012000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 4000 [size=128]
        Expansion ROM at c0000000 [virtual] [disabled] [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
\[\begin{split}\mathrm{prefetch}_{BARs} &= 2 \cdot (1\mathrm{GiB} + 32\mathrm{GiB} + 1\mathrm{GiB}) \\ \mathrm{prefetch}_{BARs} &= 68\mathrm{GiB}\end{split}\]

4) Split the remaining space into DMA windows.

\[\begin{split}\mathrm{prefetch_{free}} &= \mathrm{prefetch_{ondemand}} - \mathrm{prefetch_{msi}} - \mathrm{prefetch_{BARs}} \\ \mathrm{prefetch_{free}} &= 104\mathrm{GiB} - 2\mathrm{GiB} - 68\mathrm{GiB} \\ \mathrm{prefetch_{free}} &= 34\mathrm{GiB} \\ \\ \mathrm{dma\_window\_size} &= 17\mathrm{GiB} \\\end{split}\]

Hint

The DMA window never needs to be larger than the RAM size of the borrower.

IOMMU Disabled on the Borrower

../../_images/smartio-dma-window2.svg

If IOMMU is not enabled on the borrower, the lender must map the entire RAM of the borrower to ensure that the device can access all possible DMA buffers.

When the IOMMU is disabled, SmartIO is not able to map buffers on the fly, so the DMA window must be large enough to cover the entire RAM of the borrower. Borrowing multiple devices from the same lender consumes no additional prefetchable memory on the lender. On the other hand, when multiple borrowers borrow devices from a given lender, the lender must be able to map the entire RAM of all borrowers. Because of this, we recommend enabling the IOMMU on the borrower, see IOMMU / VT-d.

Warning

When the IOMMU is disabled, we strongly recommend using the automatic DMA window size; SmartIO will then size the window to cover all of RAM. If the size is picked manually, you must be sure that the device driver never allocates DMA buffers above the DMA window, or at least that it handles mapping errors gracefully.

Hint

When a node has the IOMMU disabled, the ntb_lut2_switch_max_entries reserved range will only cover the preallocated memory range and can only be used for memory segments, not SmartIO. The reserved space will be limited to the size of the preallocated memory (ntb_memory_preallocation_size_mb) on the remote node.