Using Dolphin’s dis_nvme driver¶
The SmartIO NVMe driver is a prototype NVMe driver implementation that enables simultaneous sharing of NVMe devices among multiple hosts, without requiring that the devices support virtualization features such as SR-IOV. NVMe devices registered with SmartIO show up as block devices on all hosts running the driver.
Disclaimer¶
The dis_nvme solution should be considered a development version and is not yet a finalized product. Dolphin does not provide any guarantees with regard to functionality or data integrity.
Further details about how the NVMe driver works can be found in the Multi-Host Sharing of a Single-Function NVMe Device in a PCIe Cluster paper.
If you want to use dis_nvme in production, please contact us.
System Requirements¶
Development versions for supported kernels can be downloaded here: https://dl.dolphinics.com/ci/nvme/. Use the same username and password as for downloading the eXpressWare driver. You need to run the same version of eXpressWare (an installer is bundled in the NVMe driver download).
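For example, assuming the download area uses standard HTTP authentication, a build can be fetched with wget using your eXpressWare download credentials; replace <username> with your own user, and note that the archive names under this URL are build- and kernel-specific:
$ wget --user=<username> --ask-password https://dl.dolphinics.com/ci/nvme/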
The cluster nodes must be a supported platform.
The cluster nodes must run a supported Linux kernel.
Additional System Considerations¶
The NTB prefetchable size on the lender limits the amount of RAM the NVMe drive can access at any given moment. A small prefetchable size can negatively impact performance.
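As a rough check, the prefetchable window configured behind the Dolphin adapter on the lender can be inspected with lspci; the PCI bus address below is only an example and will differ on your system:
$ lspci -vv -s 03:00.0 | grep -i prefetchable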
Loading the drivers¶
The implementation is composed of two drivers: a manager and a client. An NVMe device is managed by a single manager, but multiple clients may use it at the same time. The manager driver is responsible for resetting an NVMe device and configuring the admin queues used for managing it. It executes admin commands on behalf of the clients and keeps track of the distributed I/O queues. The manager is implemented as a stand-alone driver called dis_nvme_manager.
Client drivers request I/O queue pairs from the manager and operate the NVMe device independently through them. Clients are also responsible for presenting their respective nodes with a block device interface, which provides block access to file systems and applications on those nodes. The client driver is implemented as the dis_nvme driver.
Load the dis_nvme manager on one of the cluster nodes by running (one host only):
$ insmod dis_nvme_manager.ko
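If you want to verify that the manager loaded correctly before continuing, standard module tooling can be used, for example:
$ lsmod | grep dis_nvme_manager
$ dmesg | tail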
On all hosts that need access to the shared NVMe drive, load the dis_nvme client module. The modules will automatically discover all available NVMe drives, and every node will get a new /dev/disnvme### block device for each available drive. At this point you can use the NVMe drives like any other shared disk. Note that dis_nvme only provides shared block-level access to the drives; a shared-disk file system (such as GFS2) must be used if multiple nodes are to mount the same file system at the same time.
Loading the client driver is done by running (all hosts):
$ insmod dis_nvme.ko
If done correctly, the disk will be automatically shared when the driver starts and should appear in lsblk.
$ lsblk
...
dis80b00n1 249:1 0 232.9G 0 disk
dis80800n1 250:1 0 1.8T 0 disk
It is now possible to partition the /dev/dis#### block devices and format them with a file system, for example a shared-disk file system such as GFS2 or OCFS2. In our lab, we have done some preliminary testing with the following file systems:
GFS2
OCFS2
ext4 (mounted read-only)
Particularly for GFS2, we recommend following the Red Hat Enterprise Linux guide on setting it up with pacemaker.
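As an illustration only, a GFS2 file system could be created on one node and then mounted on every node roughly as follows; the cluster name, file system name, journal count and mount point are placeholders that must match your own pacemaker/DLM configuration, and the device name is taken from the lsblk example above:
$ mkfs.gfs2 -p lock_dlm -t <clustername>:<fsname> -j 2 /dev/dis80b00n1
$ mount -t gfs2 /dev/dis80b00n1 /mnt/share
The number of journals (-j) must be at least the number of nodes that will mount the file system simultaneously.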
Hint
We recommend using one of the interfaces provided by Dolphin’s TCP/IP driver (dis-ip) as the network connection for shared-disk file system lock managers.
See the documentation for the Dolphin TCP/IP driver or the product page for TCP/IP over PCI Express (IPoPCIe) for more information.
Restarting Dolphin drivers¶
If restarting the eXpressWare drivers is necessary, the dis_nvme drivers must
first be unloaded with rmmod. Before unloading the drivers, any mounted
file systems must be unmounted.
To see mount points, use the mount command:
$ mount | grep "^/dev/dis"
/dev/dis40300n1 on /mnt/share type ocfs2 (rw,...)
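A typical teardown sequence before restarting the eXpressWare drivers would therefore look roughly like this; the mount point is taken from the example above, the unmount and client unload must be done on every host, and the manager is unloaded only on the host running it:
$ umount /mnt/share
$ rmmod dis_nvme
$ rmmod dis_nvme_manager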
Managing NVMe devices¶
Managing NVMe devices is done through the smartio_tool utility program that is
included in Dolphin’s eXpressWare suite. See the documentation for Lending Local Devices
or the smartio_tool documentation on how to use this.
The fabric-device identifier matches the block device identifier.
$ smartio_tool list
40600: Non-Volatile memory controller ... [in use]
80b00: Non-Volatile memory controller ... [in use]
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
...
dis40600n1 259:0 0 349.3G 0 disk
dis80b00n1 259:1 0 119.2G 0 disk
Enabling debug output¶
Debug output can be enabled for a running driver by setting the debug_level
parameter in /sys/module/dis_nvme_manager/parameters/debug_level and
/sys/module/dis_nvme/parameters/debug_level for the manager and client
respectively. This parameter may also be passed on insmod.
Set the debug_level to a value in the range [0-3]:
0: Debug output disabled (only info and warning level)
1: Debug
2: Verbose
3: Trace
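For example, verbose debug output can be enabled on a loaded client driver, or requested at load time; the value 2 is just an example:
$ echo 2 > /sys/module/dis_nvme/parameters/debug_level
$ insmod dis_nvme.ko debug_level=2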
Known limitations¶
Currently the NVMe driver does not support creating and attaching NVMe namespaces.
Support for devices with more than one namespace is buggy.
Use the nvme-cli util on the local system to manage namespaces before
loading the dis_nvme drivers.
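For example, the namespaces of a local drive can be listed with nvme-cli before the dis_nvme drivers are loaded; the controller name below is only an example:
$ nvme list
$ nvme list-ns /dev/nvme0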
Unloading the dis_nvme driver, or forcefully reclaiming NVMe devices that are in use by the lender, is known to cause issues if a file system is still mounted. Before unloading the driver, make sure to unmount any mounted file systems.
Interrupt polling¶
The NVMe drivers use interrupt flags. You should consider tweaking the polling control option (control-intr-poll with dis_tool) to find a trade-off between latency requirements and CPU utilization.
We recommend using the “delayed on” option, which adapts polling according to threshold values.
However, keep in mind that tweaking this will affect SISCI programs and SuperSockets. Please refer to the eXpressWare documentation on interrupt polling.