fio

fio is one of the most widely used storage and file system benchmarking tools, supporting a range of I/O engines and configurations. This page presents benchmarking results for NVMe Device lending and compares them with local setups and NVMe-oF solutions.

Installation

Follow the instructions on https://github.com/axboe/fio.

Benchmarking setup

We use our 4-node server-grade experimental setup with the borrower on node A and the lender on node B.

Group     Category               Details
--------  ---------------------  -----------------------------------------
System    Topology               4 nodes named A, B, C, D
System    CPU                    Dual-socket AMD EPYC 7763 64-Core
System    Motherboard            Supermicro H12DSU-iN
System    Model name             Supermicro AS -2024US-TRT
System    Memory                 16x 128 GiB DDR4 DIMMs at 3.2 GHz
System    Operating system       Ubuntu 22.04-hwe
System    Kernel                 Linux 6.8
PCIe      Host adapter cards     MXH930 PCIe 4.0 NTB Host Adapter
PCIe      Switch                 MXS924 PCIe 4.0 Switch
PCIe      PCIe cables            PCIe 4.0 SFF-8644 cables
PCIe      Driver version         5.25
GPUs      NVIDIA GPUs            2x A100 40 GB (nodes A and B)
GPUs      AMD GPUs               2x AMD Instinct MI210 (nodes C and D)
GPUs      NVIDIA driver version  590
GPUs      CUDA toolkit version   11.8
GPUs      NCCL version           2.28.9-1
GPUs      nccl-tests version     2.16.5
Storage   NVMe SSD               PM1733 Entry NVMe PCIe SSD (1.92 TB)
Storage   fio version            3.41
Storage   SPDK version           26.01
RDMA      RDMA NIC               Mellanox ConnectX-6 200 Gbps (InfiniBand)
RDMA      RDMA switch            NVIDIA QM8700
RDMA      OFED version           24.10

Running fio

We define a total of six fio jobfiles. These jobfiles measure read and write latency, random-access IOPS, and sequential throughput, and are tuned to maximize IOPS and throughput when the NVMe is local. We then run all six jobfiles with the following NVMe placements:

  • Local baseline: Local NVMe

  • Device lending: Remote NVMe with Device lending

  • NVMe-oF (SPDK): Remote NVMe with the Linux RDMA driver on the initiator and the SPDK driver on the target

  • NVMe-oF (kernel): Remote NVMe with the Linux RDMA driver on both initiator and target
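For the two NVMe-oF placements, the initiator must attach the remote namespace before fio can open it. With the kernel initiator this is done with nvme-cli; the address, port, and subsystem NQN below are placeholders, not values from our setup:

```shell
# Load the RDMA transport on the initiator (kernel NVMe-oF).
modprobe nvme-rdma

# Discover and connect to the remote subsystem; substitute your
# target's actual transport address, service port, and NQN.
nvme discover -t rdma -a <target-ip> -s 4420
nvme connect  -t rdma -a <target-ip> -s 4420 -n <subsystem-nqn>

# The remote namespace now appears as a local block device
# (e.g. /dev/nvme1n1) and can be passed to fio via --filename.
```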

We execute each job with the following arguments:

$ fio --filename /dev/<nvme-namespace>  \
      --direct=1                        \
      --buffered=0                      \
      --invalidate=1                    \
      --randrepeat=1                    \
      --random_generator=lfsr           \
      --norandommap                     \
      --thread                          \
      --cpus_allowed_policy=split       \
      --time_based=1                    \
      --ramp_time=5s                    \
      --runtime=30s                     \
      --group_reporting                 \
      --per_job_logs=0                  \
      --output-format=json              \
      --output=<output>.json            \
        <path-to-jobfile>

Note

We set the thread affinity to the CPUs on the same CCD as the relevant I/O device. For the local baseline, we bind the benchmarking thread to the CPUs closest to the local NVMe; for Device lending experiments, to the CPUs closest to the NTB host adapter; and for NVMe-oF experiments, to the CPUs closest to the Mellanox HCA. This can be done with taskset, numactl, or fio's cpus_allowed argument.
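One way to find the right CPU set is to read the device's local_cpulist from sysfs and pass it straight to fio. The PCI address below is a placeholder; locate the actual device with lspci:

```shell
# Placeholder PCI address of the I/O device (NVMe, NTB adapter, or HCA).
DEV=0000:41:00.0

# CPUs local to the device's NUMA node.
cat /sys/bus/pci/devices/$DEV/local_cpulist

# Use the list instead of taskset/numactl when invoking fio:
fio --cpus_allowed="$(cat /sys/bus/pci/devices/$DEV/local_cpulist)" ...
```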

Our NVMe SSD delivers the following maximum performance:

Metric                       Vendor reported  Measured
---------------------------  ---------------  ------------
Random read latency          100 μs           ~72 μs
Random write latency         25 μs            ~19 μs
Random read IOPS             800 K            ~1.029 M
Random write IOPS            100 K            ~134 K
Sequential read throughput   7.00 GB/s        ~7.13 GB/s*
Sequential write throughput  2.40 GB/s        ~2.84 GB/s

* 7.13 GB/s is close to the maximum payload bandwidth of a PCIe 4.0 x4 link; the NVMe's sequential read throughput is therefore limited by the link itself.
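As a sanity check on that claim, the raw payload limit of a PCIe 4.0 x4 link can be worked out from the per-lane signaling rate and the 128b/130b line encoding (TLP and DLLP protocol overhead reduces the practically usable figure a bit further, to roughly 7.2 GB/s):

```shell
awk 'BEGIN {
  gts   = 16        # PCIe 4.0: 16 GT/s per lane
  lanes = 4         # NVMe SSDs typically use an x4 link
  enc   = 128/130   # 128b/130b line encoding
  # Divide by 8 to convert Gbit/s to GB/s.
  printf "PCIe 4.0 x4 raw payload limit: %.2f GB/s\n", gts * lanes * enc / 8
}'
```

This prints a limit of about 7.88 GB/s before protocol overhead, so a measured 7.13 GB/s is effectively link-bound.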

Latency results

Reads

For read latency, we use the following jobfile and plot the distribution.

[lat-read-rand]
description=Random access, read-only workload for measuring latency
ioengine=sync
numjobs=1
blocksize=4K
readwrite=randread
name=lat-read-rand
write_lat_log=/tmp/lat-read-rand

We see that the Device lending latency is close to local latency, held back only by minor software overhead on the borrower side and the added link traversal. NVMe-oF adds processing on both the initiator and the target, resulting in slightly higher latency in the data path.

Writes

For write latency, we use the following jobfile.

[lat-write-rand]
description=Random access, write-only workload for measuring latency
ioengine=sync
numjobs=1
blocksize=4K
readwrite=randwrite
name=lat-write-rand
write_lat_log=/tmp/lat-write-rand

IOPS results

Reads

[iops-read-rand]
description=Random access, read-only workload for measuring IOPS
ioengine=io_uring
iodepth=128
numjobs=8
blocksize=4K
readwrite=randread
fixedbufs=1
name=iops-read-rand
write_iops_log=/tmp/iops-read-rand
log_avg_msec=10

If we zoom in to display the bulk of the distribution, we see that the performance is close to optimal on all configurations, including Device lending.

Writes

To mitigate variance when maximizing write IOPS, we set a target IOPS rate equal to the local maximum using rate_iops=16750 and rate_process=poisson. 16750 IOPS across 8 threads gives 134 K, the mean aggregate IOPS we measure when the NVMe is local.
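The per-thread rate follows directly from dividing the local aggregate by the job count:

```shell
# 134 K aggregate IOPS split across the 8 fio jobs gives the
# per-thread value passed to rate_iops.
echo $((134000 / 8))
```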

[iops-write-rand]
description=Random access, write-only workload for measuring IOPS
ioengine=io_uring
iodepth=128
numjobs=8
blocksize=4K
readwrite=randwrite
fixedbufs=1
rate_iops=16750
rate_process=poisson
name=iops-write-rand
write_iops_log=/tmp/iops-write-rand
log_avg_msec=10

Again, we zoom in on the distribution, and see that Device lending performs close to the local baseline and NVMe-oF.

Throughput results

For throughput, we set a target bandwidth of 7.5 GB/s for reads and 3.0 GB/s for writes and do a “sweep run”, varying block size, number of jobs, and queue depth. We then extract the best-performing configuration, which therefore differs between placements.
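A sweep of this kind can be scripted as nested loops over the three parameters. The sketch below only prints the candidate fio invocations, and the value lists and device path are illustrative, not our exact sweep grid; replacing echo with the real command runs the sweep:

```shell
#!/bin/sh
# Hypothetical sweep grid; substitute your own values and device path.
for bs in 128k 1M 16M 64M 128M; do
  for qd in 16 32 64 128; do
    for jobs in 1 4 8; do
      echo fio --filename=/dev/nvme0n1 --direct=1 --rw=read \
               --blocksize=$bs --iodepth=$qd --numjobs=$jobs \
               --output=sweep-$bs-$qd-$jobs.json
    done
  done
done
```

Each loop iteration emits one candidate configuration (60 in this grid); the best performer is then read out of the JSON results.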

Reads

Placement         Block size  Queue depth  Number of jobs
----------------  ----------  -----------  --------------
Local baseline    64M         128          8
Device lending    128M        128          1
NVMe-oF (SPDK)    64M         64           8
NVMe-oF (kernel)  128M        32           8

Writes

Placement         Block size  Queue depth  Number of jobs
----------------  ----------  -----------  --------------
Local baseline    128M        128          1
Device lending    128M        128          1
NVMe-oF (SPDK)    128K        128          4
NVMe-oF (kernel)  16M         16           8
