fio

This page demonstrates how NVMe storage performs across different deployment models, comparing local PCIe-attached devices with remote access over RDMA. Using fio, an industry-standard storage benchmarking tool, we evaluate how SmartIO Device Lending enables near-local performance when accessing remote NVMe devices.

We benchmark latency, random IOPS, and sequential throughput across several configurations to highlight performance differences and similarities:

  • Local baseline: Direct-attached NVMe over PCIe

  • Device lending: Remote NVMe accessed via SmartIO Device Lending

  • NVMe-oF (SPDK): Remote NVMe using Linux RDMA (initiator) and SPDK (target)

  • NVMe-oF (kernel): Remote NVMe using Linux RDMA on both initiator and target

These results illustrate how Device Lending compares to traditional NVMe-oF solutions, and how close it can get to local NVMe performance while enabling flexible, disaggregated storage architectures.
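For the two NVMe-oF configurations, the remote namespace is attached on the initiator side with nvme-cli, along the following lines (the transport address and subsystem NQN are placeholders):

$ nvme connect -t rdma -a <target-address> -s 4420 -n <subsystem-nqn>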

Installation

Follow the instructions on https://github.com/axboe/fio.

Benchmarking setup

We use our 4-node server-grade experimental setup with the borrower on node A and the lender on node B.


[Figure: milanq-nvme-topology.svg — PCIe and NVMe topology of the 4-node setup]

Group      Category                Details
System     Topology                4 nodes named A, B, C, D
           CPU                     Dual-socket AMD EPYC 7763 64-Core
           Motherboard             Supermicro H12DSU-iN
           Model name              Supermicro AS -2024US-TRT
           Memory                  16x 128GiB DDR4 DIMMs at 3.2 GHz
           Operating system        Ubuntu 22.04-hwe
           Kernel                  Linux 6.8
PCIe       Host adapter cards      MXH930 PCIe 4.0 NTB Host Adapter
           Switch                  MXS924 PCIe 4.0 Switch
           PCIe cables             PCIe 4.0 SFF-8644 Cables
           Driver version          5.26 (estimated release later in 2026)
GPUs       NVIDIA GPUs             2x A100 40GB (node A and B)
           AMD GPUs                2x AMD Instinct MI210 (node C and D)
           NVIDIA driver version   590
           CUDA toolkit version    11.8
           NCCL version            2.28.9-1
           nccl-tests version      2.16.5
Storage    Storage                 PM1733 Entry NVMe PCIe SSD (1.92 TB)
           fio version             3.41
           SPDK version            26.01
RDMA       RDMA NIC                Mellanox ConnectX-6 200Gbps using InfiniBand
           RDMA switch             NVIDIA QM8700
           OFED version            24.10

Running fio

We define a total of 6 fio jobfiles. These jobfiles measure read and write latency, random-access IOPS, and sequential throughput, and are tuned to saturate IOPS and throughput when the NVMe is local. We then run all 6 jobfiles with each of the 4 NVMe placements: Local baseline, Device lending, NVMe-oF (SPDK), and NVMe-oF (kernel).

We execute each job with the following arguments:

$ fio --filename /dev/<nvme-namespace>  \
      --direct=1                        \
      --buffered=0                      \
      --invalidate=1                    \
      --randrepeat=1                    \
      --random_generator=lfsr           \
      --norandommap                     \
      --thread                          \
      --cpus_allowed_policy=split       \
      --time_based=1                    \
      --ramp_time=5s                    \
      --runtime=30s                     \
      --group_reporting                 \
      --per_job_logs=0                  \
      --output-format=json              \
      --output=<output>.json            \
        <path-to-jobfile>
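A small driver script can then sweep all 6 jobfiles for each placement. A minimal dry-run sketch — the device path is a placeholder, and bw-read-seq/bw-write-seq are assumed names for the two throughput jobfiles; drop the echo to actually run:

```shell
#!/bin/sh
# Dry run: print one fio invocation per jobfile instead of executing it.
NVME_DEV=/dev/nvme0n1   # placeholder for the namespace under test
for job in lat-read-rand lat-write-rand iops-read-rand iops-write-rand \
           bw-read-seq bw-write-seq; do
    echo fio --filename "$NVME_DEV" --direct=1 --output-format=json \
         --output "$job.json" "jobs/$job.fio"
done
```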

Note

We set the thread affinity to the CPUs on the same CCD as the relevant I/O device. For the local baseline, we bind the benchmarking thread to the CPUs closest to the local NVMe; for Device lending experiments, to the CPUs closest to the NTB adapter; and for NVMe-oF experiments, to the CPUs closest to the Mellanox HCA. This can be done with taskset, numactl, or fio's cpus_allowed argument.
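For example, the pinning can be written into the jobfile itself; the CPU range below is a placeholder for whichever cores share a CCD with the device:

cpus_allowed=16-23
cpus_allowed_policy=split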

Our NVMe SSD achieves the following peak performance:

Metric                        Vendor reported performance    Measured performance
Random read latency           100 μs                         ~72 μs
Random write latency          25 μs                          ~19 μs
Random read IOPS              800 K                          ~1.029 M
Random write IOPS             100 K                          ~134 K
Sequential read throughput    7.00 GB/s                      ~7.13 GB/s*
Sequential write throughput   2.40 GB/s                      ~2.43 GB/s

* 7.13 GB/s is close to the maximum usable bandwidth of a PCIe 4.0 x4 link. The NVMe therefore reaches the upper limit of achievable sequential read throughput.
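That ceiling can be sanity-checked with back-of-the-envelope numbers. The encoding loss comes from PCIe 4.0's 128b/130b line code; the per-packet efficiency below assumes a 256-byte max payload carried with roughly 26 bytes of TLP/DLLP framing (an assumption, not a measured value):

```shell
awk 'BEGIN {
    # 16 GT/s per lane * 4 lanes, 128b/130b encoding, bits -> bytes
    raw = 16 * 4 * 128 / 130 / 8
    # assume 256 B of payload per ~282 B on the wire
    eff = raw * 256 / 282
    printf "raw %.2f GB/s, effective %.2f GB/s\n", raw, eff
}'
# prints: raw 7.88 GB/s, effective 7.15 GB/s
```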

Latency results

Reads

For read latency, we use the following jobfile and plot the distribution.

[lat-read-rand]
description=Random access, read-only workload for measuring latency
ioengine=sync
numjobs=1
blocksize=4K
readwrite=randread
name=lat-read-rand
write_lat_log=/tmp/lat-read-rand

We see that the Device lending latency is close to local latency, set back only by minor software overhead on the borrower side and the link traversal latency. NVMe-oF adds software overhead on both the initiator and the target side, leading to slightly higher latency in the data path.
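The distribution comes from the latency log that write_lat_log produces: one CSV row per I/O, with completion time (ms) in column 1 and latency (ns) in column 2. A quick summary can be pulled out with awk; the two sample rows below stand in for the real log under /tmp:

```shell
# fio lat-log columns: time_ms, latency_ns, direction (0 = read), block size, offset
printf '5, 71000, 0, 4096, 0\n6, 73000, 0, 4096, 4096\n' > /tmp/sample_lat.log
awk -F', ' '{ sum += $2; n++ } END { printf "mean %.0f ns over %d IOs\n", sum/n, n }' \
    /tmp/sample_lat.log
# prints: mean 72000 ns over 2 IOs
```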

Writes

For write latency, we use the following jobfile.

[lat-write-rand]
description=Random access, write-only workload for measuring latency
ioengine=sync
numjobs=1
blocksize=4K
readwrite=randwrite
name=lat-write-rand
write_lat_log=/tmp/lat-write-rand

IOPS results

Reads

[iops-read-rand]
description=Random access, read-only workload for measuring IOPS
ioengine=io_uring
iodepth=128
numjobs=8
blocksize=4K
readwrite=randread
fixedbufs=1
name=iops-read-rand
write_iops_log=/tmp/iops-read-rand
log_avg_msec=10

If we zoom in to display the bulk of the distribution, we see that the performance is close to optimal on all configurations, including Device lending.

Writes

To mitigate variance when maximizing write IOPS, we set a target IOPS equal to the local maximum using rate_iops=16750 and rate_process=poisson. 16750 IOPS multiplied by 8 threads is 134 K, which is the mean IOPS we see when the NVMe is local.

[iops-write-rand]
description=Random access, write-only workload for measuring IOPS
ioengine=io_uring
iodepth=128
numjobs=8
blocksize=4K
readwrite=randwrite
fixedbufs=1
rate_iops=16750
rate_process=poisson
name=iops-write-rand
write_iops_log=/tmp/iops-write-rand
log_avg_msec=10

Again, we zoom in on the distribution, and see that Device lending performs close to the local baseline and NVMe-oF.

Throughput results

For throughput, we set a target bandwidth of 7.5 GB/s for reads and 3.0 GB/s for writes, and do a “sweep run”, varying block size, number of jobs, and queue depth. We then extract the best-performing configuration. The configuration therefore differs per placement:

Placement          Block size    Queue depth    Number of jobs
Local baseline     128 M         128            1
Device lending     128 M         128            1
NVMe-oF (SPDK)     128 K         128            8
NVMe-oF (kernel)   128 K         128            8
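The sweep itself is mechanical: cross the parameter axes and run one job per combination. A dry-run sketch — the axis values and the jobfile name here are illustrative, not the exact grid we used:

```shell
# Print one fio invocation per (block size, queue depth, job count) combination.
for bs in 128K 1M 128M; do
    for qd in 32 128; do
        for jobs in 1 8; do
            echo fio --blocksize="$bs" --iodepth="$qd" --numjobs="$jobs" bw-read-seq.fio
        done
    done
done
```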

Reads

Writes

References