fio¶
This page demonstrates how NVMe storage performs across different deployment models, comparing local PCIe-attached devices with remote access over RDMA. Using fio, an industry-standard storage benchmarking tool, we evaluate how SmartIO Device Lending enables near-local performance when accessing remote NVMe devices.
We benchmark latency, random IOPS, and sequential throughput across several configurations to highlight performance differences and similarities:
- Local baseline: Direct-attached NVMe over PCIe
- Device lending: Remote NVMe accessed via SmartIO Device Lending
- NVMe-oF (SPDK): Remote NVMe using Linux RDMA (initiator) and SPDK (target)
- NVMe-oF (kernel): Remote NVMe using Linux RDMA on both initiator and target
These results illustrate how Device Lending compares to traditional NVMe-oF solutions, and how close it can get to local NVMe performance while enabling flexible, disaggregated storage architectures.
Installation¶
Follow the build and installation instructions at https://github.com/axboe/fio.
Benchmarking setup¶
We use our 4-node server-grade experimental setup with the borrower on node A and the lender on node B.
| Group | Category | Details |
|---|---|---|
| System | Topology | 4 nodes named A, B, C, D |
| | CPU | Dual-socket AMD EPYC 7763 64-Core |
| | Motherboard | Supermicro H12DSU-iN |
| | Model name | Supermicro AS -2024US-TRT |
| | Memory | 16x 128GiB DDR4 DIMMs at 3.2 GHz |
| | Operating system | Ubuntu 22.04-hwe |
| | Kernel | Linux 6.8 |
| PCIe | Host adapter cards | |
| | Switch | |
| | PCIe cables | |
| | Driver version | 5.26 (estimated release later in 2026) |
| GPUs | NVIDIA GPUs | 2x A100 40GB (node A and B) |
| | AMD GPUs | 2x AMD Instinct MI210 (node C and D) |
| | NVIDIA driver version | 590 |
| | CUDA toolkit version | 11.8 |
| | NCCL version | 2.28.9-1 |
| | nccl-tests version | 2.16.5 |
| Storage | Storage | PM1733 Entry NVMe PCIe SSD (1.92 TB) |
| | fio version | 3.41 |
| | SPDK version | 26.01 |
| RDMA | RDMA NIC | Mellanox ConnectX-6 200Gbps using InfiniBand |
| | RDMA switch | NVIDIA QM8700 |
| | OFED version | 24.10 |
Running fio¶
We define a total of 6 fio jobfiles that measure read and write latency, random-access IOPS, and sequential throughput. The jobfiles are tuned to maximize IOPS and throughput when the NVMe is local. We then run all 6 jobfiles with the following 4 NVMe placements: Local baseline, Device lending, NVMe-oF (SPDK), and NVMe-oF (kernel).
We execute each job with the following arguments:
$ fio --filename=/dev/<nvme-namespace> \
--direct=1 \
--buffered=0 \
--invalidate=1 \
--randrepeat=1 \
--random_generator=lfsr \
--norandommap \
--thread \
--cpus_allowed_policy=split \
--time_based=1 \
--ramp_time=5s \
--runtime=30s \
--group_reporting \
--per_job_logs=0 \
--output-format=json \
--output=<output>.json \
<path-to-jobfile>
Note
We set the thread affinity to the CPUs on the same CCD as the relevant I/O device. For the local baseline, we bind the benchmarking thread to the CPUs closest to the local NVMe. For Device lending experiments, we bind it to the CPUs closest to the NTB. For NVMe-oF experiments, we bind it to the CPUs closest to the Mellanox HCA. This can be done with taskset, numactl, or fio's cpus_allowed argument.
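Since we pass --output-format=json, the aggregate results of each run can be extracted programmatically. Below is a minimal sketch; the nested keys follow fio's JSON output schema, and the inline sample dictionary stands in for a real output file (a real file contains many more fields):

```python
import json

# Stand-in for json.load(open("<output>.json")); values here are illustrative.
sample = {
    "jobs": [
        {
            "jobname": "lat-read-rand",
            "read": {
                "iops": 13890.2,
                "bw_bytes": 56893440,
                "clat_ns": {"mean": 71523.4, "percentile": {"99.000000": 89600}},
            },
        }
    ]
}

job = sample["jobs"][0]
mean_us = job["read"]["clat_ns"]["mean"] / 1000  # completion latency, ns -> us
p99_us = job["read"]["clat_ns"]["percentile"]["99.000000"] / 1000
print(f"{job['jobname']}: mean {mean_us:.1f} us, p99 {p99_us:.1f} us")
```

The same structure exists under the "write" key for write workloads, which makes it straightforward to tabulate all 24 runs (6 jobfiles x 4 placements) from their JSON files.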
Our NVMe SSD has the following maximum performance:
| Metric | Vendor reported performance | Measured performance |
|---|---|---|
| Random read latency | 100 μs | ~72 μs |
| Random write latency | 25 μs | ~19 μs |
| Random read IOPS | 800 K | ~1.029 M |
| Random write IOPS | 100 K | ~134 K |
| Sequential read throughput | 7.00 GB/s | ~7.13 GB/s* |
| Sequential write throughput | 2.40 GB/s | ~2.43 GB/s |
* 7.13 GB/s is close to the maximum bandwidth of a PCIe 4.0 x4 link. The NVMe therefore reaches the upper limit of sequential read throughput for its link.
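As a sanity check, the raw ceiling of a PCIe 4.0 x4 link can be computed from the link parameters (16 GT/s per lane, 128b/130b line encoding). DLLP/TLP protocol overhead is not modeled here, which is why the achievable payload rate lands somewhat below this figure:

```python
lanes = 4
transfer_rate = 16e9            # PCIe 4.0: 16 GT/s per lane
encoding = 128 / 130            # 128b/130b line encoding
bytes_per_s = lanes * transfer_rate * encoding / 8
print(f"{bytes_per_s / 1e9:.2f} GB/s")  # raw ceiling before protocol overhead
```

With protocol overhead subtracted, the measured ~7.13 GB/s is about as close to this bound as an NVMe device can get.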
Latency results¶
Reads¶
For read latency, we use the following jobfile and plot the distribution.
[lat-read-rand]
description=Random access, read-only workload for measuring latency
ioengine=sync
numjobs=1
blocksize=4K
readwrite=randread
name=lat-read-rand
write_lat_log=/tmp/lat-read-rand
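With --per_job_logs=0, the write_lat_log option above produces /tmp/lat-read-rand_lat.log. Per fio's log format, each line is roughly `time (ms), latency (ns), data direction, block size (bytes), offset`. A sketch for loading the per-I/O latencies for plotting; the inline sample stands in for the real log file:

```python
import csv
import io

# Stand-in for open("/tmp/lat-read-rand_lat.log"); fio logs one I/O per line:
# time (ms), latency (ns), data direction (0=read), block size (bytes), offset
sample_log = io.StringIO(
    "5, 71200, 0, 4096, 0\n"
    "6, 73400, 0, 4096, 0\n"
    "7, 70100, 0, 4096, 0\n"
)

latencies_us = [int(row[1].strip()) / 1000 for row in csv.reader(sample_log)]
print(latencies_us)  # values to feed into a histogram or CDF plot
```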
We see that Device lending latency is close to local latency, set back only by minor software overhead on the borrower side and by link traversal latency. NVMe-oF adds software overhead on both the initiator and the target side, leading to slightly higher latency in the data path.
Writes¶
For write latency, we use the following jobfile.
[lat-write-rand]
description=Random access, write-only workload for measuring latency
ioengine=sync
numjobs=1
blocksize=4K
readwrite=randwrite
name=lat-write-rand
write_lat_log=/tmp/lat-write-rand
IOPS results¶
Reads¶
[iops-read-rand]
description=Random access, read-only workload for measuring IOPS
ioengine=io_uring
iodepth=128
numjobs=8
blocksize=4K
readwrite=randread
fixedbufs=1
name=iops-read-rand
write_iops_log=/tmp/iops-read-rand
log_avg_msec=10
If we zoom in to display the bulk of the distribution, we see that performance is close to optimal in all configurations, including Device lending.
Writes¶
To mitigate variance when maximizing write IOPS, we set a per-job target IOPS equal to the local maximum using rate_iops=16750 and rate_process=poisson. 16750 IOPS multiplied by 8 threads is 134 K, the mean IOPS we measure when the NVMe is local.
[iops-write-rand]
description=Random access, write-only workload for measuring IOPS
ioengine=io_uring
iodepth=128
numjobs=8
blocksize=4K
readwrite=randwrite
fixedbufs=1
rate_iops=16750
rate_process=poisson
name=iops-write-rand
write_iops_log=/tmp/iops-write-rand
log_avg_msec=10
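Because rate_iops applies per job, the aggregate target is the per-job rate times numjobs. A trivial check, mirroring the jobfile parameters above:

```python
rate_iops = 16750   # per-job target from the jobfile
numjobs = 8
total = rate_iops * numjobs
print(total)  # 134000, matching the ~134 K local write IOPS
```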
Again, we zoom in on the distribution and see that Device lending performs close to both the local baseline and NVMe-oF.
Throughput results¶
For throughput, we set a target bandwidth of 7.5 GB/s for reads and 3.0 GB/s for writes, and perform a “sweep run”, varying block size, number of jobs, and queue depth. We then extract the best-performing configuration, which therefore differs per topology:
| Configuration | Block size | Queue depth | Number of jobs |
|---|---|---|---|
| Local baseline | 128 M | 128 | 1 |
| Device lending | 128 M | 128 | 1 |
| NVMe-oF (SPDK) | 128 K | 128 | 8 |
| NVMe-oF (kernel) | 128 K | 128 | 8 |
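A sweep run like the one described above can be driven by a small script that enumerates candidate configurations and emits one fio invocation per grid point. The grid values and device path below are illustrative assumptions, not the exact grid we used:

```python
from itertools import product

# Assumed sweep grid; adjust to taste.
blocksizes = ["128K", "1M", "16M", "128M"]
iodepths = [32, 64, 128]
numjobs = [1, 4, 8]

commands = []
for bs, qd, nj in product(blocksizes, iodepths, numjobs):
    commands.append(
        "fio --filename=/dev/nvme0n1 --rw=read --direct=1 "
        f"--blocksize={bs} --iodepth={qd} --numjobs={nj} "
        f"--runtime=30s --time_based --output=sweep-{bs}-{qd}-{nj}.json "
        "--output-format=json"
    )

print(len(commands))  # 36 configurations in this grid
```

Running each command and ranking the JSON outputs by bandwidth yields the best-performing configuration per topology.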