# fio
fio is the most widely used storage and file system benchmarking tool, supporting a range of I/O engines and configurations. This page presents benchmarking results for NVMe Device lending and compares them with local setups and NVMe-oF solutions.
## Installation
Follow the instructions on https://github.com/axboe/fio.
## Benchmarking setup
We use our 4-node server-grade experimental setup with the borrower on node A and the lender on node B.
| Group | Category | Details |
|---|---|---|
| System | Topology | 4 nodes named A, B, C, D |
| | CPU | Dual-socket AMD EPYC 7763 64-Core |
| | Motherboard | Supermicro H12DSU-iN |
| | Model name | Supermicro AS -2024US-TRT |
| | Memory | 16x 128 GiB DDR4 DIMMs at 3.2 GHz |
| | Operating system | Ubuntu 22.04-hwe |
| | Kernel | Linux 6.8 |
| PCIe | Host adapter cards | |
| | Switch | |
| | PCIe cables | |
| | Driver version | |
| GPUs | NVIDIA GPUs | 2x A100 40GB (node A and B) |
| | AMD GPUs | 2x AMD Instinct MI210 (node C and D) |
| | NVIDIA driver version | 590 |
| | CUDA toolkit version | 11.8 |
| | NCCL version | 2.28.9-1 |
| | nccl-tests version | 2.16.5 |
| Storage | NVMe SSD | PM1733 Entry NVMe PCIe SSD (1.92 TB) |
| | fio version | 3.41 |
| | SPDK version | 26.01 |
| RDMA | RDMA NIC | Mellanox ConnectX-6 200 Gbps using InfiniBand |
| | RDMA switch | NVIDIA QM8700 |
| | OFED version | 24.10 |
## Running fio
We define a total of 6 fio jobfiles, measuring read and write latency, random-access IOPS, and sequential throughput. The jobfiles are tuned to maximize IOPS and throughput when the NVMe is local. We then run all 6 jobfiles with the following NVMe placements:
- **Local baseline:** local NVMe
- **Device lending:** remote NVMe with Device lending
- **NVMe-oF (SPDK):** remote NVMe with the Linux RDMA driver on the initiator and the SPDK driver on the target
- **NVMe-oF (kernel):** remote NVMe with the Linux RDMA driver on both initiator and target
We execute each job with the following arguments:
```
$ fio --filename=/dev/<nvme-namespace> \
      --direct=1 \
      --buffered=0 \
      --invalidate=1 \
      --randrepeat=1 \
      --random_generator=lfsr \
      --norandommap \
      --thread \
      --cpus_allowed_policy=split \
      --time_based=1 \
      --ramp_time=5s \
      --runtime=30s \
      --group_reporting \
      --per_job_logs=0 \
      --output-format=json \
      --output=<output>.json \
      <path-to-jobfile>
```
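Since the runs are executed with `--output-format=json`, the headline numbers can be extracted programmatically. A minimal sketch, assuming the field layout (`jobs[].read.iops`, `jobs[].read.lat_ns.mean`) used by recent fio versions; the embedded sample stands in for a real output file, so verify the layout against your own results:

```python
import json

# Sample fio JSON output; substitute json.load(open("<output>.json"))
# on a real run. Values here mirror the local-baseline numbers above.
sample = json.loads("""
{
  "fio version": "fio-3.41",
  "jobs": [
    {
      "jobname": "iops-read-rand",
      "read": {"iops": 1029000.0, "lat_ns": {"mean": 72000.0}}
    }
  ]
}
""")

for job in sample["jobs"]:
    stats = job["read"]
    print(f"{job['jobname']}: {stats['iops'] / 1e6:.3f} M IOPS, "
          f"mean latency {stats['lat_ns']['mean'] / 1e3:.1f} us")
```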
> **Note**
>
> We set the thread affinity to the CPUs on the same CCD as the relevant I/O device: for the local baseline, we bind the benchmarking thread to the CPUs closest to the local NVMe; for Device lending experiments, to the CPUs closest to the NTB; and for NVMe-oF experiments, to the CPUs closest to the Mellanox HCA. This can be done with `taskset`, `numactl`, or fio's `cpus_allowed` argument.
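The affinity setup described in the note can be scripted. A sketch, assuming the standard Linux sysfs layout; the device name `nvme0` in the comment and the sample cpulist value are placeholders for your own system:

```python
# Sketch: build a cpus_allowed list from the NUMA node of an I/O device.

def parse_cpulist(cpulist: str) -> list[int]:
    """Expand a kernel cpulist string like '0-3,64-67' into CPU numbers."""
    cpus = []
    for part in cpulist.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

# On a live system, read the strings from sysfs, e.g.:
#   node = open("/sys/class/nvme/nvme0/device/numa_node").read().strip()
#   cpulist = open(f"/sys/devices/system/node/node{node}/cpulist").read().strip()
cpulist = "16-23,80-87"  # example: CPUs local to the device's NUMA node

print("--cpus_allowed=" + ",".join(str(c) for c in parse_cpulist(cpulist)))
```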
Our NVMe SSD delivers the following maximum performance:
| Metric | Vendor-reported performance | Measured performance |
|---|---|---|
| Random read latency | 100 μs | ~72 μs |
| Random write latency | 25 μs | ~19 μs |
| Random read IOPS | 800 K | ~1.029 M |
| Random write IOPS | 100 K | ~134 K |
| Sequential read throughput | 7.00 GB/s | ~7.13 GB/s* |
| Sequential write throughput | 2.40 GB/s | ~2.84 GB/s |
\* 7.13 GB/s is close to the maximum payload bandwidth of a PCIe 4.0 x4 link. The NVMe therefore reaches the practical upper limit for sequential read throughput.
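The footnote's claim can be checked with quick arithmetic: PCIe 4.0 runs at 16 GT/s per lane with 128b/130b line encoding, and TLP headers consume a few percent more. A sketch; the ~5% TLP overhead figure is an assumption for large payloads, not a measured value:

```python
# Back-of-the-envelope ceiling for a PCIe 4.0 x4 link.
lanes = 4
gtps = 16e9                  # 16 GT/s per lane (PCIe 4.0)
encoding = 128 / 130         # 128b/130b line encoding
raw_bytes = lanes * gtps * encoding / 8
print(f"Raw payload ceiling: {raw_bytes / 1e9:.2f} GB/s")   # 7.88 GB/s

tlp_efficiency = 0.95        # assumed protocol overhead for large TLPs
print(f"With ~5% TLP overhead: {raw_bytes * tlp_efficiency / 1e9:.2f} GB/s")
```

The measured ~7.13 GB/s sits just under this estimate, consistent with the link, not the flash, being the bottleneck.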
## Latency results

### Reads
For read latency, we use the following jobfile and plot the distribution.
```ini
[lat-read-rand]
description=Random access, read-only workload for measuring latency
ioengine=sync
numjobs=1
blocksize=4K
readwrite=randread
name=lat-read-rand
write_lat_log=/tmp/lat-read-rand
```
We see that the Device lending latency is close to local latency, set back only by minor software overhead on the borrower side and the link-traversal latency. NVMe-oF incurs software overhead on both the initiator and the target side, leading to slightly higher latency in the data path.
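The plotted distributions come from the logs written by `write_lat_log`. A minimal sketch of summarising such a log, assuming the per-line format `time_ms, latency, direction, block_size, offset` with latencies in nanoseconds, as emitted by recent fio versions; the sample lines here are made up:

```python
import statistics

# Sample lines standing in for /tmp/lat-read-rand_lat.log;
# substitute open(...).read() on a real run.
sample_log = """\
5, 71424, 0, 4096, 0
10, 73012, 0, 4096, 0
15, 70980, 0, 4096, 0
"""

lat_ns = [int(line.split(",")[1]) for line in sample_log.splitlines()]
print(f"samples: {len(lat_ns)}, "
      f"mean latency: {statistics.mean(lat_ns) / 1000:.1f} us")
```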
### Writes
For write latency, we use the following jobfile.
```ini
[lat-write-rand]
description=Random access, write-only workload for measuring latency
ioengine=sync
numjobs=1
blocksize=4K
readwrite=randwrite
name=lat-write-rand
write_lat_log=/tmp/lat-write-rand
```
## IOPS results

### Reads
```ini
[iops-read-rand]
description=Random access, read-only workload for measuring IOPS
ioengine=io_uring
iodepth=128
numjobs=8
blocksize=4K
readwrite=randread
fixedbufs=1
name=iops-read-rand
write_iops_log=/tmp/iops-read-rand
log_avg_msec=10
```
If we zoom in to display the bulk of the distribution, we see that the performance is close to optimal on all configurations, including Device lending.
### Writes
To mitigate variance when maximizing write IOPS, we set a target IOPS equal to the local maximum using `rate_iops=16750` and `rate_process=poisson`. 16750 IOPS multiplied by 8 threads is 134 K, the mean IOPS we observe when the NVMe is local.
```ini
[iops-write-rand]
description=Random access, write-only workload for measuring IOPS
ioengine=io_uring
iodepth=128
numjobs=8
blocksize=4K
readwrite=randwrite
fixedbufs=1
rate_iops=16750
rate_process=poisson
name=iops-write-rand
write_iops_log=/tmp/iops-write-rand
log_avg_msec=10
```
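The per-job rate target follows directly from the measured local maximum, since fio's `rate_iops` applies per job rather than per group:

```python
# Derive the per-job rate cap from the aggregate local maximum.
numjobs = 8
local_max_iops = 134_000      # measured local random-write IOPS (see above)
per_job = local_max_iops // numjobs
print(per_job)                # 16750, matching rate_iops=16750
```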
Again, we zoom in on the distribution, and see that Device lending performs close to the local baseline and NVMe-oF.
## Throughput results
For throughput, we set a target bandwidth of 7.5 GB/s for reads and 3.0 GB/s for writes and perform a "sweep run", varying block size, number of jobs, and queue depth. We then extract the best-performing configuration, so the configuration for each topology differs.
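The sweep can be sketched as generating one fio invocation per parameter combination. The value grids below are illustrative assumptions, not the exact grids used for the tables that follow:

```python
# Sketch of a sweep-run generator over block size, queue depth, and job count.
from itertools import product

blocksizes = ["128K", "1M", "16M", "64M", "128M"]   # assumed grid
iodepths = [16, 32, 64, 128]                        # assumed grid
numjobs = [1, 4, 8]                                 # assumed grid

cmds = [
    f"fio --blocksize={bs} --iodepth={qd} --numjobs={nj} <path-to-jobfile>"
    for bs, qd, nj in product(blocksizes, iodepths, numjobs)
]
print(len(cmds), "runs")      # 5 * 4 * 3 = 60 runs
print(cmds[0])
```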
### Reads
| NVMe placement | Block size | Queue depth | Number of jobs |
|---|---|---|---|
| Local baseline | 64 M | 128 | 8 |
| Device lending | 128 M | 128 | 1 |
| NVMe-oF (SPDK) | 64 M | 64 | 8 |
| NVMe-oF (kernel) | 128 M | 32 | 8 |
### Writes
| NVMe placement | Block size | Queue depth | Number of jobs |
|---|---|---|---|
| Local baseline | 128 M | 128 | 1 |
| Device lending | 128 M | 128 | 1 |
| NVMe-oF (SPDK) | 128 K | 128 | 4 |
| NVMe-oF (kernel) | 16 M | 16 | 8 |