fio

fio is one of the most widely used storage and file system benchmarking tools, supporting a range of I/O engines and configurations. This page presents benchmarking results for NVMe Device lending and compares them with local setups and NVMe-oF solutions.

Installation

Follow the instructions on https://github.com/axboe/fio.

Benchmarking setup

We use our 4-node server-grade experimental setup with the borrower on node A and the lender on node B.

Group     Category               Details
--------  ---------------------  -----------------------------------------
System    Topology               4 nodes named A, B, C, D
System    CPU                    Dual-socket AMD EPYC 7763 64-Core
System    Motherboard            Supermicro H12DSU-iN
System    Model name             Supermicro AS -2024US-TRT
System    Memory                 16x 128 GiB DDR4 DIMMs at 3.2 GHz
System    Operating system       Ubuntu 22.04-hwe
System    Kernel                 Linux 6.8
PCIe      Host adapter cards     MXH930 PCIe 4.0 NTB Host Adapter
PCIe      Switch                 MXS924 PCIe 4.0 Switch
PCIe      PCIe cables            PCIe 4.0 SFF-8644 cables
PCIe      Driver version         5.25
GPUs      NVIDIA GPUs            2x A100 40 GB (nodes A and B)
GPUs      AMD GPUs               2x AMD Instinct MI210 (nodes C and D)
GPUs      NVIDIA driver version  590
GPUs      CUDA toolkit version   11.8
GPUs      NCCL version           2.28.9-1
GPUs      nccl-tests version     2.16.5
Storage   NVMe SSD               PM1733 Entry NVMe PCIe SSD (1.92 TB)
Storage   fio version            3.41
Storage   SPDK version           26.01
RDMA      RDMA NIC               Mellanox ConnectX-6 200 Gbps (InfiniBand)
RDMA      RDMA switch            NVIDIA QM8700
RDMA      OFED version           24.10

Running fio

We define a total of six fio jobfiles. These jobfiles measure read and write latency, random-access IOPS, and sequential throughput, and are tuned to maximize IOPS and throughput when the NVMe is local. We then run all six jobfiles with the following NVMe placements:

  • Local baseline: Local NVMe

  • Device lending: Remote NVMe with Device lending

  • NVMe-oF (SPDK): Remote NVMe with the Linux RDMA driver on the initiator and the SPDK driver on the target

  • NVMe-oF (kernel): Remote NVMe with the Linux RDMA driver on both initiator and target
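For the two NVMe-oF placements, the initiator must attach the remote namespace before fio can open it. With the kernel initiator this is done with nvme-cli; the address, port, and subsystem NQN below are placeholders, not values from our setup:

```shell
# Load the RDMA transport on the initiator (kernel NVMe-oF).
modprobe nvme-rdma

# Discover and connect to the remote subsystem; substitute your
# target's actual transport address, service port, and NQN.
nvme discover -t rdma -a <target-ip> -s 4420
nvme connect  -t rdma -a <target-ip> -s 4420 -n <subsystem-nqn>

# The remote namespace now appears as a local block device
# (e.g. /dev/nvme1n1) and can be passed to fio via --filename.
```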

We execute each job with the following arguments:

$ fio --filename /dev/<nvme-namespace>  \
      --direct=1                        \
      --buffered=0                      \
      --invalidate=1                    \
      --randrepeat=1                    \
      --random_generator=lfsr           \
      --norandommap                     \
      --thread                          \
      --cpus_allowed_policy=split       \
      --time_based=1                    \
      --ramp_time=5s                    \
      --runtime=30s                     \
      --group_reporting                 \
      --per_job_logs=0                  \
      --output-format=json              \
      --output=<output>.json            \
        <path-to-jobfile>

Note

We set the thread affinity to the CPUs on the same CCD as the relevant I/O device. For the local baseline, we bind the benchmarking thread to the CPUs closest to the local NVMe; for Device lending experiments, to the CPUs closest to the NTB host adapter; and for NVMe-oF experiments, to the CPUs closest to the Mellanox HCA. This can be done with taskset, numactl, or fio's cpus_allowed argument.
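One way to find the right CPU set is to read the device's local_cpulist from sysfs and pass it straight to fio. The PCI address below is a placeholder; locate the actual device with lspci:

```shell
# Placeholder PCI address of the I/O device (NVMe, NTB adapter, or HCA).
DEV=0000:41:00.0

# CPUs local to the device's NUMA node.
cat /sys/bus/pci/devices/$DEV/local_cpulist

# Use the list instead of taskset/numactl when invoking fio:
fio --cpus_allowed="$(cat /sys/bus/pci/devices/$DEV/local_cpulist)" ...
```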

Our NVMe SSD delivers the following maximum performance:

Metric                       Vendor reported  Measured
---------------------------  ---------------  ------------
Random read latency          100 μs           ~72 μs
Random write latency         25 μs            ~19 μs
Random read IOPS             800 K            ~1.029 M
Random write IOPS            100 K            ~134 K
Sequential read throughput   7.00 GB/s        ~7.13 GB/s*
Sequential write throughput  2.40 GB/s        ~2.84 GB/s

* 7.13 GB/s is close to the maximum payload bandwidth of a PCIe 4.0 x4 link; the NVMe's sequential read throughput is therefore limited by the link itself.
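As a sanity check on that claim, the raw payload limit of a PCIe 4.0 x4 link can be worked out from the per-lane signaling rate and the 128b/130b line encoding (TLP and DLLP protocol overhead reduces the practically usable figure a bit further, to roughly 7.2 GB/s):

```shell
awk 'BEGIN {
  gts   = 16        # PCIe 4.0: 16 GT/s per lane
  lanes = 4         # NVMe SSDs typically use an x4 link
  enc   = 128/130   # 128b/130b line encoding
  # Divide by 8 to convert Gbit/s to GB/s.
  printf "PCIe 4.0 x4 raw payload limit: %.2f GB/s\n", gts * lanes * enc / 8
}'
```

This prints a limit of about 7.88 GB/s before protocol overhead, so a measured 7.13 GB/s is effectively link-bound.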

Latency results

Reads

For read latency, we use the following jobfile and plot the distribution.

[lat-read-rand]
description=Random access, read-only workload for measuring latency
ioengine=sync
numjobs=1
blocksize=4K
readwrite=randread
name=lat-read-rand
write_lat_log=/tmp/lat-read-rand

We see that the Device lending latency is close to local latency, held back only by minor software overhead on the borrower side and the added link traversal. NVMe-oF adds processing on both the initiator and the target, resulting in slightly higher latency in the data path.

Writes

For write latency, we use the following jobfile.

[lat-write-rand]
description=Random access, write-only workload for measuring latency
ioengine=sync
numjobs=1
blocksize=4K
readwrite=randwrite
name=lat-write-rand
write_lat_log=/tmp/lat-write-rand

IOPS results

Reads

[iops-read-rand]
description=Random access, read-only workload for measuring IOPS
ioengine=io_uring
iodepth=128
numjobs=8
blocksize=4K
readwrite=randread
fixedbufs=1
name=iops-read-rand
write_iops_log=/tmp/iops-read-rand
log_avg_msec=10

If we zoom in to display the bulk of the distribution, we see that the performance is close to optimal on all configurations, including Device lending.

Writes

To mitigate variance when maximizing write IOPS, we set a target IOPS rate equal to the local maximum using rate_iops=16750 and rate_process=poisson. 16750 IOPS across 8 threads gives 134 K, the mean aggregate IOPS we measure when the NVMe is local.
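The per-thread rate follows directly from dividing the local aggregate by the job count:

```shell
# 134 K aggregate IOPS split across the 8 fio jobs gives the
# per-thread value passed to rate_iops.
echo $((134000 / 8))
```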

[iops-write-rand]
description=Random access, write-only workload for measuring IOPS
ioengine=io_uring
iodepth=128
numjobs=8
blocksize=4K
readwrite=randwrite
fixedbufs=1
rate_iops=16750
rate_process=poisson
name=iops-write-rand
write_iops_log=/tmp/iops-write-rand
log_avg_msec=10

Again, we zoom in on the distribution, and see that Device lending performs close to the local baseline and NVMe-oF.

Throughput results

For throughput, we set a target bandwidth of 7.5 GB/s for reads and 3.0 GB/s for writes and do a “sweep run”, varying block size, number of jobs, and queue depth. We then extract the best-performing configuration, which therefore differs between placements.
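A sweep of this kind can be scripted as nested loops over the three parameters. The sketch below only prints the candidate fio invocations, and the value lists and device path are illustrative, not our exact sweep grid; replacing echo with the real command runs the sweep:

```shell
#!/bin/sh
# Hypothetical sweep grid; substitute your own values and device path.
for bs in 128k 1M 16M 64M 128M; do
  for qd in 16 32 64 128; do
    for jobs in 1 4 8; do
      echo fio --filename=/dev/nvme0n1 --direct=1 --rw=read \
               --blocksize=$bs --iodepth=$qd --numjobs=$jobs \
               --output=sweep-$bs-$qd-$jobs.json
    done
  done
done
```

Each loop iteration emits one candidate configuration (60 in this grid); the best performer is then read out of the JSON results.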

Reads

Placement         Block size  Queue depth  Number of jobs
----------------  ----------  -----------  --------------
Local baseline    64M         128          8
Device lending    128M        128          1
NVMe-oF (SPDK)    64M         64           8
NVMe-oF (kernel)  128M        32           8

Writes

Placement         Block size  Queue depth  Number of jobs
----------------  ----------  -----------  --------------
Local baseline    128M        128          1
Device lending    128M        128          1
NVMe-oF (SPDK)    128K        128          4
NVMe-oF (kernel)  16M         16           8
