Performance tuning

The performance-tuning guidance in this section is Linux-focused. However, similar tuning is usually required on other operating systems and should be applied wherever equivalent controls exist.

CPU recommendations

NUMA topology

On systems with multiple NUMA nodes, ensure applications are pinned to the NUMA node local to the Dolphin adapter. Failing to do so can introduce significant latency penalties due to accessing memory located on another NUMA node.

On Linux, this can be achieved using: numactl --membind=<NUMA> --cpunodebind=<NUMA> ./program
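
The NUMA node local to the adapter can usually be read from sysfs. A short sketch, assuming the adapter sits at the hypothetical PCI address 0000:41:00.0 (locate the real address with lspci) and reports node 1:

    # Hypothetical PCI address; find the Dolphin adapter with lspci first
    cat /sys/bus/pci/devices/0000:41:00.0/numa_node

    # Bind memory and CPUs to the reported node (node 1 here)
    numactl --membind=1 --cpunodebind=1 ./program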

CPU C- and P-states

Modern CPUs expose both idle states (C-states) and performance states (P-states).

C-states control power saving during idle periods, with C0 being the active state. Transitions from deeper states back to C0 can be costly, depending on the architecture. For example, AMD recommends disabling deeper C-states (such as C2) when using low-latency networking products.
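
On Linux, one way to restrict C-states at runtime is the cpupower utility; a sketch, assuming the kernel's cpuidle interface is available:

    # Inspect the available idle states and their exit latencies
    cpupower idle-info

    # Disable all idle states with an exit latency above 10 microseconds
    cpupower idle-set -D 10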

P-states control performance scaling while the CPU is active, with P0 representing the highest guaranteed performance level. On some systems, optimal performance is achieved by pinning the CPU to P0.
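
On Linux, a common way to achieve this is the performance governor, again via cpupower; a sketch (the 3.0GHz value is a hypothetical example, substitute your CPU's nominal frequency):

    # Select the performance governor on all CPUs
    cpupower frequency-set -g performance

    # Optionally raise the minimum frequency to discourage downclocking
    cpupower frequency-set -d 3.0GHz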

Boosting

Many processors support frequency boosting (e.g., Intel Turbo Boost or AMD Precision Boost), allowing clock speeds above P0 when sufficient thermal and electrical headroom exists.

Enabling boosting typically improves peak performance but may reduce determinism and increase run-to-run variance.
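
On Linux, boosting can typically be toggled through sysfs; a sketch, noting that the exact path depends on the active scaling driver:

    # Generic cpufreq boost knob (e.g., acpi-cpufreq)
    echo 1 > /sys/devices/system/cpu/cpufreq/boost    # enable
    echo 0 > /sys/devices/system/cpu/cpufreq/boost    # disable

    # Equivalent knob when intel_pstate is active (note the inverted logic)
    echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo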

CPU frequency scaling

Some modern systems offer finer-grained frequency control than traditional P-states. On such systems, optimal performance may be achieved by enabling vendor-specific scaling drivers rather than forcing P0.

Examples include intel_pstate and amd-pstate, which are supported by newer Linux kernels.
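
The active driver can be checked via sysfs, and amd-pstate can be requested through a kernel boot parameter on kernels that support it; a sketch:

    # Show which frequency-scaling driver is currently in use
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver

    # Kernel command-line option selecting amd-pstate in active mode
    # (append via your bootloader configuration, e.g., GRUB)
    amd_pstate=active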

AMD EPYC tuning

Newer AMD EPYC processors expose a large number of tunable BIOS parameters. Optimal settings are workload- and platform-dependent, so experimentation is recommended. That said, the following settings have proven to be good baseline choices for performance-focused and low-latency workloads. Refer to AMD’s HPC Workload Tuning Guides for more details.

  • Determinism Control: Manual

  • Determinism Slider: Power

    Takes advantage of part-to-part yield variability and can provide improved runtime performance.

  • APBDIS: 1

    Disables Algorithmic Performance Boost to allow setting the Fixed SoC P-state.

  • Fixed SoC P-state: P0

    Pins the SoC to its highest performance state.

  • DF C-states: Disabled

    Prevents the Data Fabric from entering low-power states, reducing latency.

  • cTDP Control: Manual

  • cTDP: Maximum

    Allows the CPU to operate at its highest configured thermal design power.

  • Package Power Limit Control: Manual

  • Package Power Limit: Maximum

    Removes package-level power constraints that may throttle performance.

TCP/IP and SuperSockets tuning

IOMMU

For best network performance, the IOMMU should generally be disabled unless it is required for virtualization or device isolation.
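
On Linux, the IOMMU is typically controlled via kernel boot parameters (some platforms also expose a BIOS setting); a sketch of common options:

    # Kernel command-line options, depending on platform
    amd_iommu=off       # AMD platforms
    intel_iommu=off     # Intel platforms
    iommu=pt            # alternative: pass-through mode, often a good compromise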

Simultaneous multithreading (SMT)

Experiment with enabling or disabling SMT. Leaving SMT enabled can introduce performance variability, while disabling it may reduce throughput on systems with a limited number of cores.
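
On Linux, SMT can be toggled at runtime through sysfs, which makes such experiments straightforward; a sketch:

    # Check the current SMT state
    cat /sys/devices/system/cpu/smt/control

    # Disable and re-enable SMT without rebooting
    echo off > /sys/devices/system/cpu/smt/control
    echo on > /sys/devices/system/cpu/smt/control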

Socket buffer sizes

On Linux, network throughput can be limited by socket buffer sizes. Default values are intentionally conservative and may severely cap performance on modern hardware.

To avoid this bottleneck, increase:

  • net.core.rmem_max

  • net.core.wmem_max

A typical value is 268435456 (256 MiB).
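
A sketch of applying these with sysctl, at runtime and persistently (the file name under /etc/sysctl.d is a hypothetical example):

    # Apply at runtime
    sysctl -w net.core.rmem_max=268435456
    sysctl -w net.core.wmem_max=268435456

    # Persist across reboots
    printf 'net.core.rmem_max = 268435456\nnet.core.wmem_max = 268435456\n' \
        > /etc/sysctl.d/99-dolphin-tuning.conf
    sysctl --system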

TCP auto-tuning ranges

Linux dynamically scales TCP buffer sizes within the ranges defined by:

  • net.ipv4.tcp_rmem

  • net.ipv4.tcp_wmem

The defaults are conservative and may be insufficient for high-throughput links. Consider using more aggressive settings, for example: 4096 87380 134217728 (minimum, default, and maximum buffer size in bytes, the maximum being 128 MiB).
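
A sketch of applying these ranges with sysctl, using the example values above:

    sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
    sysctl -w net.ipv4.tcp_wmem="4096 87380 134217728"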

Congestion control

The default congestion control algorithm is usually reasonable. Both cubic and bbr are good choices for the high-throughput, low-latency networks that Dolphin provides.
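
A sketch of inspecting and switching the algorithm with sysctl (on some kernels bbr must first be loaded as a module):

    # Show the available algorithms and the one currently in use
    sysctl net.ipv4.tcp_available_congestion_control
    sysctl net.ipv4.tcp_congestion_control

    # Switch to bbr
    modprobe tcp_bbr
    sysctl -w net.ipv4.tcp_congestion_control=bbr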

SISCI tuning

On some generations of AMD processors (e.g., 7003-series EPYC), SISCI PIO performance can be significantly improved by using the provided hand-written memory copy functions.

These can be benchmarked using: /opt/DIS/bin/scimemcopybench

Identify the function that delivers the desired performance and select it by calling SetSciMemCopyFunction(function_number) before invoking SCIMemCpy.
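
A minimal C sketch of the call ordering, assuming a SISCI program in which a sequence and a remote-segment mapping have already been set up (via SCIInitialize, SCIOpen, SCIConnectSegment, SCIMapRemoteSegment, and SCICreateMapSequence); the function number 3 is a hypothetical choice, and error handling is elided:

    #include <stddef.h>
    #include "sisci_api.h"

    /* 'sequence' and 'remoteMap' are assumed to come from the usual
       SISCI setup calls listed above. */
    static void copy_with_tuned_function(sci_sequence_t sequence,
                                         sci_map_t remoteMap,
                                         void *localBuf, size_t size)
    {
        sci_error_t error;

        /* Hypothetical function number; pick the best performer
           reported by /opt/DIS/bin/scimemcopybench */
        SetSciMemCopyFunction(3);

        /* Copy localBuf into the mapped remote segment at offset 0 */
        SCIMemCpy(sequence, localBuf, remoteMap, 0, size, 0 /* flags */, &error);
    }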

Additional gains may be possible by implementing custom copy functions and compiling them with architecture-specific optimizations.