How to write optimized SISCI code

The SISCI API is a rich and powerful interface for deploying applications over a remote memory access network. This section contains suggestions and ideas to help application programmers optimize performance. Which of them apply depends on the nature of the application and system. Please consider the following:

  1. Most of the SISCI API setup and management functions either communicate with remote systems or access local adapter CSRs (control and status registers). These functions should normally not be used in the performance-critical inner loop of the application. Allocating memory segments, connecting to remote segments, creating interrupts, etc. should ideally be done when the application starts, and the resources should be removed before the application exits.

  2. Avoid interrupts when not needed. The SISCI API offers a set of functions to create and trigger remote interrupts. The SISCI driver will use remote memory access internally to trigger a remote interrupt, but the system interrupt latency and overhead will be added to the time budget.

    1. Use polling to avoid interrupts: Polling a local SISCI memory address is a very low-cost operation. The first read loads the data into the local CPU cache; all following poll operations only hit the cache until the remote update arrives, at which point the cache line is invalidated and re-fetched. Consider whether the application can do some polling between other tasks.

    2. Combine polling and interrupts. Poll for a short while before going to sleep – waiting for a remote interrupt.

    3. Use remote memory access to inform the remote node that an interrupt is required to wake up the other process.

    4. SCICreateInterrupt() should be called once, not in every loop iteration.

  3. Optimized error checking. The SCICheckSequence() operation typically requires 1-2us to complete. The PCI Express network is a reliable network, and data will not be lost unless there is a real problem, such as broken hardware, an unplugged cable or a powered-down switch.

    1. Try to move SCICheckSequence() out of the inner communication loop. The application may be able to use a more relaxed error recovery model.

    2. For small messages, avoid SCICheckSequence(); instead, implement a memory checksum mechanism in which the sender calculates a checksum that the receiver verifies.

  4. Use PIO as an alternative to DMA. Transferring small amounts of data using DMA has a lot of overhead compared to a PIO posted write, so small transfers will in most cases benefit from using PIO. For large transfers, DMA incurs less CPU overhead.

    1. Implement a mixed transport model using PIO for small transfers and DMA for larger transfers.

    2. If you are doing direct remote writes using pointer arithmetic, try to optimize the size of each write: send one 8-byte word (quadword) instead of many chars, etc.

    3. The use of SCIMemCpy() will ensure optimal performance, especially if large amounts of data are copied to a remote system. If SCIMemCpy() is not desired, then flush the CPU buffers with SCICheckSequence(NULL, SCI_FLAG_FLUSH_CPU_BUFFERS_ONLY) or SCIFlush().
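
The advice in item 4.2 about write size can be sketched in plain C. This is only an illustration under stated assumptions: `copy_wide` is a hypothetical helper and not part of the SISCI API, `dst` is assumed to be 8-byte aligned, and here it may be any ordinary buffer standing in for a pointer into a mapped remote segment. SCIMemCpy() remains the recommended path; a real application writing through its own pointers would still need the flush described in item 4.3.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of the SISCI API.
 * Copies len bytes to dst using 8-byte stores wherever possible.
 * Assumption: dst is 8-byte aligned. In a real application dst would
 * point into a mapped remote segment.
 * Returns the number of 8-byte stores that were issued. */
static size_t copy_wide(volatile void *dst, const void *src, size_t len)
{
    volatile uint64_t *d64 = (volatile uint64_t *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t words = len / sizeof(uint64_t);
    size_t tail = len % sizeof(uint64_t);

    for (size_t i = 0; i < words; i++) {
        uint64_t w;
        memcpy(&w, s + i * sizeof w, sizeof w); /* src may be unaligned */
        d64[i] = w;                             /* one 8-byte store */
    }

    /* The remaining 0..7 bytes are written one byte at a time. */
    volatile unsigned char *d8 =
        (volatile unsigned char *)dst + words * sizeof(uint64_t);
    for (size_t i = 0; i < tail; i++)
        d8[i] = s[words * sizeof(uint64_t) + i];

    return words;
}
```

For a 19-byte message this issues two 8-byte stores and three single-byte stores instead of 19 single-byte stores.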

Note

If balancing between PIO and DMA is implemented, do not assume that the balance will be kept across heterogeneous hardware. Thresholds do change with CPU and IO chipsets.
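
One way to respect the note above is to treat the PIO/DMA crossover as a value measured at startup rather than a compile-time constant. The following is a minimal sketch under stated assumptions: `choose_path` and `pick_dma_threshold` are hypothetical helpers (not SISCI functions), and the timing arrays are assumed to come from the application's own measurements on the target hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; none of this is part of the SISCI API. */
enum xfer_path { XFER_PIO, XFER_DMA };

/* Transfers shorter than dma_threshold bytes use PIO, all others use DMA. */
static enum xfer_path choose_path(size_t len, size_t dma_threshold)
{
    return len < dma_threshold ? XFER_PIO : XFER_DMA;
}

/* Derives the threshold from timings measured on the current system.
 * sizes[] holds the probed transfer sizes in ascending order; pio_us[] and
 * dma_us[] hold the measured time for each size.
 * Returns the smallest probed size at which DMA was at least as fast as
 * PIO, or SIZE_MAX if PIO was faster at every probed size. */
static size_t pick_dma_threshold(const size_t *sizes, const double *pio_us,
                                 const double *dma_us, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (dma_us[i] <= pio_us[i])
            return sizes[i];
    }
    return SIZE_MAX;
}
```

The application would call `pick_dma_threshold()` once during setup and pass the result to `choose_path()` for each transfer, so that the same binary adapts to different CPUs and IO chipsets.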

  5. Implement a way to benchmark your transport and compare the results with the standard Dolphin SISCI benchmarks found in the software distribution (scibench2, scipp, intr_bench, dma_bench and reflective_bench). Review the code of the Dolphin benchmarks if you do not reach similar performance.

    1. Request support from Dolphin if you do not reach your expected results.

Using Dolphin Express PX hardware, data can be transmitted to remote memory in 0.54us. With the techniques listed above, message passing on top of the SISCI API can be implemented with 1us latency for a 1-byte message.
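
To illustrate how polling (item 2.1) and a sender-computed checksum (item 3.2) can combine into a small-message protocol, here is a minimal sketch in plain C. Everything in it is an assumption made for illustration: the `msg_slot` layout, the 48-byte limit, the function names and the choice of an Adler-32 checksum are not taken from the SISCI API. The sketch runs both sides on one ordinary struct; in a real application the slot would live in a SISCI segment, the sender would write through a mapped remote pointer, and it would have to make sure the payload is flushed and visible before the sequence number (see item 4.3).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MSG_MAX        48   /* hypothetical limit for "small" messages */
#define MSG_NOT_READY (-1)
#define MSG_CORRUPT   (-2)

/* Hypothetical message slot layout. */
struct msg_slot {
    uint8_t  payload[MSG_MAX];
    uint32_t len;
    uint32_t checksum;
    volatile uint64_t seq;  /* written last by the sender, polled by the receiver */
};

/* Adler-32 checksum over the payload. */
static uint32_t msg_checksum(const uint8_t *data, size_t len)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/* Sender side: fill the slot, then publish it by writing seq last.
 * Returns 0 on success, -1 if the message is too large. */
static int msg_send(struct msg_slot *slot, uint64_t seq,
                    const void *data, size_t len)
{
    if (len > MSG_MAX)
        return -1;
    memcpy(slot->payload, data, len);
    slot->len = (uint32_t)len;
    slot->checksum = msg_checksum((const uint8_t *)data, len);
    slot->seq = seq;
    return 0;
}

/* Receiver side: poll up to max_spins times for sequence number `expect`.
 * out must have room for MSG_MAX bytes.
 * Returns the payload length, MSG_NOT_READY if the message did not
 * arrive in time, or MSG_CORRUPT if verification failed. */
static int msg_poll(const struct msg_slot *slot, uint64_t expect,
                    void *out, unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++) {
        if (slot->seq != expect)
            continue;
        uint32_t len = slot->len;
        if (len > MSG_MAX ||
            msg_checksum(slot->payload, len) != slot->checksum)
            return MSG_CORRUPT;
        memcpy(out, slot->payload, len);
        return (int)len;
    }
    return MSG_NOT_READY;
}
```

When `msg_poll()` returns `MSG_NOT_READY`, the caller can do other work and poll again, or fall back to waiting for a remote interrupt as described in item 2.2. On `MSG_CORRUPT` the application would apply its own recovery model, for example asking the sender to retransmit.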