Performance Testing

Last updated: 05/14/2026

This guide explains how to run performance tests for ray-ascend, including OpenYuanrong (YR) Direct Transport and HCCL collective communication performance tests.

For actual benchmark results and performance analysis, see the Performance Benchmark Report.

Overview

The performance test suite is located in tests/benchmarks/ and supports the following test types:

YR Direct Transport: Tests YR direct tensor transport performance
HCCL Collective Communication: Tests HCCL collective operations performance (Coming Soon)

All tests support the following common configurations:

Placement Modes: Local (single node) and Remote (distributed across nodes)
Devices: NPU and CPU tensors

Common Prerequisites

Required Dependencies

# Install ray-ascend with all features
pip install -e ".[all]"

For Remote Mode Testing

Set up a Ray cluster with two nodes:

Head Node:

ray start --head --resources='{"node:<HEAD_IP>": 1}'

Worker Node:

ray start --address <HEAD_IP>:6379 --resources='{"node:<WORKER_IP>": 1}'

Replace <HEAD_IP> and <WORKER_IP> with actual IP addresses.
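
Before running remote tests, it can help to confirm that both custom resources are visible to the cluster. The sketch below assumes the resource names follow the `node:<IP>` pattern used in the commands above. `missing_node_resources` and `check_cluster` are hypothetical helpers, not part of ray-ascend. `ray.cluster_resources()` returns the cluster's total resources as a dictionary.

```python
def missing_node_resources(resources, node_ips):
    """Return the IPs whose 'node:<IP>' resource is absent from `resources`."""
    return [ip for ip in node_ips if f"node:{ip}" not in resources]


def check_cluster(node_ips):
    """Attach to a running Ray cluster and report missing node resources."""
    import ray  # imported lazily so the helper above works without Ray

    ray.init(address="auto")  # connect to the already started cluster
    return missing_node_resources(ray.cluster_resources(), node_ips)
```

Calling `check_cluster` on the head node with the head and worker IPs should return an empty list when both nodes have joined with the expected resources.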

YR Direct Transport Performance Test

Overview

Measures the throughput and latency of tensor transfers between Ray actors over YR Direct Transport.

Additional Prerequisites

YR Direct Transport requires etcd for service coordination. The test framework automatically starts and manages an etcd instance during test execution:

Local mode: etcd runs on localhost (127.0.0.1)
Remote mode: etcd runs on the head node

No manual etcd installation or configuration is required; the test harness manages the etcd lifecycle.

Parameters

Parameter	Type	Choices	Default	Description
`--backend`	str	yr, hccl	required	Transport backend
`--init-mode`	str	etcd, metastore	etcd	YR backend initialization mode
`--placement`	str	local, remote	local	Test deployment mode
`--device`	str	npu, cpu	cpu	Device to run tensors on
`--head-node-ip`	str	-	-	IP address of Ray head node (required for remote mode)
`--worker-node-ip`	str	-	-	IP address of worker node (required for remote mode)
`--tensor-count`	int	-	1	Number of tensors to transport in the list
`--tensor-size-kb`	int	-	1024	Size of each tensor in KB (converted to float32 elements)
`--warmup-times`	int	-	2	Number of warmup iterations before measurement
`--count`	int	-	5	Number of test iterations (results are averaged)
`--config-file`	str	-	-	Path to YAML config file
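
The table above can be read as an argument parser specification. The following is a minimal sketch of such a parser, assuming the flags, choices, and defaults behave exactly as documented. It is not the parser in direct_transport_perftest.py, and `build_parser` and `validate` are hypothetical names. In particular, the real script may accept `--backend` from a config file instead of requiring it on the command line.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags (illustrative only)."""
    p = argparse.ArgumentParser(description="Direct transport perf test (sketch)")
    p.add_argument("--backend", choices=["yr", "hccl"], required=True)
    p.add_argument("--init-mode", choices=["etcd", "metastore"], default="etcd")
    p.add_argument("--placement", choices=["local", "remote"], default="local")
    p.add_argument("--device", choices=["npu", "cpu"], default="cpu")
    p.add_argument("--head-node-ip")
    p.add_argument("--worker-node-ip")
    p.add_argument("--tensor-count", type=int, default=1)
    p.add_argument("--tensor-size-kb", type=int, default=1024)
    p.add_argument("--warmup-times", type=int, default=2)
    p.add_argument("--count", type=int, default=5)
    p.add_argument("--config-file")
    return p


def validate(args):
    """Remote placement needs both node addresses, per the table above."""
    if args.placement == "remote" and not (args.head_node_ip and args.worker_node_ip):
        raise ValueError("remote placement requires --head-node-ip and --worker-node-ip")
    return args


args = validate(build_parser().parse_args(["--backend", "yr"]))
print(args.placement, args.device, args.tensor_size_kb)  # local cpu 1024
```

Note that argparse maps hyphenated flags such as `--head-node-ip` to underscored attribute names such as `head_node_ip`, which matches the key style used in the YAML configuration file.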

Running the Test

Using Command Line Arguments

Minimal invocation (all other parameters use their defaults):

python tests/benchmarks/direct_transport_perftest.py --backend yr

Detailed settings:

# Local mode with single tensor
python tests/benchmarks/direct_transport_perftest.py \
  --backend yr \
  --placement local \
  --device cpu \
  --tensor-count 1 \
  --tensor-size-kb 1024 \
  --warmup-times 2 \
  --count 5


# Local mode with tensor list (transport 10 tensors)
python tests/benchmarks/direct_transport_perftest.py \
  --backend yr \
  --placement local \
  --device cpu \
  --tensor-count 10 \
  --tensor-size-kb 1024 \
  --warmup-times 2 \
  --count 5


# Remote mode (head: NODE_A; worker: NODE_B)
python tests/benchmarks/direct_transport_perftest.py \
  --backend yr \
  --placement remote \
  --device cpu \
  --head-node-ip NODE_A \
  --worker-node-ip NODE_B \
  --tensor-count 1 \
  --tensor-size-kb 1024 \
  --warmup-times 2 \
  --count 5

Using Configuration File

Create a YAML configuration file (e.g., config.yaml):

# Transport backend: 'yr' for YR Direct Transport, 'hccl' for HCCL
backend: yr
# YR backend initialization mode: 'etcd' or 'metastore'
init_mode: metastore
# Test deployment mode: 'local' (same node) or 'remote' (distributed)
placement: remote
# Device to run tensors on: 'npu' or 'cpu'
device: npu
# IP address of the Ray head node (required for remote mode)
head_node_ip: NODE_A
# IP address of the worker node (required for remote mode)
worker_node_ip: NODE_B
# Number of tensors to transport in the list
tensor_count: 1
# Size of each tensor in KB
tensor_size_kb: 1000
# Number of warmup iterations before measurement
warmup_times: 5
# Number of iterations for the actual test (results are averaged)
count: 100

Then run:

python tests/benchmarks/direct_transport_perftest.py --config-file custom_config_path/config.yaml

Command-line arguments override config file settings.
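
The precedence rule can be illustrated with a small layering function. This is a sketch of one common way to implement "command line overrides config file", assuming that an option left off the command line is represented as `None`. `merge_settings` is a hypothetical helper, and the benchmark script may resolve settings differently.

```python
def merge_settings(defaults, file_config, cli_overrides):
    """Layer settings: defaults < config file < command line.

    A CLI value of None is treated as "not given on the command line".
    """
    merged = dict(defaults)
    merged.update(file_config)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged


defaults = {"placement": "local", "device": "cpu", "count": 5}
file_config = {"placement": "remote", "device": "npu", "count": 100}
cli_overrides = {"placement": None, "device": None, "count": 10}

settings = merge_settings(defaults, file_config, cli_overrides)
print(settings)  # {'placement': 'remote', 'device': 'npu', 'count': 10}
```

Here `count` comes from the command line, while `placement` and `device` fall back to the config file because no command-line value was given.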

Example Configuration

An example configuration file is provided at tests/benchmarks/direct_transport_config.yaml.

HCCL Collective Communication Performance Test

Overview

Tests HCCL collective operations performance (all-reduce, all-gather, broadcast, etc.) across multiple workers.

Note: This test is under development and will be available in a future release.

Common Test Output

All performance tests output the following metrics:

Average Transport Time: In seconds
Latency Percentiles: P50, P75, P90, P95, P99 (in seconds)
Throughput Statistics: Average, Min, Max throughput in Gb/s
Total Data Size: In GB
Number of Iterations: Test iterations performed

Example Output

============================================================
YR LOCAL BANDWIDTH TEST SUMMARY
============================================================
Total Data Size: 0.001024 GB
Number of Iterations: 5
Average Transport Time: 0.00012345s
Average Transport Throughput: 66.35714286 Gb/s
Min Transport Throughput: 60.00000000 Gb/s
Max Transport Throughput: 70.00000000 Gb/s
P50 Latency: 0.00012000s
P75 Latency: 0.00012500s
P90 Latency: 0.00012800s
P95 Latency: 0.00013000s
P99 Latency: 0.00013200s
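
The reported figures are related by simple arithmetic: throughput in Gb/s is the data size in GB times 8, divided by the transfer time in seconds. For the summary above, 0.001024 GB × 8 / 0.00012345 s ≈ 66.36 Gb/s, which is close to the reported average. The sketch below shows one way such a summary could be derived from per-iteration timings. It assumes nearest-rank percentiles and an average taken over per-iteration throughputs. The suite's actual method is not documented here, and `percentile` and `summarize` are hypothetical helpers.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct in 0..100)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(durations_s, data_size_gb):
    """Summarize per-iteration transfer times for a fixed payload size."""
    ordered = sorted(durations_s)
    # GB -> Gb is a factor of 8; each iteration gets its own throughput.
    throughputs = [data_size_gb * 8 / t for t in durations_s]
    return {
        "avg_time_s": sum(durations_s) / len(durations_s),
        "avg_gbps": sum(throughputs) / len(throughputs),
        "min_gbps": min(throughputs),
        "max_gbps": max(throughputs),
        "latency_s": {p: percentile(ordered, p) for p in (50, 75, 90, 95, 99)},
    }


# Made-up timings for a 0.001024 GB payload, not measured results.
summary = summarize([0.00012, 0.00013, 0.000125, 0.000128, 0.000122], 0.001024)
print(f"{summary['avg_gbps']:.2f} Gb/s, P50 {summary['latency_s'][50]:.6f}s")
```

With only a handful of iterations the upper percentiles collapse onto the slowest sample, so a larger `--count` gives more meaningful P95 and P99 values.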

Architecture

The performance test framework consists of:

base_perftest.py: Abstract base class with common test infrastructure
direct_transport_perftest.py: YR Direct Transport specific implementation
direct_transport_config.yaml: Example configuration file

Test Flow

Initialize etcd service for coordination
Start DataSystem actors on target nodes
Create test actors (sender and receiver)
Run warm-up iterations
Run performance measurements
Calculate and display statistics
Cleanup resources
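
The warm-up and measurement steps of this flow can be sketched as a small timing loop. This is a minimal illustration, assuming the transfer is exposed as a blocking callable. `run_benchmark` is a hypothetical function, not the implementation in base_perftest.py, and the defaults mirror `--warmup-times` and `--count`.

```python
import time


def run_benchmark(transfer, warmup_times=2, count=5):
    """Call `transfer` for warm-up, then time `count` measured iterations."""
    for _ in range(warmup_times):
        transfer()  # warm-up results are discarded
    durations = []
    for _ in range(count):
        start = time.perf_counter()
        transfer()
        durations.append(time.perf_counter() - start)
    return durations


calls = []
durations = run_benchmark(lambda: calls.append(1), warmup_times=2, count=5)
print(len(calls), len(durations))  # 7 5
```

Warm-up iterations run the same code path as measured ones but are excluded from the statistics, so one-time costs such as connection setup do not skew the reported latency.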