Performance Testing¶
Last updated: 03/30/2026
This guide explains how to run performance tests for ray-ascend, including OpenYuanrong (YR) Direct Transport and HCCL collective communication performance tests.
Overview¶
The performance test suite is located in tests/benchmarks/ and supports the following test types:
- YR Direct Transport: Tests YR direct tensor transport performance
- HCCL Collective Communication: Tests HCCL collective operations performance (Coming Soon)
All tests support the following common configurations:
- Placement Modes: Local (single node) and Remote (distributed across nodes)
- Devices: NPU and CPU tensors
Common Prerequisites¶
Required Dependencies¶
For Remote Mode Testing¶
Set up a Ray cluster with two nodes:
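Head Node (for example, using Ray's default port 6379):
ray start --head --node-ip-address=<HEAD_IP> --port=6379
Worker Node:
ray start --address=<HEAD_IP>:6379 --node-ip-address=<WORKER_IP>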
Replace <HEAD_IP> and <WORKER_IP> with actual IP addresses.
YR Direct Transport Performance Test¶
Overview¶
This test measures the throughput and latency of tensor transfers between Ray actors over YR Direct Transport.
Additional Prerequisites¶
YR Direct Transport requires etcd for service coordination. The test framework automatically starts and manages an etcd instance during test execution:
- Local mode: etcd runs on localhost (127.0.0.1)
- Remote mode: etcd runs on the head node
No manual etcd installation or configuration is required; the test harness manages the etcd lifecycle.
Parameters¶
| Parameter | Type | Choices | Default | Description |
|---|---|---|---|---|
| --backend | str | yr, hccl | required | Transport backend |
| --placement | str | local, remote | local | Test deployment mode |
| --device | str | npu, cpu | cpu | Device to run tensors on |
| --head-node-ip | str | - | - | IP address of Ray head node (required for remote mode) |
| --worker-node-ip | str | - | - | IP address of worker node (required for remote mode) |
| --tensor-size-kb | int | - | 1024 | Tensor size in KB (converted to a number of float32 elements) |
| --warmup-times | int | - | 2 | Number of warmup iterations before measurement |
| --count | int | - | 5 | Number of test iterations (results are averaged) |
| --config-file | str | - | - | Path to YAML config file |
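The --tensor-size-kb description says the size is converted to float32 elements. A minimal sketch of that conversion follows. The function name `float32_elements` and the `bytes_per_kb` parameter are hypothetical, and the choice of 1 KB = 1024 bytes is an assumption; the test script may use 1000 bytes per KB instead.

```python
FLOAT32_BYTES = 4  # one float32 element occupies 4 bytes


def float32_elements(tensor_size_kb: int, bytes_per_kb: int = 1024) -> int:
    """Return how many float32 elements fit in tensor_size_kb kilobytes.

    bytes_per_kb is an assumption (1024 here); pass 1000 for decimal kilobytes.
    """
    if tensor_size_kb <= 0:
        raise ValueError("tensor_size_kb must be positive")
    return tensor_size_kb * bytes_per_kb // FLOAT32_BYTES


# Default --tensor-size-kb value from the table above.
print(float32_elements(1024))  # 262144
```

With the default of 1024 KB and binary kilobytes, the tensor holds 262144 float32 elements.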
Running the Test¶
Using Command Line Arguments¶
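A simple case (only the required --backend argument; every other parameter keeps its default):
python tests/benchmarks/direct_transport_perftest.py --backend yr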
Detailed settings:
# Local mode
python tests/benchmarks/direct_transport_perftest.py \
--backend yr \
--placement local \
--device cpu \
--tensor-size-kb 1024 \
--warmup-times 2 \
--count 5
# Remote mode (head: NODE_A, worker: NODE_B)
python tests/benchmarks/direct_transport_perftest.py \
--backend yr \
--placement remote \
--device cpu \
--head-node-ip NODE_A \
--worker-node-ip NODE_B \
--tensor-size-kb 1024 \
--warmup-times 2 \
--count 5
Using Configuration File¶
Create a YAML configuration file (e.g., config.yaml):
# Transport backend: 'yr' for YR Direct Transport, 'hccl' for HCCL
backend: yr
# Test deployment mode: 'local' (same node) or 'remote' (distributed)
placement: remote
# Device to run tensors on: 'npu' or 'cpu'
device: npu
# IP address of the Ray head node (required for remote mode)
head_node_ip: NODE_A
# IP address of the worker node (required for remote mode)
worker_node_ip: NODE_B
# Total tensor size in KB
tensor_size_kb: 1000
# Number of warmup iterations before measurement
warmup_times: 5
# Number of iterations for the actual test (results are averaged)
count: 100
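Then run the test with --config-file:
python tests/benchmarks/direct_transport_perftest.py --config-file config.yaml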
Command-line arguments override config file settings.
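The precedence rule can be illustrated with a short sketch: built-in defaults first, then the config file, then any argument given on the command line. This is not the test script's actual code. `DEFAULTS`, `build_parser`, and `resolve_settings` are hypothetical names, and `file_config` is a plain dict standing in for the parsed YAML file.

```python
import argparse

# Defaults taken from the parameter table; the dict itself is hypothetical.
DEFAULTS = {"placement": "local", "device": "cpu",
            "tensor_size_kb": 1024, "warmup_times": 2, "count": 5}


def build_parser() -> argparse.ArgumentParser:
    # default=None distinguishes "not given on the CLI" from a real value.
    p = argparse.ArgumentParser()
    p.add_argument("--backend", choices=["yr", "hccl"], default=None)
    p.add_argument("--placement", choices=["local", "remote"], default=None)
    p.add_argument("--device", choices=["npu", "cpu"], default=None)
    p.add_argument("--head-node-ip", default=None)
    p.add_argument("--worker-node-ip", default=None)
    p.add_argument("--tensor-size-kb", type=int, default=None)
    p.add_argument("--warmup-times", type=int, default=None)
    p.add_argument("--count", type=int, default=None)
    return p


def resolve_settings(cli_args: argparse.Namespace, file_config: dict) -> dict:
    """Merge settings so that CLI values win over file values, which win over defaults."""
    settings = dict(DEFAULTS)
    settings.update({k: v for k, v in file_config.items() if v is not None})
    settings.update({k: v for k, v in vars(cli_args).items() if v is not None})
    if settings.get("backend") is None:
        raise ValueError("backend is required")
    if settings["placement"] == "remote":
        missing = [k for k in ("head_node_ip", "worker_node_ip") if not settings.get(k)]
        if missing:
            raise ValueError("remote mode requires: " + ", ".join(missing))
    return settings


# Stand-in for the parsed YAML config file.
file_config = {"backend": "yr", "placement": "remote",
               "head_node_ip": "NODE_A", "worker_node_ip": "NODE_B", "count": 100}
cli_args = build_parser().parse_args(["--count", "10"])
settings = resolve_settings(cli_args, file_config)
print(settings["count"], settings["placement"], settings["warmup_times"])  # 10 remote 2
```

Here count comes from the command line, placement from the file, and warmup_times from the defaults.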
Example Configuration
An example configuration file is provided at tests/benchmarks/direct_transport_config.yaml.
HCCL Collective Communication Performance Test¶
Overview¶
Tests HCCL collective operations performance (all-reduce, all-gather, broadcast, etc.) across multiple workers.
Note: This test is under development and will be available in future releases.
Common Test Output¶
All performance tests output the following metrics:
- Latency Percentiles: P50, P75, P90, P95, P99
- Throughput Statistics: Average, Min, Max throughput in Gb/s
- Total Data Size: In GB
- Number of Iterations: Test iterations performed
Example Output¶
============================================================
YR LOCAL BANDWIDTH TEST SUMMARY
============================================================
Total Data Size: 0.001024 GB
Number of Iterations: 5
Average Transport Time: 0.00012345s
Average Transport Throughput: 66.35714286 Gb/s
Min Transport Throughput: 60.00000000 Gb/s
Max Transport Throughput: 70.00000000 Gb/s
P50 Latency: 0.00012000s
P75 Latency: 0.00012500s
P90 Latency: 0.00012800s
P95 Latency: 0.00013000s
P99 Latency: 0.00013200s
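The summary metrics can be derived from the per-iteration transfer times, as sketched below. This rests on several assumptions: 1 GB = 10^9 bytes, throughput in Gb/s is bytes × 8 / 10^9 / seconds, the average throughput is the mean of the per-iteration throughputs, and percentiles use the nearest-rank method. The framework may compute any of these differently. `percentile` and `summarize` are hypothetical helpers, and the timings are made-up sample values.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (0 < pct <= 100)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(times_s, total_bytes):
    """Build a summary dict from per-iteration transfer times in seconds."""
    gigabits = total_bytes * 8 / 1e9          # payload size in gigabits
    throughputs = [gigabits / t for t in times_s]
    ordered = sorted(times_s)
    return {
        "total_gb": total_bytes / 1e9,
        "iterations": len(times_s),
        "avg_time_s": sum(times_s) / len(times_s),
        "avg_gbps": sum(throughputs) / len(throughputs),
        "min_gbps": min(throughputs),
        "max_gbps": max(throughputs),
        "latency": {p: percentile(ordered, p) for p in (50, 75, 90, 95, 99)},
    }


# Made-up timings for five iterations of a 1,024,000-byte transfer.
times = [0.00012, 0.000125, 0.000128, 0.00013, 0.000132]
summary = summarize(times, total_bytes=1_024_000)
for key, value in summary.items():
    print(key, value)
```

With only five iterations, the upper percentiles (P90 and above) all collapse onto the slowest sample, so a larger --count gives more meaningful tail latencies.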
Architecture¶
The performance test framework consists of:
- base_perftest.py: Abstract base class with common test infrastructure
- direct_transport_perftest.py: YR Direct Transport specific implementation
- direct_transport_config.yaml: Example configuration file
Test Flow¶
- Initialize etcd service for coordination
- Start DataSystem actors on target nodes
- Create test actors (sender and receiver)
- Warm-up iterations
- Run performance measurements
- Calculate and display statistics
- Cleanup resources
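The warm-up, measurement, and cleanup steps of this flow can be sketched as a small timing loop. This is an illustration only: it does not start etcd, DataSystem actors, or Ray actors. `run_benchmark` and its `setup`, `transfer`, and `cleanup` callables are hypothetical stand-ins, and the in-memory copy stands in for a tensor transfer.

```python
import time
from typing import Callable, List, Optional


def run_benchmark(transfer: Callable[[], object],
                  warmup_times: int = 2,
                  count: int = 5,
                  setup: Optional[Callable[[], None]] = None,
                  cleanup: Optional[Callable[[], None]] = None) -> List[float]:
    """Run warm-up iterations, then time `count` transfers; always clean up."""
    if setup is not None:
        setup()                      # e.g. start services and create actors
    try:
        for _ in range(warmup_times):
            transfer()               # warm-up runs are not timed
        timings = []
        for _ in range(count):
            start = time.perf_counter()
            transfer()
            timings.append(time.perf_counter() - start)
        return timings
    finally:
        if cleanup is not None:
            cleanup()                # runs even if a transfer raises


# Stand-in workload: copy a 1 MiB buffer in memory.
payload = bytes(1024 * 1024)
events = []
timings = run_benchmark(lambda: bytearray(payload),
                        warmup_times=2, count=5,
                        cleanup=lambda: events.append("cleaned"))
print(len(timings), events)
```

Putting cleanup in a `finally` block means resources are released even when a measurement iteration fails.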