Mooncake Transfer Engine supports multiple communication protocols for data transfer between nodes in a cluster. The protocol selection depends on your hardware capabilities and performance requirements.
| Protocol | Hardware Requirement | Use Case | Status |
|---|---|---|---|
| tcp | Standard network | General purpose, works everywhere | ✅ Primary |
| rdma | RDMA-capable NIC | High-performance, low-latency | ✅ Primary |
| efa | AWS EFA-capable instance | High-performance on AWS (libfabric SRD) | ✅ Primary |
| nvmeof | NVMe-oF capable storage | Direct NVMe storage access | ⚠️ Advanced |
| nvlink | NVIDIA MNNVL | Inter-node GPU communication | ⚠️ Advanced |
| nvlink_intra | NVIDIA NVLink | Intra-node GPU communication | ⚠️ Advanced |
| hip | AMD ROCm/HIP | AMD GPU communication | ⚠️ Advanced |
| barex | RDMA-capable NIC | Bare-metal RDMA extension | ⚠️ Advanced |
| cxl | CXL-capable hardware | Memory pooling and sharing | ⚠️ Advanced |
| ascend | Huawei Ascend NPU | Ascend NPU communication | ⚠️ Advanced |
Description: Standard TCP/IP network protocol.
Use When:
No special hardware is available
Testing or development environments
Compatibility is more important than performance
Configuration:
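A minimal sketch of a TCP setup in Python. It assumes the Python binding's initialize call takes a local hostname, a metadata server, a protocol string, and a device name, in that order; check the Python API Reference before relying on it. The InitArgs container, the tcp_init_args helper, and the "P2PHANDSHAKE" default are illustrative choices, not part of the library.

```python
from typing import NamedTuple


class InitArgs(NamedTuple):
    """Hypothetical container for the values passed at engine start-up."""
    local_hostname: str
    metadata_server: str
    protocol: str
    device_name: str


def tcp_init_args(local_hostname: str,
                  metadata_server: str = "P2PHANDSHAKE") -> InitArgs:
    """Build start-up arguments for the tcp protocol.

    TCP needs no special hardware, so the device name is left empty.
    """
    if not local_hostname:
        raise ValueError("local_hostname must not be empty")
    return InitArgs(local_hostname, metadata_server, "tcp", "")


args = tcp_init_args("localhost:12345")
print(args)

# Assumed usage with the Python binding (not executed here):
#   from mooncake.engine import TransferEngine
#   engine = TransferEngine()
#   engine.initialize(*args)
```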
Advantages:
Works in all environments
No special hardware required
Simple setup
Limitations:
Lower throughput compared to RDMA
Higher CPU overhead
Higher latency
Description: Remote Direct Memory Access protocol providing high-performance, low-latency data transfer with minimal CPU overhead. Supports accelerator-aware memory registration, including NVIDIA GPUDirect RDMA for CUDA buffers and Cambricon MLU buffers when built with Neuware.
Hardware Support:
InfiniBand
RoCE (RDMA over Converged Ethernet)
eRDMA (Elastic RDMA)
NVIDIA GPUDirect RDMA
Non-NVIDIA GPUDirect RDMA (e.g., Intel E810 RDMA NIC)
Cambricon MLU memory via Neuware (-DUSE_MLU=ON)
Use When:
High-performance networking is required
RDMA-capable NICs are available
Low latency is critical (e.g., distributed inference, KV cache transfer)
Note: If no RDMA HCA (Host Channel Adapter) is detected on the system, the Transfer Engine automatically falls back to the TCP protocol for compatibility.
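The fallback rule can be modelled in a few lines when an application wants to predict or log which protocol will be in effect. This is a sketch of the documented behaviour only, not the engine's internal logic; effective_protocol is a hypothetical helper name.

```python
from typing import Sequence


def effective_protocol(requested: str, rdma_devices: Sequence[str]) -> str:
    """Return the protocol that would actually be used.

    Mirrors the documented behaviour: a request for "rdma" on a host
    with no RDMA HCA ends up on "tcp". Other requests pass through.
    """
    if requested == "rdma" and not rdma_devices:
        return "tcp"
    return requested


print(effective_protocol("rdma", []))          # tcp
print(effective_protocol("rdma", ["mlx5_0"]))  # rdma
print(effective_protocol("tcp", ["mlx5_0"]))   # tcp
```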
MLU Note: Cambricon MLU support uses the standard rdma data path. There is no separate mlu protocol string. To enable MLU memory detection, topology discovery, and DMA-BUF based registration, build Transfer Engine with -DUSE_MLU=ON and make Neuware available through NEUWARE_HOME or NEUWARE_ROOT.
Configuration:
Device Discovery: To find available RDMA devices on your system:
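On Linux, the ibv_devices tool from rdma-core lists RDMA devices, and the kernel exposes the same names under /sys/class/infiniband. The sketch below reads that directory from Python. Joining several names with commas for the device name argument is an assumption about the multi-NIC format, and both helper names are hypothetical.

```python
import os
from typing import List


def list_rdma_devices(sysfs_root: str = "/sys/class/infiniband") -> List[str]:
    """Return RDMA device names found under sysfs, sorted; empty if none."""
    try:
        return sorted(os.listdir(sysfs_root))
    except OSError:
        # Directory missing or unreadable: treat as "no RDMA devices".
        return []


def device_name_argument(devices: List[str]) -> str:
    """Join device names with commas (assumed multi-NIC format)."""
    return ",".join(devices)


devices = list_rdma_devices()
print("RDMA devices:", devices or "none found")
print("device_name:", repr(device_name_argument(devices)))
```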
Advantages:
Very high throughput (up to 200 Gbps per NIC)
Ultra-low latency (sub-microsecond)
Minimal CPU overhead
Supports GPUDirect RDMA for zero-copy GPU transfers
Multi-NIC bandwidth aggregation
Topology-aware path selection
Limitations:
Requires RDMA-capable hardware
May require elevated permissions (sudo)
More complex network configuration
Performance Tips:
Use multiple RDMA NICs for bandwidth aggregation
Enable GPUDirect RDMA for GPU memory transfers
Configure proper NUMA affinity for optimal performance
See Transfer Engine Benchmark Tuning for detailed optimization
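To reason about NUMA affinity, it helps to know which NUMA node each RDMA NIC is attached to. The following sketch groups devices by the numa_node value that Linux exposes in sysfs; the helper name nics_by_numa_node is hypothetical, and how the result is used for pinning is left to the application.

```python
import os
from collections import defaultdict
from typing import Dict, List


def nics_by_numa_node(
    sysfs_root: str = "/sys/class/infiniband",
) -> Dict[int, List[str]]:
    """Group RDMA devices by the NUMA node reported in sysfs.

    Devices whose numa_node file is missing or unreadable are grouped
    under -1, the value the kernel also reports for "no affinity".
    """
    try:
        names = sorted(os.listdir(sysfs_root))
    except OSError:
        return {}
    groups: Dict[int, List[str]] = defaultdict(list)
    for name in names:
        path = os.path.join(sysfs_root, name, "device", "numa_node")
        try:
            with open(path) as f:
                node = int(f.read().strip())
        except (OSError, ValueError):
            node = -1
        groups[node].append(name)
    return dict(groups)


for node, nics in sorted(nics_by_numa_node().items()):
    print(f"NUMA node {node}: {', '.join(nics)}")
```

Buffers and worker threads placed on the same node as the NIC they use avoid cross-socket traffic; a tool such as numactl can apply that placement at launch.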
Description: AWS EFA transport using libfabric’s Scalable Reliable Datagram (SRD) protocol, providing high-bandwidth RDMA-like performance on AWS instances without traditional RDMA support.
Use When:
Running on AWS EFA-enabled instances (e.g., p5e.48xlarge, p6-b200.48xlarge, p4d.24xlarge)
High-performance networking is required on AWS
Traditional RDMA (ibverbs QP) is not supported by the hardware
Configuration:
Build Requirements:
Note: -DUSE_CUDA=ON is required when transferring GPU memory. Without it, the fallback to the TCP protocol fails with “Bad address” errors on GPU buffers.
Advantages:
High throughput (~170 GB/s with 8 EFA devices, tuned)
Bypasses kernel network stack
Available on all AWS EFA-enabled instances
Limitations:
AWS-only
Software-emulated RDMA writes (higher CPU overhead than true RDMA)
Reaches roughly 88% of RoCE RDMA throughput
Documentation: See EFA Transport for build instructions, benchmarks, and tuning.
The following protocols are available at the C++ Transfer Engine level for specialized use cases. They are not commonly used through the Python API.
Description: Direct data transfer between NVMe storage and DRAM/VRAM using GPUDirect Storage, bypassing the CPU for zero-copy operations.
Use When:
Direct NVMe storage access is needed
Implementing multi-tier storage (DRAM/VRAM/NVMe)
Working with large datasets that don’t fit in memory
Requirements:
NVMe-oF capable storage
Properly mounted remote storage nodes
Description: NVIDIA MNNVL (Multi-Node NVLink) protocol for high-bandwidth, low-latency GPU-to-GPU communication across nodes.
Use When:
Inter-node GPU communication is required
Using NVIDIA MNNVL (Multi-Node NVLink)
Maximum GPU bandwidth is needed
Requirements:
NVIDIA MNNVL hardware
Compiled with USE_MNNVL=ON
Configuration:
Note: When protocol="rdma" is set and RDMA NICs exist, you must explicitly set MC_FORCE_MNNVL=true to use MNNVL instead of RDMA. If no RDMA HCA is detected, MNNVL will be used automatically.
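The rule in this note can be written as a small decision function. This is a sketch of the documented behaviour for a build with USE_MNNVL=ON, not the engine's own code; the helper name is hypothetical, and accepting only the literal string "true" for MC_FORCE_MNNVL is an assumption.

```python
import os
from typing import Mapping


def rdma_request_uses_mnnvl(has_rdma_hca: bool, env: Mapping[str, str]) -> bool:
    """Decide whether a protocol="rdma" request ends up on MNNVL.

    Assumes the binary was built with USE_MNNVL=ON.
    """
    if not has_rdma_hca:
        # No RDMA HCA detected: MNNVL is used automatically.
        return True
    # RDMA NICs exist: MNNVL must be forced explicitly.
    # Matching only the literal "true" is an assumption of this sketch.
    return env.get("MC_FORCE_MNNVL", "") == "true"


print(rdma_request_uses_mnnvl(has_rdma_hca=True, env=os.environ))
```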
Description: NVIDIA NVLink for GPU-to-GPU communication within a single node.
Use When:
Local GPU-to-GPU transfers are needed
Maximizing intra-node GPU bandwidth
Requirements:
NVIDIA NVLink hardware
Compiled with USE_INTRA_NVLINK=ON
Description: AMD ROCm/HIP transport for GPU communication using IPC handles or Shareable handles.
Use When:
Working with AMD GPUs
Intra-node GPU communication is needed on AMD hardware
Requirements:
AMD ROCm/HIP runtime
AMD GPUs
Description: Bare-metal RDMA extension protocol for specialized RDMA configurations.
Use When:
Advanced RDMA features are required
Custom RDMA configurations
Requirements:
RDMA-capable hardware
Specialized configuration
Description: Compute Express Link for memory pooling and sharing across devices.
Use When:
CXL memory pooling is available
Memory disaggregation is needed
Requirements:
CXL-capable hardware
Description: Huawei Ascend NPU communication using HCCL (Huawei Collective Communication Library) or direct transport.
Use When:
Working with Huawei Ascend NPUs
Distributed inference on Ascend hardware
Requirements:
Huawei Ascend NPU hardware
HCCL runtime
Documentation:
TCP Configuration:
RDMA Configuration:
| Scenario | Recommended Protocol | Notes |
|---|---|---|
| Development/Testing | tcp | Simple setup, no special hardware |
| Production Inference | rdma | Best performance and latency |
| AWS Cloud (EFA instances) | efa | High performance on p5e, p6-b200, p4d, etc. |
| Cloud Environments | tcp or rdma (if available) | Check cloud provider support |
| Multi-tier Storage | rdma + nvmeof | Combine protocols for different layers |
| AMD GPU Clusters | rdma + hip | Use HIP for local GPU communication |
| Cambricon MLU Clusters | rdma | Build with -DUSE_MLU=ON; MLU uses the normal RDMA protocol |
| Ascend NPU Clusters | rdma + ascend | Use Ascend for NPU-specific operations |
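For deployment scripts, the selection guide above can be encoded as a lookup. The scenario keys and the recommend helper below are hypothetical names chosen for this sketch; only the protocol strings come from this page.

```python
from typing import Dict, List

# Scenario keys are hypothetical labels for the rows of the selection guide.
RECOMMENDED: Dict[str, List[str]] = {
    "development": ["tcp"],
    "production_inference": ["rdma"],
    "aws_efa": ["efa"],
    "multi_tier_storage": ["rdma", "nvmeof"],
    "amd_gpu_cluster": ["rdma", "hip"],
    "cambricon_mlu_cluster": ["rdma"],
    "ascend_npu_cluster": ["rdma", "ascend"],
}


def recommend(scenario: str, rdma_available: bool = False) -> List[str]:
    """Return the protocol strings suggested for a scenario."""
    if scenario == "cloud":
        # Generic cloud: use rdma only when the provider offers it.
        return ["rdma"] if rdma_available else ["tcp"]
    try:
        return list(RECOMMENDED[scenario])
    except KeyError:
        raise ValueError(f"unknown scenario: {scenario!r}") from None


print(recommend("multi_tier_storage"))  # ['rdma', 'nvmeof']
print(recommend("cloud"))               # ['tcp']
```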
Check RDMA devices:
Verify network connectivity:
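A basic reachability check can be scripted with the Python standard library. This sketch only confirms that a TCP connection to a peer can be opened; it does not exercise the RDMA data path. The can_connect helper is hypothetical, and the address and port in the example are placeholders.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


# Placeholder peer address; replace with the node under test.
print(can_connect("127.0.0.1", 12345))
```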
Check permissions:
RDMA may require elevated permissions
Run with sudo if necessary
Configure proper udev rules for non-root access
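One way to see whether the current user can reach the RDMA device nodes without sudo is to test access to the uverbs character devices, which rdma-core places under /dev/infiniband. The sketch below is a diagnostic aid with a hypothetical helper name; an empty result means no inaccessible nodes were found, which is also the case when no nodes exist at all.

```python
import glob
import os
from typing import List


def inaccessible_verbs_nodes(dev_root: str = "/dev/infiniband") -> List[str]:
    """Return uverbs device nodes the current user cannot read and write."""
    nodes = sorted(glob.glob(os.path.join(dev_root, "uverbs*")))
    return [n for n in nodes if not os.access(n, os.R_OK | os.W_OK)]


blocked = inaccessible_verbs_nodes()
if blocked:
    print("No read/write access to:", ", ".join(blocked))
else:
    print("No inaccessible uverbs nodes found")
```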
Firewall configuration:
Ensure RDMA ports are not blocked
Check that the InfiniBand subnet manager is running
If a protocol fails to initialize:
Verify hardware support
Check that required drivers are installed
Ensure compile-time flags are set correctly (for C++ protocols)
Fall back to TCP for basic functionality
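The last step can also be handled in application code by trying protocols in order of preference. The sketch below is generic: init stands for whatever start-up call the application makes for a given protocol string, and fake_init is a stand-in used only to make the example runnable.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def init_with_fallback(
    init: Callable[[str], T],
    protocols: Iterable[str] = ("rdma", "tcp"),
) -> Tuple[str, T]:
    """Try each protocol in order; return the first that initializes."""
    errors: List[str] = []
    for protocol in protocols:
        try:
            return protocol, init(protocol)
        except Exception as exc:  # record the failure and try the next one
            errors.append(f"{protocol}: {exc}")
    raise RuntimeError("all protocols failed: " + "; ".join(errors))


def fake_init(protocol: str) -> str:
    # Stand-in for a real start-up call; pretends RDMA is unavailable.
    if protocol == "rdma":
        raise RuntimeError("no RDMA device found")
    return f"engine({protocol})"


used, engine = init_with_fallback(fake_init)
print(used, engine)  # tcp engine(tcp)
```

Collecting the per-protocol errors keeps the original failure visible, which helps with the hardware, driver, and build-flag checks listed above.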
Quick Start Guide - Getting started with Mooncake
Transfer Engine Design - Detailed architecture
Transfer Engine Benchmark - Performance tuning
Python API Reference - API documentation
Deployment Guide - Production deployment