Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Now both the Transfer Engine and Mooncake Store are open-sourced! This repository also hosts its technical report and the open-sourced traces.
Mooncake features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated KVCache pool.
The core of Mooncake is its KVCache-centric scheduler, which maximizes overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake must cope with highly overloaded scenarios. To mitigate overload, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake achieves up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake’s architecture enables Kimi to handle 75% more requests.
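The idea of prediction-based early rejection can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Request` type, the toy linear TTFT predictor, and the SLO value are hypothetical and do not reflect Mooncake's actual scheduler. The point is only that a request is rejected at admission time if its predicted latency would already violate the SLO, so no prefill work is wasted on it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    request_id: str
    input_tokens: int


def predict_ttft_ms(queued_tokens: int, req: Request,
                    prefill_tokens_per_ms: float) -> float:
    """Toy predictor (assumption): drain the queue, then prefill this request."""
    return (queued_tokens + req.input_tokens) / prefill_tokens_per_ms


def schedule(requests: List[Request], prefill_tokens_per_ms: float,
             ttft_slo_ms: float) -> Tuple[List[str], List[str]]:
    """Admit requests in arrival order, rejecting early when the SLO would be missed."""
    accepted, rejected = [], []
    queued_tokens = 0
    for req in requests:
        predicted = predict_ttft_ms(queued_tokens, req, prefill_tokens_per_ms)
        if predicted <= ttft_slo_ms:
            accepted.append(req.request_id)
            queued_tokens += req.input_tokens  # accepted work adds to the queue
        else:
            rejected.append(req.request_id)    # rejected before any prefill runs
    return accepted, rejected


demo = [Request("a", 1000), Request("b", 4000), Request("c", 500)]
accepted, rejected = schedule(demo, prefill_tokens_per_ms=1.0, ttft_slo_ms=4500.0)
print(accepted, rejected)
```

In this toy run, request `b` is rejected because queueing it behind `a` would push its predicted TTFT past the SLO, while the shorter request `c` still fits.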
Mooncake Core Component: Transfer Engine (TE). The core of Mooncake is the Transfer Engine (TE), which provides a unified interface for batched data transfer across various storage devices and network links. Supporting multiple protocols including TCP, RDMA, CXL/shared-memory, and NVMe over Fabrics (NVMe-oF), TE is designed to enable fast and reliable data transfer for AI workloads. Compared to Gloo (used by Distributed PyTorch) and traditional TCP, TE achieves significantly lower I/O latency.
P2P Store and Mooncake Store. Both P2P Store and Mooncake Store are built on the Transfer Engine and provide key/value caching for different scenarios. P2P Store focuses on sharing temporary objects (e.g., checkpoint files) across nodes in a cluster, preventing bandwidth saturation on a single machine. Mooncake Store, on the other hand, supports distributed pooled KVCache, specifically designed for XpYd disaggregation to enhance resource utilization and system performance.
Mooncake Integration with Leading LLM Inference Systems. Mooncake has been integrated with several popular large language model (LLM) inference systems. Through collaboration with the vLLM and SGLang teams, Mooncake now officially supports prefill-decode disaggregation. By leveraging the high-efficiency communication capabilities of RDMA devices, Mooncake significantly improves inference efficiency in prefill-decode disaggregation scenarios, providing robust technical support for large-scale distributed inference tasks. In addition, Mooncake has been integrated with SGLang's Hierarchical KV Caching, vLLM's prefill serving, and LMCache, augmenting KV cache management capabilities across large-scale inference scenarios.
Elastic Expert Parallelism Support. Mooncake adds elasticity and fault tolerance support for MoE model inference, enabling inference systems to remain responsive and recoverable in the event of GPU failures or changes in resource configuration. This functionality includes automatic faulty rank detection and can work with the EPLB module to dynamically route tokens to healthy ranks during inference.
Tensor-Centric Ecosystem. Mooncake establishes a full-stack, Tensor-oriented AI infrastructure where Tensors serve as the fundamental data carrier. The ecosystem spans from the Transfer Engine, which accelerates Tensor data movement across heterogeneous storage (DRAM/VRAM/NVMe), to the P2P Store and Mooncake Store for distributed management of Tensor objects (e.g., Checkpoints and KVCache), up to the Mooncake Backend enabling Tensor-based elastic distributed computing. This architecture is designed to maximize Tensor processing efficiency for large-scale model inference and training.
Mooncake supports heterogeneous accelerators, NICs, and specialized transport paths. The summary below focuses on runtime and transport coverage that is already exposed through build options, documented protocols, or dedicated examples in this repository.
| Vendor / ecosystem | Hardware / runtime | Status | Build options and notes |
| --- | --- | --- | --- |
| Huawei Ascend | Ascend NPUs | Supported | -DUSE_ASCEND=ON, -DUSE_ASCEND_DIRECT=ON, -DUSE_UBSHMEM=ON, -DUSE_ASCEND_HETEROGENEOUS=ON; covers HCCL transport, Ascend Direct transport, UBShmem transport, and heterogeneous Ascend-GPU transport |
| Cambricon | MLU + Neuware | Supported | -DUSE_MLU=ON; MLU memory detection, topology discovery, and registration reuse the standard rdma data path |
| Moore Threads | MUSA GPUs | Supported | -DUSE_MUSA=ON; accelerator-aware data transfer with MUSA runtime integration |
| MetaX (Muxi) | MACA GPUs | Supported | -DUSE_MACA=ON; source build support through the MACA SDK |
| T-Head | PPU / Barex | Supported | T-Head PPU deployments are represented here through Barex-based transport support |
| NVIDIA | CUDA GPUs / NVLink | Supported | -DUSE_CUDA=ON, -DUSE_INTRA_NVLINK=ON, -DUSE_MNNVL=ON; covers CUDA memory, GPUDirect RDMA, GPUDirect Storage, intra-node NVLink, and multi-node NVLink |
| AMD | ROCm / HIP GPUs | Supported | -DUSE_HIP=ON; HIP transport for AMD GPU communication |
| Hygon | DCU / DTK | Supported | -DUSE_HYGON=ON; CUDA-compatible runtime via Hygon DTK SDK |
| Iluvatar | CoreX | Supported | -DUSE_COREX=ON; CUDA-compatible runtime via Iluvatar CoreX SDK |
| Alibaba Cloud | eRDMA NICs | Supported | rdma data path with eRDMA devices such as erdma_0; the build also enables CONFIG_ERDMA |
| Standard RDMA ecosystem | InfiniBand / RoCE NICs | Supported | Available through the standard rdma protocol path with topology-aware NIC selection |
| AWS | Elastic Fabric Adapter (EFA) | Supported | -DUSE_EFA=ON; EFA transport built on libfabric SRD |
| Storage disaggregation | NVMe-oF | Supported | Enabled with -DUSE_NVMEOF=ON |
| Memory pooling | CXL | Supported | Enabled with -DUSE_CXL=ON |
| Baseline networking | TCP/IP | Supported | tcp works in all environments |

| Transport | Status | Build options and notes |
| --- | --- | --- |
| Ascend HCCL transport | Supported | Enabled by -DUSE_ASCEND=ON; examples use hccl for Ascend NPU data movement |
| Ascend Direct transport | Supported | Enabled by -DUSE_ASCEND_DIRECT=ON; dedicated Ascend Direct examples and docs are included |
| UBShmem transport | Supported | Enabled by -DUSE_UBSHMEM=ON; Transfer Engine examples accept --protocol=ubshmem |
| Heterogeneous Ascend transport | Supported | Enabled by -DUSE_ASCEND_HETEROGENEOUS=ON; used for Ascend-GPU heterogeneous transfer |
| Barex transport | Supported | Enabled by -DUSE_BAREX=ON; documented as the barex advanced transport |
| Sunrise Transport | Supported | Included here as an additional specialized transport path to reflect current hardware support positioning |
| T-Head PPU / Barex | Supported | Barex-based transport coverage is available for T-Head PPU deployments |
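As a small illustration of how the build options in the tables above might be combined, the sketch below assembles a CMake command line from a list of targets. The flag strings are copied from the tables; the dictionary keys, the `cmake_args` helper, and the assumption that several options can be enabled in one configure step are hypothetical and should be checked against the Build Guide.

```python
from typing import Dict, List

# Flags copied from the tables above; the keys are labels invented for this sketch.
BUILD_FLAGS: Dict[str, str] = {
    "ascend": "-DUSE_ASCEND=ON",
    "ascend_direct": "-DUSE_ASCEND_DIRECT=ON",
    "ubshmem": "-DUSE_UBSHMEM=ON",
    "ascend_heterogeneous": "-DUSE_ASCEND_HETEROGENEOUS=ON",
    "mlu": "-DUSE_MLU=ON",
    "musa": "-DUSE_MUSA=ON",
    "maca": "-DUSE_MACA=ON",
    "cuda": "-DUSE_CUDA=ON",
    "intra_nvlink": "-DUSE_INTRA_NVLINK=ON",
    "mnnvl": "-DUSE_MNNVL=ON",
    "hip": "-DUSE_HIP=ON",
    "hygon": "-DUSE_HYGON=ON",
    "corex": "-DUSE_COREX=ON",
    "efa": "-DUSE_EFA=ON",
    "nvmeof": "-DUSE_NVMEOF=ON",
    "cxl": "-DUSE_CXL=ON",
    "barex": "-DUSE_BAREX=ON",
}


def cmake_args(targets: List[str], source_dir: str = "..") -> List[str]:
    """Build a cmake argument list, skipping duplicates and rejecting unknown targets."""
    args = ["cmake", source_dir]
    for target in targets:
        if target not in BUILD_FLAGS:
            raise KeyError(f"unknown build target: {target}")
        flag = BUILD_FLAGS[target]
        if flag not in args:
            args.append(flag)
    return args


print(" ".join(cmake_args(["cuda", "nvmeof"])))
```

Baseline TCP and the standard rdma path have no entry because the tables list no dedicated build option for them.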
Transfer Engine is a high-performance data transfer framework that provides a unified interface for transferring data from DRAM, VRAM, or NVMe while hiding hardware-specific details. It supports multiple communication protocols, including TCP, RDMA (InfiniBand/RoCEv2/eRDMA/NVIDIA GPUDirect), AWS EFA, NVMe over Fabrics (NVMe-oF), NVLink, HIP, Barex, CXL, and Ascend-family transports. When built with the corresponding runtime, Transfer Engine can also detect and route accelerator memory in CUDA, MUSA, HIP, MACA, Cambricon MLU, and Ascend-enabled environments. For a complete list of supported protocols and a configuration guide, see the Supported Protocols Documentation.
Efficient use of multiple RDMA NIC devices. Transfer Engine supports the use of multiple RDMA NIC devices to achieve the aggregation of transfer bandwidth.
Topology aware path selection. Transfer Engine can select optimal devices based on the location (NUMA affinity, etc.) of both source and destination.
More robust against temporary network errors. If a transmission fails, Transfer Engine automatically retries the delivery over alternative paths.
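The last two properties can be sketched together. The example below is a simplified model, not Transfer Engine's implementation: the `Nic` type, the device names, and the use of `ConnectionError` to signal a failed path are all assumptions. It ranks NICs so that those on the same NUMA node as the buffer are tried first, then falls back to the remaining paths when a send fails.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Nic:
    name: str       # hypothetical device name
    numa_node: int


def rank_nics(nics: List[Nic], buffer_numa_node: int) -> List[Nic]:
    """NICs local to the buffer's NUMA node come first; sort is stable otherwise."""
    return sorted(nics, key=lambda nic: nic.numa_node != buffer_numa_node)


def transfer_with_failover(nics: List[Nic], buffer_numa_node: int,
                           send: Callable[[Nic], None]) -> str:
    """Try each path in preference order and return the name of the NIC that worked."""
    for nic in rank_nics(nics, buffer_numa_node):
        try:
            send(nic)
            return nic.name
        except ConnectionError:
            continue  # temporary failure: move on to the next path
    raise ConnectionError("all paths failed")


def flaky_send(nic: Nic) -> None:
    """Stand-in for a real transfer: the preferred NIC is temporarily down."""
    if nic.name == "mlx5_1":
        raise ConnectionError("link down")


nics = [Nic("mlx5_0", 0), Nic("mlx5_1", 1)]
used = transfer_with_failover(nics, buffer_numa_node=1, send=flaky_send)
print(used)
```

Here the buffer lives on NUMA node 1, so `mlx5_1` is preferred, but the simulated link failure makes the transfer complete over `mlx5_0` instead.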
With 40 GB of data (equivalent to the size of the KVCache generated by 128k tokens in the LLaMA3-70B model), Mooncake Transfer Engine delivers up to 87 GB/s and 190 GB/s of bandwidth in 4×200 Gbps and 8×400 Gbps RoCE networks respectively, which are about 2.4x and 4.6x faster than the TCP protocol.
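To put these figures in context, the short calculation below derives the transfer time for the 40 GB payload, the fraction of nominal aggregate line rate reached, and the TCP bandwidth implied by the stated speedups. It assumes decimal units (1 GB/s = 8 Gbps) and reads "2.4x faster" as a ratio of bandwidths; both are interpretations made for this sketch.

```python
KV_CACHE_GB = 40.0

# (label, NIC count, Gbps per NIC, measured GB/s, stated speedup over TCP)
CASES = [
    ("4x200 Gbps", 4, 200.0, 87.0, 2.4),
    ("8x400 Gbps", 8, 400.0, 190.0, 4.6),
]


def line_rate_gb_per_s(num_nics: int, gbps_per_nic: float) -> float:
    """Nominal aggregate line rate in GB/s, assuming 8 bits per byte."""
    return num_nics * gbps_per_nic / 8.0


def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    return size_gb / bandwidth_gb_per_s


def summarize(case) -> dict:
    label, num_nics, gbps, measured, speedup = case
    peak = line_rate_gb_per_s(num_nics, gbps)
    return {
        "label": label,
        "seconds": transfer_seconds(KV_CACHE_GB, measured),
        "utilization": measured / peak,
        "implied_tcp_gb_per_s": measured / speedup,
    }


for case in CASES:
    print(summarize(case))
```

Under these assumptions the 40 GB payload moves in about 0.46 s and 0.21 s respectively, corresponding to roughly 87% and 47.5% of the nominal aggregate line rate.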
P2P Store is built on the Transfer Engine and supports sharing temporary objects between peer nodes in a cluster. P2P Store is ideal for scenarios like checkpoint transfer, where data needs to be rapidly and efficiently shared across a cluster. P2P Store has been used in the checkpoint transfer service of Moonshot AI.
Decentralized architecture. P2P Store leverages a pure client-side architecture with global metadata managed by the etcd service.
Efficient data distribution. Designed to enhance the efficiency of large-scale data distribution, P2P Store avoids bandwidth saturation issues by allowing replicated nodes to share data directly. This reduces the CPU/RDMA NIC pressures of data providers (e.g., trainers).
Mooncake Store is a distributed KVCache storage engine built on Transfer Engine and specialized for LLM inference. It is the central component of the KVCache-centric disaggregated architecture. The goal of Mooncake Store is to store reusable KV caches across various locations in an inference cluster. Mooncake Store is supported in SGLang's Hierarchical KV Caching and vLLM's prefill serving, and is now integrated with LMCache to provide enhanced KVCache management capabilities.
Multi-replica support: Mooncake Store supports storing multiple data replicas for the same object, effectively alleviating hotspots in access pressure.
High bandwidth utilization: Mooncake Store supports striping and parallel I/O transfer of large objects, fully utilizing multi-NIC aggregated bandwidth for high-speed data reads and writes.
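Both features can be pictured with a small sketch. It is a simplified model under stated assumptions, not Mooncake Store's actual placement logic: the round-robin stripe assignment, the least-loaded replica choice, and all names are hypothetical.

```python
from typing import Dict, List, Tuple


def stripe(size: int, num_nics: int, stripe_size: int) -> List[Tuple[int, int, int]]:
    """Split an object into (nic_index, offset, length) slices, assigned round-robin."""
    if num_nics <= 0 or stripe_size <= 0:
        raise ValueError("num_nics and stripe_size must be positive")
    plan = []
    offset = 0
    index = 0
    while offset < size:
        length = min(stripe_size, size - offset)
        plan.append((index % num_nics, offset, length))
        offset += length
        index += 1
    return plan


def pick_replica(replicas: List[str], load: Dict[str, int]) -> str:
    """Choose the replica holder with the fewest in-flight reads (assumed metric)."""
    return min(replicas, key=lambda node: load.get(node, 0))


plan = stripe(size=10, num_nics=2, stripe_size=4)
chosen = pick_replica(["node-a", "node-b"], {"node-a": 5, "node-b": 2})
print(plan, chosen)
```

Each slice can then be issued on its own NIC in parallel, and reads of a hot object spread across whichever replica is currently least loaded.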
SGLang officially supports Mooncake Store as a HiCache storage backend. This integration enables scalable KV cache retention and high-performance access for large-scale LLM serving scenarios.
To optimize LLM inference, the vLLM community is working on supporting disaggregated prefilling (PR 10502). This feature allows the prefill phase and the decode phase to run in separate processes. vLLM uses nccl and gloo as the transport layer by default, but it currently cannot efficiently decouple the two phases across different machines.
We have implemented vLLM integration, which uses Transfer Engine as the network layer instead of nccl and gloo, to support inter-node KVCache transfer (PR 10884). Transfer Engine provides simpler interfaces and more efficient use of RDMA devices.
We will soon release the new vLLM integration based on Mooncake Store, which supports xPyD prefill/decode disaggregation.
Update[Dec 16, 2024]: Here is the latest vLLM Integration (Guide v0.2) that is based on vLLM's main branch.
By supporting topology-aware path selection and multi-card bandwidth aggregation, the mean TTFT of vLLM with Transfer Engine is up to 25% lower than with traditional TCP-based transports. In the future, we will further improve TTFT through GPUDirect RDMA and zero-copy.
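| Backend | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) |
| --- | --- | --- | --- | --- | --- |
| Transfer Engine (RDMA) | 12.06 | 2042.74 | 1056.76 | 635.00 | 4006.59 |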
| TCP | 12.05 | 2041.13 | 1414.05 | 766.23 | 6035.36 |
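The 25% figure can be checked directly from the table. The snippet below uses the third numeric column (1056.76 for Transfer Engine versus 1414.05 for TCP), which is the column whose values reproduce the stated mean TTFT improvement; the helper name is introduced here for illustration.

```python
MEAN_TTFT = {
    "Transfer Engine (RDMA)": 1056.76,
    "TCP": 1414.05,
}


def relative_reduction(new: float, baseline: float) -> float:
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


reduction = relative_reduction(MEAN_TTFT["Transfer Engine (RDMA)"], MEAN_TTFT["TCP"])
print(f"Mean TTFT reduction: {reduction:.1%}")
```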
More advanced features are coming soon, so stay tuned!
Mooncake is designed and optimized for high-speed RDMA networks. Although Mooncake supports TCP-only data transfer, we strongly recommend that users evaluate its functionality and performance with RDMA network support.
The following need to be installed before running any component of Mooncake:
The simplest way to use Mooncake Transfer Engine is to install it with pip:
For CUDA-enabled systems:
For non-CUDA systems:
Important
Mooncake supports Docker-based deployment; see the Build Guide for details.
To produce an image that compiles Mooncake from source, builds the wheel via scripts/build_wheel.sh, and installs that wheel inside the container, use build-wheel.dockerfile:
The resulting image already has a virtual environment at /opt/venv with the freshly built wheel installed. Launch it with GPU/RDMA access as needed, for example:
The 64gb / 56gb values above are tuned examples for large HiCache deployments, not allocator defaults. The arena is off by default. Setting MC_MMAP_ARENA_POOL_SIZE=... explicitly both enables and sizes the arena; if you enable it via gflag instead, the default pool size is 8gb. On smaller hosts, start with 8gb or 16gb and size upward with the helper.
To use the baseline direct-mmap() path instead, set MC_DISABLE_MMAP_ARENA=1 (true, yes, and on are also accepted). Like the arena size itself, this must be set before the first Mooncake mmap-buffer allocation in the process. Arena bring-up is a one-shot lazy init, so after a failed first attempt you need to restart the process to retry with corrected env / hugepage settings.
Without MC_STORE_USE_HUGEPAGE=1, the arena may opportunistically try hugepages and then retry on regular pages if HugeTLB is unavailable. When MC_STORE_USE_HUGEPAGE=1 is present, Mooncake preserves the strict hugepage contract for both arena and direct-mmap() host-buffer allocation rather than silently downgrading to regular pages.
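The rules above can be summarized in a small helper that prepares the environment before any Mooncake allocation happens. This is a sketch only: `arena_env` and `arena_disabled` are hypothetical names, the helper does not call Mooncake, and treating the truthy values as case-insensitive is an assumption. The variable names and accepted values come from the text above.

```python
from typing import Dict, Optional

TRUTHY = {"1", "true", "yes", "on"}


def arena_env(pool_size: Optional[str] = None,
              disable_arena: bool = False,
              strict_hugepage: bool = False) -> Dict[str, str]:
    """Return the environment variables for the requested arena configuration."""
    env: Dict[str, str] = {}
    if disable_arena:
        # Baseline direct-mmap() path; a pool size would be pointless here.
        env["MC_DISABLE_MMAP_ARENA"] = "1"
    elif pool_size is not None:
        # Setting the size explicitly both enables and sizes the arena.
        env["MC_MMAP_ARENA_POOL_SIZE"] = pool_size
    if strict_hugepage:
        # Strict hugepage contract: no silent downgrade to regular pages.
        env["MC_STORE_USE_HUGEPAGE"] = "1"
    return env


def arena_disabled(env: Dict[str, str]) -> bool:
    """Check the disable switch (case-insensitive matching is an assumption)."""
    return env.get("MC_DISABLE_MMAP_ARENA", "").strip().lower() in TRUTHY


print(arena_env(pool_size="8gb"))
print(arena_env(disable_arena=True, strict_hugepage=True))
```

In a real process the resulting mapping would be applied with `os.environ.update(...)` before the first Mooncake mmap-buffer allocation, since the arena is initialized once and lazily.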
Note: Make sure you build the image from the repository root so that Git metadata and submodules are available inside the build context.
The following are additional dependencies for building Mooncake:
The build and installation steps are as follows:
Retrieve source code from GitHub repo
Install dependencies
Compile Mooncake and examples
For Cambricon MLU builds, configure CMake with -DUSE_MLU=ON. For example:
The above presents two samples from our trace dataset. The trace includes the timing of request arrivals, the number of input tokens, the number of output tokens, and the remapped block hash. To protect our customers' privacy, we applied several mechanisms to remove user-related information while preserving the dataset's utility for simulated evaluation. More descriptions of the trace (e.g., up to 50% cache hit ratio) can be found in Section 4 of the technical report.
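A trace in this shape can be replayed to estimate block-level cache reuse. The sketch below assumes one JSON object per line with the field names `timestamp`, `input_length`, `output_length`, and `hash_ids`; these names and the two sample records are assumptions made for illustration, not data from the released traces. It also assumes an unbounded cache and counts any previously seen block hash as a hit, which is a simplification.

```python
import json
from typing import Iterable

# Illustrative records only; not taken from the released trace files.
SAMPLE = """\
{"timestamp": 0, "input_length": 1024, "output_length": 10, "hash_ids": [1, 2]}
{"timestamp": 1000, "input_length": 1536, "output_length": 20, "hash_ids": [1, 2, 3]}
"""


def block_hit_ratio(lines: Iterable[str]) -> float:
    """Fraction of requested blocks whose hash was already seen earlier in the trace."""
    seen = set()
    hits = 0
    total = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        for block_hash in record["hash_ids"]:
            total += 1
            if block_hash in seen:
                hits += 1
            else:
                seen.add(block_hash)
    return hits / total if total else 0.0


ratio = block_hit_ratio(SAMPLE.splitlines())
print(f"block hit ratio: {ratio:.2f}")
```

In the sample, the second request reuses two of its three blocks from the first request, so two of the five requested blocks are hits.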
Update[Feb 21, 2025]: The updated traces used in our FAST'25 paper have been released! Please refer to the paper's appendix (found here) for more details.
Please cite our paper if you find the paper or the traces useful: