- [2025-10] 🎉🎉 Efficiency Report: We provide comprehensive Model FLOPs Utilization (MFU) metrics for various model architectures and training configurations. See MFU Reference for detailed benchmarks.
- [2025-10] 🚀🚀 LMMs-Engine v0.1 is here! a lean, efficient framework built to train unified multimodal model at scale.
# Clone the repository
git clone https://github.com/LMMs-Lab/lmms-engine.git
cd lmms-engine
# Install editable packages
uv pip install -e ".[all]"
# or install as a packages
uv pip install -e .
# Install a stable release
uv pip install lmms-engine
# Install dependencies using uv sync
# For Linux systems (recommended - auto-detects platform):
bash uv_sync_linux.sh
# For other systems or if encountering errors:
uv sync
# If uv sync fails, try:
uv pip install -r requirements.txt
# Optional: Performance optimizations
uv pip install flash-attn --no-build-isolation
uv pip install liger-kernel
We provide Docker images with pre-built environments including PyTorch, CUDA, and all necessary dependencies.
docker run --gpus all -it --rm \
-v $(pwd):/workspace \
-w /workspace \
fatbao55/lmms-engine:v1.0 \
bash
Recommended: torchrun (native PyTorch)
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 \
--master_addr=127.0.0.1 --master_port=12355 \
-m lmms_engine.launch.cli config_yaml=examples/qwen3_vl/example_config.yaml
Alternative: Accelerate
accelerate launch --use_fsdp \
-m lmms_engine.launch.cli config_yaml=examples/qwen3_vl/example_config.yaml
Single GPU
python -m lmms_engine.launch.cli config_yaml=examples/qwen3_vl/example_config.yaml
Model
Quick Start
FSDP2
TP
USP
Muon
Liger
Packing
NSA
EP
Highlights
| BAGEL |
run.sh |
✅ |
❌ |
TBD |
✅ |
❌ |
✅ |
✅ |
❌ |
Unified visual understanding & generation |
| Qwen2.5 |
run.sh |
✅ |
❌ |
✅ |
✅ |
✅ |
✅ |
❌ |
❌ |
Large Language Model |
| Qwen2.5-VL |
run.sh |
✅ |
❌ |
✅ |
✅ |
✅ |
✅ |
❌ |
❌ |
Multimodal Model |
| Qwen2.5-Omni |
run.sh |
✅ |
❌ |
✅ |
✅ |
✅ |
✅ |
❌ |
❌ |
Unified multimodal (image, audio, text) |
| Qwen3-VL |
run.sh |
✅ |
✅ |
✅ |
✅ |
✅ |
✅ |
❌ |
❌ |
Native-resolution, long context (10K+ tokens) |
| Qwen3-VL MoE |
run.sh |
✅ |
❌ |
✅ |
✅ |
✅ |
✅ |
❌ |
✅ |
Vision-Language MoE with EP (image, video, text) |
| Qwen3-MoE |
run.sh |
✅ |
❌ |
❌ |
✅ |
✅ |
✅ |
❌ |
✅ |
Mixture-of-Experts, Expert Parallelism |
| Qwen3-Omni MoE |
config |
✅ |
❌ |
❌ |
✅ |
✅ |
✅ |
❌ |
✅ |
Multimodal MoE with EP (image, audio, text) |
| WanVideo |
run.sh |
✅ |
❌ |
❌ |
✅ |
❌ |
❌ |
❌ |
❌ |
T2V/I2V/V2V generation (1.3B/14B) |
| FLA models |
run.sh |
✅ |
❌ |
❌ |
✅ |
❌ |
✅ |
❌ |
❌ |
Efficient architecture, FineWeb-Edu pretraining |
| dLLM (Qwen3) |
run.sh |
✅ |
❌ |
❌ |
✅ |
❌ |
❌ |
❌ |
❌ |
Masked diffusion language model |
| RAE-SigLip |
run.sh |
✅ |
❌ |
❌ |
✅ |
❌ |
❌ |
❌ |
❌ |
Representation AutoEncoder, LPIPS, EMA |
| SiT |
run.sh |
✅ |
❌ |
❌ |
✅ |
❌ |
❌ |
❌ |
❌ |
Interpolant Transformer, CFG, ImageNet-1K |
Optimization Legend:
- FSDP2: Fully Sharded Data Parallel v2 for distributed training
- TP: Tensor Parallelism for sharding model compute across GPUs
- USP: Ulysses Sequence Parallel for long contexts
- Muon: Advanced optimizer with Newton-Schulz orthogonalization
- Liger: Triton fused kernels (CrossEntropy, RMSNorm, RoPE, SwiGLU) for 30% memory reduction
- Packing: First-fit bin packing for peaking at 35-40% MFU vs 20-25% (w/o in Qwen2.5-VL finetuning)
- NSA: Native Sparse Attention for efficient long-context processing
- EP: Expert Parallelism for Mixture-of-Experts models, sharding experts across GPUs
💡 Tip: Each run.sh file contains detailed setup instructions, prerequisites, and configuration options.
20+ architectures spanning vision-language, diffusion, and language models.
- Qwen2.5-VL - SOTA level performance vision-language model
- Qwen3-VL - SOTA level performance vision-language model
- Qwen3-VL MoE - Vision-Language Mixture-of-Experts with Expert Parallelism and Sequence Parallelism support
- Qwen2.5-Omni - Unified vision + audio + text modalities
- Qwen3-Omni MoE - Multimodal Mixture-of-Experts with vision + audio + text and Expert Parallelism support
- LLaVA-OneVision - Fully open-source vision-language model
- Bagel - Unified multimodal model for visual understanding and generation
- Aero - Lightweight audio-language model
Diffusion & Generative Models
- dLLM (Qwen3) - Diffusion Language Model with masked prediction
- WanVideo (1.3B/14B) - Text/Image-to-Video generation (T2V/I2V/V2V)
- SiT (XL/2) - Scalable Interpolant Transformers for class-conditional image generation
- RAE-SigLip - Representation AutoEncoder with adversarial discriminator
- Qwen2/2.5/3 series - Full Liger kernel support with fused operations
- Linear Attention Models - Recurrent architecture optimized for Muon; Please install FLA first.
- Custom architectures - Extensible via @register_model() decorator
Production-grade efficiency from distributed training to kernel fusion.
Core Distributed Training
-
FSDP2 - PyTorch 2.0+ DTensor-based sharding for parameters, gradients, and optimizer states. Improved composability over original FSDP enables flexible parallelism composition.
-
Ulysses Sequence Parallel - Splits sequence dimension across GPUs for ultra-long contexts. Critical for vision-language models like Qwen3-VL with 10K+ visual tokens.
-
Multi-dimensional Parallelism - Compose TP × Ulysses SP/CP × DP meshes for cluster-scale training.
Memory & Compute Optimizations
-
Flash Attention + Unpadding - Tiled attention with use_rmpad eliminates all padding computation.
-
Native Sparse Attention (NSA) - Hybrid attention mechanism combining compressed attention, topk sparse attention, and sliding window attention.
-
Liger Kernel - Triton fused kernels (CrossEntropy, RMSNorm, RoPE, SwiGLU) achieve memory reduction by avoiding intermediate materializations.
-
Monkey Patching System - Runtime kernel injection via lmms_engine/configs/monkey_patch/ for model-specific optimizations without code modification.
-
Sequence Packing - Faster first-fit bin packing.
- Muon Optimizer - Newton-Schulz orthogonalization with Triton kernels, distributed via DTensor. Selective 2D-parameter application outperforms AdamW convergence.
- Streaming Datasets - IterableDataset for trillion-token pretraining without full data loading.
Sequence Packing - with full unpadding
dataset_config:
packing: true
packing_strategy: first_fit
packing_length: 32000
trainer_args:
use_rmpad: true # Requires flash-attn
use_liger_kernel: true
Liger Kernel - Enable LinkedIn's Triton kernels for 30% memory reduction
trainer_args:
use_liger_kernel: true
Fused operations:
- CrossEntropy (major memory savings)
- RMSNorm, RoPE, SwiGLU
- Automatically applied via monkey patching
Muon Optimizer - State-of-the-art optimizer for LLMs
trainer_args:
use_muon: true # enable muonwithadam optimizer
adam_beta1: 0.9 # for the adam part in muonwithadam optimizer
adam_beta2: 0.999 # for the adam part in muonwithadam optimizer
adam_epsilon: 1.0e-8 # for the adam part in muonwithadam optimizer
learning_rate: 0.001
weight_decay: 0.01
# ns_steps: 5 # Newton-Schulz iterations (default)
# for some modules which the user hope to
Features:
- Newton-Schulz orthogonalization with Triton kernels
- Distributed via DTensor (FSDP2)
- Selective 2D parameter application
Note
If users wish to specify whether a module should be optimized using Muon or Adam, they can designate this in lmms_engine.train.hf.trainer.create_optimizer. By default, modules excluded from Muon optimization include those containing the following substrings in their names: ["emb", "norm", "lm_head", "bias", "wte", "wpe", "output", "a_proj", "b_proj", "conv1d", "rotary"]
as well as any parameters whose dimension does not equal 2.
FSDP2 Configuration
trainer_args:
fsdp2: true
fsdp_config:
transformer_layer_cls_to_wrap: ["Qwen2VLDecoderLayer"]
reshard_after_forward: false
activation_checkpointing: true
Ulysses Sequence Parallel - For long-sequence VLMs
trainer_args:
sp_ulysses_degree: 2 # Sequence parallel degree
Benefits:
- Splits sequence length across GPUs
- Reduces memory footprint for long contexts
- Works with Flash Attention
Native Sparse Attention (NSA) - Efficient long-context attention for BAGEL
model_config:
load_from_pretrained_path: "lmms-lab/BAGEL-7B-MoT-ver.LE"
monkey_patch:
- type: nsa
model_type: bagel
kwargs:
block_size: 64
compress_type: "weightedpool" # weightedpool, linear, avgpool
kernel_size: 32
kernel_stride: 16
topk: 16
init_blocks: 1
local_blocks: 2
window_size: 512
Features:
- Compressed attention with key-value compression
- TopK sparse attention for efficiency
- Sliding window attention for local context
- Hybrid mechanism combines all three attention types
- Requires: pip install git+https://github.com/XunhaoLai/native-sparse-attention-triton.git
Note: Currently only supported for BAGEL model.
-
Process the dataset into OpenAI chat format (JSONL/JSON/Arrow/CSV)
hf download kcz358/open-thoughts-debug --local-dir data/open_thoughts_debug --repo-type dataset
-
Prepare dataset YAML (optional for single data source)
datasets:
- path: data/open_thoughts_debug
data_folder: ""
data_type: arrow
-
Configure training - See examples/qwen3_vl/example_config.yaml or any model-specific config in examples/
Getting Started:
Advanced Topics:
Factory Pattern enables easy extensibility:
# Register a custom dataset
from lmms_engine.datasets import register_dataset, BaseDataset
@register_dataset("my_custom_dataset")
class MyCustomDataset(BaseDataset):
def __init__(self, config):
super().__init__(config)
# Custom initialization
def __getitem__(self, idx):
# Custom data loading
return item
# Register a custom processor
from lmms_engine.datasets.processor import register_processor
@register_processor("my_custom_processor")
class MyCustomProcessor:
def __call__(self, raw_data):
# Custom processing
return processed_data
Builder Pattern for flexible composition:
from lmms_engine.train import TrainRunner
# Configuration defines the pipeline
runner = TrainRunner(config)
runner.build() # Lazy initialization of components
runner.run() # Execute training
Pipeline stages:
- Model initialization - From pretrained or config
- Dataset creation - With processor and collator
- Monkey patching - Apply kernel optimizations
- Trainer setup - FSDP2, DeepSpeed, or custom
- Training execution - With checkpointing and logging
Trainer Type
Use Case
Key Features
| hf_trainer |
General VLM/LM training |
FSDP2, Muon, Liger, Flash Attn |
| dllm_trainer |
Diffusion language models |
Masked LM, custom loss, DLLM collator |
| wan_trainer |
Video generation |
Flow-matching, multi-modal inputs |
| rae_trainer |
Visual autoencoders |
Adversarial loss, EMA, LPIPS |
| sit_trainer |
Diffusion transformers |
Interpolant framework, CFG, EMA |
- Vision-Language Pretraining - Qwen-VL, LLaVA on large multimodal datasets
- Video Understanding - AERO on 3D video data
- Diffusion Models - DLLM, SiT, WanVideo for generation tasks
- Representation Learning - RAE for visual representations
- Language Model Pretraining - DGN, Qwen with Muon optimizer
- Multimodal Fine-tuning - Efficient SFT with sequence packing
We welcome contributions! Please see our Design Principles for coding guidelines:
- Simplicity: Write simple, straightforward code
- Readability: Prioritize clarity over cleverness
- Testability: Create testable components
- Minimal Changes: Only modify code related to the task
- Less Code = Less Debt: Minimize code footprint
Thanks to the following projects for their excellent work:
If you use LMMs Engine in your research, please cite:
@software{lmms_engine2025,
title={LMMs Engine: A simple, unified multimodal framework for pretraining and finetuning.},
author={LMMs-Lab},
year={2025},
url={https://github.com/LMMs-Lab/lmms-engine}
}
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
🎉 Awesome projects using LMMs-Engine
Built with ❤️ by LMMs-Lab
⭐ Star us on GitHub to support the project! ⭐