← 返回首页
GitHub - EvolvingLMMs-Lab/lmms-engine: A simple, unified multimodal models training engine. Lean, flexible, and built for hacking at scale. · GitHub
Skip to content

Navigation Menu

Toggle navigation
Sign in
Appearance settings
Search or jump to...

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Include my email address so I can be contacted

Saved searches

Use saved searches to filter your results more quickly

Appearance settings
Resetting focus

EvolvingLMMs-Lab/lmms-engine

Go to file
Code

Repository files navigation

A simple, unified multimodal models training engine. Lean, flexible, and built for hacking at scale.

Quick StartExamplesModel SupportOptimizationsCodebase ArchitectureDocumentation

Annoucement

  • [2025-10] 🎉🎉 Efficiency Report: We provide comprehensive Model FLOPs Utilization (MFU) metrics for various model architectures and training configurations. See MFU Reference for detailed benchmarks.
  • [2025-10] 🚀🚀 LMMs-Engine v0.1 is here! a lean, efficient framework built to train unified multimodal model at scale.

🚀 Quick Start

Installation

# Clone the repository git clone https://github.com/LMMs-Lab/lmms-engine.git cd lmms-engine # Install editable packages uv pip install -e ".[all]" # or install as a packages uv pip install -e . # Install a stable release uv pip install lmms-engine # Install dependencies using uv sync # For Linux systems (recommended - auto-detects platform): bash uv_sync_linux.sh # For other systems or if encountering errors: uv sync # If uv sync fails, try: uv pip install -r requirements.txt # Optional: Performance optimizations uv pip install flash-attn --no-build-isolation uv pip install liger-kernel

Docker

We provide Docker images with pre-built environments including PyTorch, CUDA, and all necessary dependencies.

docker run --gpus all -it --rm \ -v $(pwd):/workspace \ -w /workspace \ fatbao55/lmms-engine:v1.0 \ bash

Launch Training

Recommended: torchrun (native PyTorch)

torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 \ --master_addr=127.0.0.1 --master_port=12355 \ -m lmms_engine.launch.cli config_yaml=examples/qwen3_vl/example_config.yaml

Alternative: Accelerate

accelerate launch --use_fsdp \ -m lmms_engine.launch.cli config_yaml=examples/qwen3_vl/example_config.yaml

Single GPU

python -m lmms_engine.launch.cli config_yaml=examples/qwen3_vl/example_config.yaml

🔥 Featured Examples

Model Quick Start FSDP2 TP USP Muon Liger Packing NSA EP Highlights
BAGEL run.sh TBD Unified visual understanding & generation
Qwen2.5 run.sh Large Language Model
Qwen2.5-VL run.sh Multimodal Model
Qwen2.5-Omni run.sh Unified multimodal (image, audio, text)
Qwen3-VL run.sh Native-resolution, long context (10K+ tokens)
Qwen3-VL MoE run.sh Vision-Language MoE with EP (image, video, text)
Qwen3-MoE run.sh Mixture-of-Experts, Expert Parallelism
Qwen3-Omni MoE config Multimodal MoE with EP (image, audio, text)
WanVideo run.sh T2V/I2V/V2V generation (1.3B/14B)
FLA models run.sh Efficient architecture, FineWeb-Edu pretraining
dLLM (Qwen3) run.sh Masked diffusion language model
RAE-SigLip run.sh Representation AutoEncoder, LPIPS, EMA
SiT run.sh Interpolant Transformer, CFG, ImageNet-1K

Optimization Legend:

  • FSDP2: Fully Sharded Data Parallel v2 for distributed training
  • TP: Tensor Parallelism for sharding model compute across GPUs
  • USP: Ulysses Sequence Parallel for long contexts
  • Muon: Advanced optimizer with Newton-Schulz orthogonalization
  • Liger: Triton fused kernels (CrossEntropy, RMSNorm, RoPE, SwiGLU) for 30% memory reduction
  • Packing: First-fit bin packing for peaking at 35-40% MFU vs 20-25% (w/o in Qwen2.5-VL finetuning)
  • NSA: Native Sparse Attention for efficient long-context processing
  • EP: Expert Parallelism for Mixture-of-Experts models, sharding experts across GPUs

💡 Tip: Each run.sh file contains detailed setup instructions, prerequisites, and configuration options.

🤖 Model Support

20+ architectures spanning vision-language, diffusion, and language models.

Multimodal Models

  • Qwen2.5-VL - SOTA level performance vision-language model
  • Qwen3-VL - SOTA level performance vision-language model
  • Qwen3-VL MoE - Vision-Language Mixture-of-Experts with Expert Parallelism and Sequence Parallelism support
  • Qwen2.5-Omni - Unified vision + audio + text modalities
  • Qwen3-Omni MoE - Multimodal Mixture-of-Experts with vision + audio + text and Expert Parallelism support
  • LLaVA-OneVision - Fully open-source vision-language model
  • Bagel - Unified multimodal model for visual understanding and generation
  • Aero - Lightweight audio-language model

Diffusion & Generative Models

  • dLLM (Qwen3) - Diffusion Language Model with masked prediction
  • WanVideo (1.3B/14B) - Text/Image-to-Video generation (T2V/I2V/V2V)
  • SiT (XL/2) - Scalable Interpolant Transformers for class-conditional image generation
  • RAE-SigLip - Representation AutoEncoder with adversarial discriminator

Language Models

  • Qwen2/2.5/3 series - Full Liger kernel support with fused operations
  • Linear Attention Models - Recurrent architecture optimized for Muon; Please install FLA first.
  • Custom architectures - Extensible via @register_model() decorator

⚡️ Optimizations

Production-grade efficiency from distributed training to kernel fusion.

Core Distributed Training

  • FSDP2 - PyTorch 2.0+ DTensor-based sharding for parameters, gradients, and optimizer states. Improved composability over original FSDP enables flexible parallelism composition.

  • Ulysses Sequence Parallel - Splits sequence dimension across GPUs for ultra-long contexts. Critical for vision-language models like Qwen3-VL with 10K+ visual tokens.

  • Multi-dimensional Parallelism - Compose TP × Ulysses SP/CP × DP meshes for cluster-scale training.

Memory & Compute Optimizations

  • Flash Attention + Unpadding - Tiled attention with use_rmpad eliminates all padding computation.

  • Native Sparse Attention (NSA) - Hybrid attention mechanism combining compressed attention, topk sparse attention, and sliding window attention.

  • Liger Kernel - Triton fused kernels (CrossEntropy, RMSNorm, RoPE, SwiGLU) achieve memory reduction by avoiding intermediate materializations.

  • Monkey Patching System - Runtime kernel injection via lmms_engine/configs/monkey_patch/ for model-specific optimizations without code modification.

  • Sequence Packing - Faster first-fit bin packing.

Advanced Optimizer

  • Muon Optimizer - Newton-Schulz orthogonalization with Triton kernels, distributed via DTensor. Selective 2D-parameter application outperforms AdamW convergence.

Data Pipeline

  • Streaming Datasets - IterableDataset for trillion-token pretraining without full data loading.

Configuration Examples

Sequence Packing - with full unpadding
dataset_config: packing: true packing_strategy: first_fit packing_length: 32000 trainer_args: use_rmpad: true # Requires flash-attn use_liger_kernel: true
Liger Kernel - Enable LinkedIn's Triton kernels for 30% memory reduction
trainer_args: use_liger_kernel: true

Fused operations:

  • CrossEntropy (major memory savings)
  • RMSNorm, RoPE, SwiGLU
  • Automatically applied via monkey patching
Muon Optimizer - State-of-the-art optimizer for LLMs
trainer_args: use_muon: true # enable muonwithadam optimizer adam_beta1: 0.9 # for the adam part in muonwithadam optimizer adam_beta2: 0.999 # for the adam part in muonwithadam optimizer adam_epsilon: 1.0e-8 # for the adam part in muonwithadam optimizer learning_rate: 0.001 weight_decay: 0.01 # ns_steps: 5 # Newton-Schulz iterations (default) # for some modules which the user hope to

Features:

  • Newton-Schulz orthogonalization with Triton kernels
  • Distributed via DTensor (FSDP2)
  • Selective 2D parameter application

Note If users wish to specify whether a module should be optimized using Muon or Adam, they can designate this in lmms_engine.train.hf.trainer.create_optimizer. By default, modules excluded from Muon optimization include those containing the following substrings in their names: ["emb", "norm", "lm_head", "bias", "wte", "wpe", "output", "a_proj", "b_proj", "conv1d", "rotary"] as well as any parameters whose dimension does not equal 2.

FSDP2 Configuration
trainer_args: fsdp2: true fsdp_config: transformer_layer_cls_to_wrap: ["Qwen2VLDecoderLayer"] reshard_after_forward: false activation_checkpointing: true
Ulysses Sequence Parallel - For long-sequence VLMs
trainer_args: sp_ulysses_degree: 2 # Sequence parallel degree

Benefits:

  • Splits sequence length across GPUs
  • Reduces memory footprint for long contexts
  • Works with Flash Attention
Native Sparse Attention (NSA) - Efficient long-context attention for BAGEL
model_config: load_from_pretrained_path: "lmms-lab/BAGEL-7B-MoT-ver.LE" monkey_patch: - type: nsa model_type: bagel kwargs: block_size: 64 compress_type: "weightedpool" # weightedpool, linear, avgpool kernel_size: 32 kernel_stride: 16 topk: 16 init_blocks: 1 local_blocks: 2 window_size: 512

Features:

  • Compressed attention with key-value compression
  • TopK sparse attention for efficiency
  • Sliding window attention for local context
  • Hybrid mechanism combines all three attention types
  • Requires: pip install git+https://github.com/XunhaoLai/native-sparse-attention-triton.git

Note: Currently only supported for BAGEL model.

📖 Documentation

Step-by-Step Workflow

  1. Process the dataset into OpenAI chat format (JSONL/JSON/Arrow/CSV)

    hf download kcz358/open-thoughts-debug --local-dir data/open_thoughts_debug --repo-type dataset
  2. Prepare dataset YAML (optional for single data source)

    datasets: - path: data/open_thoughts_debug data_folder: "" data_type: arrow
  3. Configure training - See examples/qwen3_vl/example_config.yaml or any model-specific config in examples/

Comprehensive Guides

Getting Started:

Advanced Topics:

🏗️ Codebase Architecture

Component Registry

Factory Pattern enables easy extensibility:

# Register a custom dataset from lmms_engine.datasets import register_dataset, BaseDataset @register_dataset("my_custom_dataset") class MyCustomDataset(BaseDataset): def __init__(self, config): super().__init__(config) # Custom initialization def __getitem__(self, idx): # Custom data loading return item # Register a custom processor from lmms_engine.datasets.processor import register_processor @register_processor("my_custom_processor") class MyCustomProcessor: def __call__(self, raw_data): # Custom processing return processed_data

Training Pipeline

Builder Pattern for flexible composition:

from lmms_engine.train import TrainRunner # Configuration defines the pipeline runner = TrainRunner(config) runner.build() # Lazy initialization of components runner.run() # Execute training

Pipeline stages:

  1. Model initialization - From pretrained or config
  2. Dataset creation - With processor and collator
  3. Monkey patching - Apply kernel optimizations
  4. Trainer setup - FSDP2, DeepSpeed, or custom
  5. Training execution - With checkpointing and logging

Supported Trainers

Trainer Type Use Case Key Features
hf_trainer General VLM/LM training FSDP2, Muon, Liger, Flash Attn
dllm_trainer Diffusion language models Masked LM, custom loss, DLLM collator
wan_trainer Video generation Flow-matching, multi-modal inputs
rae_trainer Visual autoencoders Adversarial loss, EMA, LPIPS
sit_trainer Diffusion transformers Interpolant framework, CFG, EMA

🎯 Use Cases

  • Vision-Language Pretraining - Qwen-VL, LLaVA on large multimodal datasets
  • Video Understanding - AERO on 3D video data
  • Diffusion Models - DLLM, SiT, WanVideo for generation tasks
  • Representation Learning - RAE for visual representations
  • Language Model Pretraining - DGN, Qwen with Muon optimizer
  • Multimodal Fine-tuning - Efficient SFT with sequence packing

🤝 Contributing

We welcome contributions! Please see our Design Principles for coding guidelines:

  • Simplicity: Write simple, straightforward code
  • Readability: Prioritize clarity over cleverness
  • Testability: Create testable components
  • Minimal Changes: Only modify code related to the task
  • Less Code = Less Debt: Minimize code footprint

😊 Acknowledgement

Thanks to the following projects for their excellent work:

📝 Citation

If you use LMMs Engine in your research, please cite:

@software{lmms_engine2025, title={LMMs Engine: A simple, unified multimodal framework for pretraining and finetuning.}, author={LMMs-Lab}, year={2025}, url={https://github.com/LMMs-Lab/lmms-engine} }

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

🔗 Links

🎉 Awesome projects using LMMs-Engine

  • LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

  • OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

Built with ❤️ by LMMs-Lab

Star us on GitHub to support the project!

About

A simple, unified multimodal models training engine. Lean, flexible, and built for hacking at scale.

Topics

Resources

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages

Footer

© 2026 GitHub, Inc.