LMMs-Lab · GitHub

LMMs-Lab

Feeling and building multimodal intelligence.



LMMs-Lab: Building Multimodal Intelligence

We are a group of researchers focused on large multimodal models (LMMs). We aim to share insights from our research with the community.

Discord

🏗️ Models & Training

LLaVA-OneVision 1.5 ⭐ 754

A fully open-source family of Large Multimodal Models achieving state-of-the-art performance at substantially lower cost. Trains on native resolution images with an end-to-end MegatronLM-based framework supporting MoE, FP8, and long sequence parallelization — all for under $16,000 on A100 GPUs. Outperforms Qwen2.5-VL on most benchmarks. Includes open pre-training & SFT data, training code, recipes, and full logs.

🤗 Models & Datasets | 🖥️ Demo | 📄 Tech Report

NEO ⭐ 653 ICLR 2026

NEO Series: Native Vision-Language Models built from first principles. Rethinks the multimodal architecture by deeply integrating vision and language capabilities within a dense, monolithic model architecture, rather than bolting a vision encoder onto a language model. With merely 390M image-text examples, NEO develops strong visual perception from scratch, rivaling top-tier modular VLMs and outperforming native ones.

📄 Paper | 🤗 Models

OneVision-Encoder ⭐ 269 CVPR 2025

A vision encoder designed around codec-aligned sparsity as a foundational principle for multimodal intelligence. Abandons uniform computation to selectively encode only 3.1%-25% of regions rich in signal entropy, consistently outperforming Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks despite using substantially fewer visual tokens.

🌐 Project Page | 📄 Paper | 🤗 Models
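To make the idea of encoding only signal-rich regions concrete, here is a toy sketch that ranks image patches by the Shannon entropy of their pixel values and keeps a fixed fraction. This is an illustration under loud assumptions, not the OneVision-Encoder method: the description above says the encoder is codec-aligned, and the function names, the histogram-based entropy, and the sample patches below are all hypothetical.

```python
import math
from collections import Counter


def patch_entropy(patch):
    """Shannon entropy (in bits) of the intensity histogram of one patch."""
    counts = Counter(patch)
    total = len(patch)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_patches(patches, keep_ratio):
    """Return indices of the highest-entropy patches, in original order."""
    k = max(1, round(len(patches) * keep_ratio))
    ranked = sorted(
        range(len(patches)),
        key=lambda i: patch_entropy(patches[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


# Hypothetical flattened 2x2 grayscale patches (assumption for illustration).
patches = [
    [0, 0, 0, 0],        # flat region
    [0, 255, 0, 255],    # two distinct values
    [0, 64, 128, 255],   # four distinct values
    [7, 7, 7, 7],        # flat region
]
kept = select_patches(patches, keep_ratio=0.25)
```

With a keep ratio of 0.25, only the patch with the most varied content survives; flat regions are dropped first.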

Otter ⭐ 3.3k IEEE TPAMI 2025

A multi-modal model based on OpenFlamingo (the open-sourced version of DeepMind's Flamingo), trained on the MIMIC-IT dataset with 2.8M multimodal in-context instruction-response pairs. Demonstrates improved instruction-following and in-context learning capabilities across vision-language tasks and served as an early exploration into instruction-tuned multimodal models.

📄 Otter Paper | 📄 MIMIC-IT Paper | 🤗 Models | 🤗 MIMIC-IT Dataset

LongVA ⭐ 402 TMLR 2025

Transfers long-context capabilities from language to vision. LongVA can process 2000 frames or over 200K visual tokens, achieving state-of-the-art performance on Video-MME among 7B models — demonstrating that long context capability can zero-shot transfer from language to vision.

🌐 Blog | 📄 Paper | 🤗 Models | 🎥 Demo

RelateAnything ⭐ 462

The Relate Anything Model (RAM) takes an image as input and leverages SAM to identify corresponding masks, then reasons about relationships between any detected objects. Built on the Panoptic Scene Graph Generation work (ECCV 2022).

🤗 Demo | 📦 PSG Dataset

🧠 Reasoning & Reinforcement Learning

OpenR1-Multimodal ⭐ 1.5k

A speed-run investigation of R1's paradigm applied to multimodal models. Built on top of open-r1 and trl, this project adds multimodal model training with the GRPO algorithm, open-sourcing 8K multimodal RL training examples, trained models, and training scripts for community study on multimodal reasoning.

🤗 Models | 🤗 Datasets | 📊 Wandb Logs
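For readers new to GRPO, its central step scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that normalization step, assuming scalar rewards; it is not the project's training code, and the function name, the `eps` term, and the reward values are illustrative assumptions. It uses the population standard deviation, and real implementations may differ on that choice.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average get a positive advantage and the rest get a negative one, so no separate value model is needed to form the baseline.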

OpenMMReasoner ⭐ 145 CVPR 2026

A fully transparent two-stage recipe (SFT + RL) for pushing the frontiers of multimodal reasoning. Constructs an 874K-sample cold-start dataset with step-by-step validation and a 74K-sample RL dataset, achieving 11.6% improvement over Qwen2.5-VL-7B-Instruct across nine multimodal reasoning benchmarks.

📄 Paper | 🌐 Project Page | 🤗 Models | 🤗 Data | 🌐 Blog

MMSearch-R1 ⭐ 402

An end-to-end RL framework that enables LMMs to perform on-demand, multi-turn search with real-world multimodal search tools. Integrates both image and text search capabilities, training models to autonomously reason about when and how to invoke external search tools.

📄 Paper | 🌐 Blog | 🤗 Model | 🤗 Data

LongVT ⭐ 195 CVPR 2026

Incentivizes "Thinking with Long Videos" via native tool calling. LongVT exploits LMMs' inherent temporal grounding ability as a native video cropping tool, enabling a global-to-local reasoning loop where the model skims globally and examines relevant clips for details until answers are grounded in visual evidence.

📊 Evaluation & Analysis

LMMS-Eval ⭐ 3.8k

The unified evaluation toolkit for large multimodal models, covering 100+ tasks across text, image, video, and audio. Supports 30+ models with reproducible, efficient, and statistically grounded benchmarking. Available on PyPI and translated into 17 languages.

🏠 Homepage | 📚 Documentation | 📦 PyPI

Multimodal-SAE ⭐ 183 ICCV 2025

For the first time in the multimodal domain, demonstrates that features learned by Sparse Autoencoders (SAEs) in a smaller LMM can be interpreted by a larger LMM. Provides a complete auto-interpretation pipeline for analyzing open-semantic features and steering model behavior.

📄 Paper | 🤗 Models & Data
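As background, a sparse autoencoder maps an activation vector to a wider, mostly-zero feature vector and then reconstructs the input from it. The following is a minimal forward-pass sketch of a generic ReLU SAE with an L1 sparsity penalty, written in plain Python; it is not the Multimodal-SAE implementation, and the weights, sizes, and `l1_coeff` value are assumptions chosen for readability.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=0.01):
    """Return (features, reconstruction, loss) for one input vector."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    h = [max(0.0, p) for p in pre]                      # ReLU features
    x_hat = [r + b for r, b in zip(matvec(W_dec, h), b_dec)]
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    sparsity = sum(abs(a) for a in h)                   # L1 penalty term
    return h, x_hat, recon + l1_coeff * sparsity


# Hypothetical 2-dimensional activation and 3 SAE features.
x = [1.0, -1.0]
W_enc = [[1, 0], [0, 1], [1, 1]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1, 0, 0], [0, 1, 0]]
b_dec = [0.0, 0.0]
h, x_hat, loss = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
```

Each nonzero entry of `h` is a candidate feature; an auto-interpretation pipeline would then look at the inputs that activate a given feature most strongly.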

🔬 Training Frameworks

LMMs-Engine ⭐ 735

A simple, unified multimodal model training engine. Supports FSDP2, USP, Muon optimizer, Liger kernel, packing, and expert parallelism across models like Qwen2.5-VL, Qwen3-VL, BAGEL, WanVideo, and more. Lean, flexible, and built for hacking at scale.

🐳 Docker | 📦 PyPI

🌍 Datasets & Benchmarks

EgoLife ⭐ 399 CVPR 2025

For one week, six individuals lived together, capturing every moment through AI glasses, creating the EgoLife dataset. Includes EgoGPT (omni-modal clip-level understanding) and EgoRAG (long-context QA with hierarchical memory). Built to drive the future of egocentric AI life assistants.

📄 Paper | 🌐 Project Page | 🤗 Data

Pinned

lmms-eval Public

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

Python · 4.2k stars · 591 forks

Otter Public

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.

Python · 3.4k stars · 210 forks

LongVA Public

Long Context Transfer from Language to Vision

Python · 404 stars · 18 forks

multimodal-sae Public

[ICCV 2025] Auto Interpretation Pipeline and many other functionalities for Multimodal SAE Analysis.

Python · 198 stars · 11 forks

open-r1-multimodal Public

A fork to add multimodal model training to open-r1

Python · 1.5k stars · 72 forks

EgoLife Public

[CVPR 2025] EgoLife: Towards Egocentric Life Assistant

Python · 428 stars · 19 forks

Repositories


LLaVA-OneVision-2 Public
Fully Open Framework for Democratized Multimodal Training

Python · 919 stars · Apache-2.0 · 70 forks · 45 issues · 10 pull requests · Updated May 24, 2026
lmms-engine Public
A simple, unified multimodal models training engine. Lean, flexible, and built for hacking at scale.

Python · 778 stars · 35 forks · 10 issues · 0 pull requests · Updated May 22, 2026
lmms-eval Public
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

Python · 4,155 stars · 591 forks · 25 issues · 14 pull requests · Updated May 22, 2026
ParaVT Public
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Python · 9 stars · Apache-2.0 · 0 forks · 0 issues · 0 pull requests · Updated May 21, 2026
LongVT Public
[CVPR 2026] LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Python · 232 stars · Apache-2.0 · 13 forks · 3 issues (1 issue needs help) · 0 pull requests · Updated May 19, 2026
OneVision-Encoder Public
Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

Python · 354 stars · Apache-2.0 · 19 forks · 12 issues · 3 pull requests · Updated May 13, 2026
Evolving-Visual-Generation Public
[Roadmap] Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

TeX · 105 stars · 4 forks · 0 issues · 0 pull requests · Updated May 13, 2026
EASI Public
Holistic Evaluation of Multimodal LLMs on Spatial Intelligence

Python · 110 stars · Apache-2.0 · 8 forks · 2 issues · 1 pull request · Updated May 11, 2026
VLMEvalKit Public
An open-source evaluation toolkit to evaluate MLLMs on Spatial Intelligence using the EASI protocol

Python · 19 stars · Apache-2.0 · 1 fork · 0 issues · 0 pull requests · Updated May 8, 2026
SimpleStream Public
A simple video streaming baseline that outperforms SOTAs.

Python · 131 stars · 6 forks · 1 issue · 0 pull requests · Updated May 1, 2026


People

This organization has no public members. You must be a member to see who’s a part of this organization.
