We are a group of researchers focused on large multimodal models (LMMs). We aim to share insights with the community through our research.
A fully open-source family of Large Multimodal Models achieving state-of-the-art performance at substantially lower cost. Trains on native resolution images with an end-to-end MegatronLM-based framework supporting MoE, FP8, and long sequence parallelization — all for under $16,000 on A100 GPUs. Outperforms Qwen2.5-VL on most benchmarks. Includes open pre-training & SFT data, training code, recipes, and full logs.
🤗 Models & Datasets | 🖥️ Demo | 📄 Tech Report
NEO Series: Native Vision-Language Models built from first principles. Rethinks the multimodal architecture by deeply integrating vision and language capabilities within a dense, monolithic model architecture, rather than bolting a vision encoder onto a language model. With merely 390M image-text examples, NEO develops strong visual perception from scratch, rivaling top-tier modular VLMs and outperforming native ones.
A vision encoder designed around codec-aligned sparsity as a foundational principle for multimodal intelligence. Abandons uniform computation to selectively encode only 3.1%-25% of regions rich in signal entropy, consistently outperforming Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks despite using substantially fewer visual tokens.
🌐 Project Page | 📄 Paper | 🤗 Models
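The idea of encoding only the most information-rich regions can be illustrated with a small sketch. This is not the project's implementation: it scores each patch by the plain Shannon entropy of its pixel intensities and keeps the top fraction, whereas the encoder described above relies on codec-aligned signals that are not reproduced here. The function names `patch_entropy` and `select_patches` and the `keep_ratio` parameter are hypothetical.

```python
import math
from collections import Counter

def patch_entropy(pixels):
    """Shannon entropy (in bits) of the intensity histogram of one patch."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_patches(patches, keep_ratio=0.25):
    """Return indices of the highest-entropy patches, in original order.

    keep_ratio is a hypothetical knob standing in for the 3.1%-25% range.
    """
    keep = max(1, round(len(patches) * keep_ratio))
    ranked = sorted(
        range(len(patches)),
        key=lambda i: patch_entropy(patches[i]),
        reverse=True,
    )
    return sorted(ranked[:keep])

# Toy "image" of four flattened patches (assumed data, for illustration only).
patches = [
    [0, 0, 0, 0],        # flat region, zero entropy
    [0, 255, 0, 255],    # two intensity levels
    [10, 10, 10, 10],    # flat region, zero entropy
    [0, 64, 128, 255],   # four intensity levels
]
selected = select_patches(patches, keep_ratio=0.5)
```

With `keep_ratio=0.5`, the two flat patches are skipped and only the two textured ones would be passed on for encoding.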
A multi-modal model based on OpenFlamingo (the open-sourced version of DeepMind's Flamingo), trained on the MIMIC-IT dataset with 2.8M multimodal in-context instruction-response pairs. Demonstrates improved instruction-following and in-context learning capabilities across vision-language tasks and served as an early exploration into instruction-tuned multimodal models.
📄 Otter Paper | 📄 MIMIC-IT Paper | 🤗 Models | 🤗 MIMIC-IT Dataset
Transfers long-context capabilities from language to vision. LongVA can process 2000 frames or over 200K visual tokens, achieving state-of-the-art performance on Video-MME among 7B models — demonstrating that long context capability can zero-shot transfer from language to vision.
🌐 Blog | 📄 Paper | 🤗 Models | 🎥 Demo
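Read together, the two figures above imply a budget on the order of 100 or more visual tokens per frame. The sketch below just makes that arithmetic explicit. The per-frame token count is not stated here, so `tokens_per_frame` is an assumption, and both helper names are hypothetical.

```python
def visual_token_count(num_frames, tokens_per_frame):
    """Total visual tokens produced by a clip."""
    return num_frames * tokens_per_frame

def max_frames(context_budget, tokens_per_frame, reserved_text_tokens=0):
    """How many frames fit in a context window after reserving room for text."""
    usable = context_budget - reserved_text_tokens
    return max(0, usable // tokens_per_frame)

# Assumed value: 100 tokens per frame is the lower bound implied by
# 2000 frames and 200K visual tokens, not a number taken from the model.
total_tokens = visual_token_count(2000, 100)
frames_with_prompt = max_frames(200_000, 100, reserved_text_tokens=1_000)
```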
The Relate Anything Model (RAM) takes an image as input and leverages SAM to identify corresponding masks, then reasons about relationships between any detected objects. Built on the Panoptic Scene Graph Generation work (ECCV 2022).
🤗 Demo | 📦 PSG Dataset
A speed-run investigation of R1's paradigm applied to multimodal models. Built on top of open-r1 and trl, this project adds multimodal model training with the GRPO algorithm, open-sourcing 8K multimodal RL training examples, trained models, and training scripts for community study on multimodal reasoning.
🤗 Models | 🤗 Datasets | 📊 Wandb Logs
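GRPO scores each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative normalization step, as it is commonly described, follows. It is not taken from this project's scripts or from trl; the function name and the `eps` default are hypothetical, and implementations differ on details such as which standard deviation estimator they use.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Uses the population standard deviation; this choice is an assumption.
    eps avoids division by zero when every response gets the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, rewarded 1.0 if correct else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers end up with positive advantages and incorrect ones with negative advantages. A group where every answer earns the same reward yields all-zero advantages, so it contributes no learning signal.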
A fully transparent two-stage recipe (SFT + RL) for pushing the frontiers of multimodal reasoning. Constructs an 874K-sample cold-start dataset with step-by-step validation and a 74K-sample RL dataset, achieving an 11.6% improvement over Qwen2.5-VL-7B-Instruct across nine multimodal reasoning benchmarks.
📄 Paper | 🌐 Project Page | 🤗 Models | 🤗 Data | 🌐 Blog
An end-to-end RL framework that enables LMMs to perform on-demand, multi-turn search with real-world multimodal search tools. Integrates both image and text search capabilities, training models to autonomously reason about when and how to invoke external search tools.
📄 Paper | 🌐 Blog | 🤗 Model | 🤗 Data
Incentivizes "Thinking with Long Videos" via native tool calling. LongVT exploits LMMs' inherent temporal grounding ability as a native video cropping tool, enabling a global-to-local reasoning loop where the model skims globally and examines relevant clips for details until answers are grounded in visual evidence.
📄 Paper | 🌐 Project Page | 🤗 Models | 🤗 Data | 🖥️ Demo | 🌐 Blog
The unified evaluation toolkit for large multimodal models, covering 100+ tasks across text, image, video, and audio. Supports 30+ models with reproducible, efficient, and statistically grounded benchmarking. Available on PyPI and translated into 17 languages.
🏠 Homepage | 📚 Documentation | 📦 PyPI
For the first time in the multimodal domain, demonstrates that features learned by Sparse Autoencoders (SAEs) in a smaller LMM can be interpreted by a larger LMM. Provides a complete auto-interpretation pipeline for analyzing open-semantic features and steering model behavior.
📄 Paper | 🤗 Models & Data
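For readers new to Sparse Autoencoders, the sketch below shows the standard formulation in plain Python: a ReLU encoder producing a sparse feature vector, a linear decoder reconstructing the input, and a loss combining reconstruction error with an L1 sparsity penalty. It is a generic illustration, not this project's pipeline; the weights, the `l1_coeff` value, and all function names are made up for the example.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, l1_coeff=1e-3):
    """One SAE forward pass: returns (features, reconstruction, loss)."""
    pre = [a + b for a, b in zip(matvec(W_enc, x), b_enc)]
    z = [max(0.0, v) for v in pre]              # ReLU keeps features non-negative
    x_hat = matvec(W_dec, z)                    # linear decoder
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(v) for v in z)                 # sparsity penalty on features
    return z, x_hat, mse + l1_coeff * l1

# Toy 2-dimensional activation expanded into 3 features (assumed weights).
x = [1.0, -2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
z, x_hat, loss = sae_forward(x, W_enc, b_enc, W_dec)
```

Only the first feature fires for this input. Interpreting what such a feature responds to is the part the pipeline above delegates to a larger LMM.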
A simple, unified multimodal model training engine. Supports FSDP2, USP, Muon optimizer, Liger kernel, packing, and expert parallelism across models like Qwen2.5-VL, Qwen3-VL, BAGEL, WanVideo, and more. Lean, flexible, and built for hacking at scale.
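Of the features listed above, packing is easy to illustrate in isolation: variable-length samples are grouped so that each training sequence is filled close to its maximum length instead of being padded. The sketch below uses a greedy first-fit-decreasing strategy. It is an assumption for illustration, not the engine's actual packing code, and `pack_sequences` is a hypothetical name.

```python
def pack_sequences(lengths, max_len):
    """Group sample indices into bins whose total length stays <= max_len.

    Greedy first-fit over samples sorted from longest to shortest.
    """
    bins = []   # each bin is a list of sample indices
    free = []   # remaining capacity of each bin
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        n = lengths[idx]
        if n > max_len:
            raise ValueError(f"sample {idx} has length {n} > max_len {max_len}")
        for b, cap in enumerate(free):
            if n <= cap:
                bins[b].append(idx)
                free[b] -= n
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([idx])
            free.append(max_len - n)
    return bins

# Token lengths of five samples (assumed values).
lengths = [700, 300, 512, 200, 90]
packed = pack_sequences(lengths, max_len=1024)
```

Here five samples fit into two packed sequences instead of five padded ones. A real implementation also has to keep attention from crossing sample boundaries inside a pack, which this sketch does not address.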
For one week, six individuals lived together, capturing every moment through AI glasses, creating the EgoLife dataset. Includes EgoGPT (omni-modal clip-level understanding) and EgoRAG (long-context QA with hierarchical memory). Built to drive the future of egocentric AI life assistants.
📄 Paper | 🌐 Project Page | 🤗 Data
A simple, unified multimodal model training engine. Lean, flexible, and built for hacking at scale.
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence
[Roadmap] Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
An open-source evaluation toolkit to evaluate MLLMs on Spatial Intelligence using the EASI protocol