This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Website source code based on the Nerfies project page. If you want to reuse their source code, please credit them appropriately.
Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time tuning of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.
Overview of VChain. We introduce VChain, an inference-time tuning framework for reasoning in video generation. Given a user-provided prompt (e.g., “A rock and a feather are falling from the sky towards the ground.”), VChain leverages large multimodal models to generate a Chain of Visual Thoughts: a sparse set of causally important keyframes that guide the video generator via Sparse Inference-Time Tuning. VChain improves reasoning in video generation without extensive re-training.
An overview of our three-stage inference-time pipeline for reasoning in video generation. (a) Visual Thought Reasoning: Given a user-provided text prompt, a large multimodal model (GPT-4o) infers a causal chain of events and generates a sequence of keyframes, termed the Chain of Visual Thoughts, via iterative reasoning and image synthesis. (b) Sparse Inference-Time Tuning: These visual thoughts, each paired with its corresponding textual thought, serve as sparse supervision for fine-tuning a pre-trained video generator via LoRA. (c) Video Sampling: The full sequence of textual thoughts is concatenated into a single prompt, which the fine-tuned model uses to generate the final video.
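The control flow of the three stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `VisualThought`, `reason_visual_thoughts`, `build_sampling_prompt`, and `run_pipeline` are hypothetical, as are the `max_steps` limit and the choice of a single space as the separator when concatenating textual thoughts. The multimodal model, the image synthesizer, the LoRA tuning step, and the video sampler are passed in as plain callables, so no specific model API is assumed.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class VisualThought:
    """One step of the chain: a textual thought plus its keyframe (hypothetical type)."""
    text: str      # textual description of the scene at this key moment
    keyframe: Any  # image synthesized for this moment; the type is left open


def reason_visual_thoughts(
    prompt: str,
    next_thought: Callable[[str, List[str]], Optional[str]],
    render_keyframe: Callable[[str, Any], Any],
    max_steps: int = 6,  # assumed cap on chain length, not taken from the source
) -> List[VisualThought]:
    """Stage (a): alternate between reasoning about the next event and synthesizing its keyframe."""
    thoughts: List[VisualThought] = []
    for _ in range(max_steps):
        # Ask the reasoning model for the next consequence given the history so far.
        text = next_thought(prompt, [t.text for t in thoughts])
        if text is None:  # the model signals that the causal chain is complete
            break
        previous = thoughts[-1].keyframe if thoughts else None
        thoughts.append(VisualThought(text, render_keyframe(text, previous)))
    return thoughts


def build_sampling_prompt(thoughts: List[VisualThought]) -> str:
    """Stage (c), first half: concatenate all textual thoughts into one prompt."""
    return " ".join(t.text.strip() for t in thoughts)


def run_pipeline(
    prompt: str,
    next_thought: Callable[[str, List[str]], Optional[str]],
    render_keyframe: Callable[[str, Any], Any],
    tune_on_keyframes: Callable[[List[VisualThought]], Any],
    sample_video: Callable[[Any, str], Any],
) -> Any:
    """Run stages (a) to (c) in order and return whatever the sampler produces."""
    thoughts = reason_visual_thoughts(prompt, next_thought, render_keyframe)
    # Stage (b): sparse tuning sees only the keyframes and their paired texts.
    tuned_model = tune_on_keyframes(thoughts)
    # Stage (c): sample from the tuned model with the concatenated prompt.
    return sample_video(tuned_model, build_sampling_prompt(thoughts))
```

In this sketch, `tune_on_keyframes` stands in for the LoRA fine-tuning step and receives only the sparse (keyframe, text) pairs, which mirrors the point that supervision is applied at key moments rather than densely over the whole video.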
We compare VChain with baselines and ablation variants ... (more coming soon)
Input Prompt: A rock and a feather falling from the sky towards the ground.
Input Prompt: An ice cream cone is left out in the sun.
Input Prompt: A rubber duck and a rock fall into a water tank.
Input Prompt: A steel ball is dropped into water.
If you find our work useful, please consider citing our paper:
@InProceedings{huang2026vchain,
  title     = {{VChain}: Chain-of-Visual-Thought for Reasoning in Video Generation},
  author    = {Huang, Ziqi and Yu, Ning and Chen, Gordon and Qiu, Haonan and Debevec, Paul and Liu, Ziwei},
  booktitle = {Annual Meeting of the Association for Computational Linguistics (ACL Findings), 2026},
  year      = {2026}
}