ICML 2026
¹University of California, Merced  ²Adobe Research
*Work done as an intern at Adobe
MotiMotion reformulates motion control as a reasoning-then-generation problem. It refines sparse user trajectories, predicts plausible secondary motion, and uses confidence-aware control to balance faithfulness and realism.
Abstract
Current motion-controlled image-to-video models often treat user-provided trajectories as literal commands, even when those inputs are sparse, imprecise, or causally incomplete. MotiMotion introduces a visual-language reasoner that refines primary motion, hallucinates plausible secondary effects, and guides a confidence-aware video generator to produce more natural and physically grounded outcomes.
Method
A training-free visual-language reasoner interprets the input image, trajectories, and optional text prompt to refine primary motion and hallucinate plausible secondary effects. These enriched plans guide a video generator to produce causally grounded videos.
MotiMotion uses a training-free visual-language reasoner to refine user intent, infer missing motion, and improve controllability for complex interactions.
We assign a confidence score to each trajectory and modulate the motion conditioning strength accordingly. High-confidence inputs are followed strictly; low-confidence inputs let the generative prior fill in natural dynamics.
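The idea of confidence-modulated conditioning can be illustrated with a minimal sketch. This is not the MotiMotion implementation: the linear mapping, the [0, 1] confidence scale, the strength bounds, and the names `Trajectory` and `conditioning_strength` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    """Hypothetical container for one motion trajectory."""
    points: List[Tuple[float, float]]  # (x, y) positions over time
    confidence: float                  # assumed to lie in [0, 1]


def conditioning_strength(
    confidence: float,
    min_strength: float = 0.2,  # assumed floor: low-confidence paths still guide weakly
    max_strength: float = 1.0,  # assumed ceiling: high-confidence paths are followed strictly
) -> float:
    """Map a confidence score to a motion-conditioning strength.

    Assumes a simple linear interpolation; the actual schedule used by
    the authors is not specified here.
    """
    c = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]
    return (1.0 - c) * min_strength + c * max_strength


def strengths_for(trajectories: List[Trajectory]) -> List[float]:
    """Compute one conditioning strength per trajectory."""
    return [conditioning_strength(t.confidence) for t in trajectories]


if __name__ == "__main__":
    user_path = Trajectory(points=[(0.1, 0.5), (0.4, 0.5)], confidence=1.0)
    predicted_path = Trajectory(points=[(0.6, 0.7), (0.7, 0.9)], confidence=0.3)
    print(strengths_for([user_path, predicted_path]))
```

Under these assumptions, a user-drawn trajectory with confidence 1.0 receives the full conditioning strength, while a predicted trajectory with lower confidence receives a weaker one, leaving more room for the generative prior.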
Images from MoveBench and MotionEdit.
MotiBench
Each sample pairs an image with a prompt that requires plausible causal effects and world-aware motion understanding.
Results
Red lines indicate user trajectories. Blue lines indicate predicted trajectories.
Comparison
Confidence-Aware Control
We set user trajectories (red lines) to high confidence and show how the generated motion differs when the predicted trajectories (blue lines) are assigned high versus low confidence.
Citation