Figure: (a) A static-shortcut model answers correctly without tracking motion. (b) Accuracy on dynamic tasks strongly correlates with spatiotemporal sensitivity (Pearson r = −0.87).
Current Video LLMs often answer correctly by exploiting static shortcuts—single-frame cues and language priors—rather than tracking how events unfold over time. This problem becomes especially consequential in RL post-training: GRPO-style RL typically relies on correctness-only rewards, so if a single frame or a language-based guess is enough to answer a training question, the policy can receive high reward without tracking video dynamics.
We call this property spatiotemporal sensitivity: equivariance for dynamic questions and invariance for static questions. Such a behavioral signature is difficult for static-shortcut policies to satisfy consistently.
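As a toy illustration of why a correctness-only reward cannot separate the two behaviors while a sensitivity check can, consider the sketch below. Everything in it is hypothetical: the frame encoding, the two stand-in "models", and the function names are made up for this example. It also assumes temporal reversal as the perturbation, which the text above does not specify, so it should be read as one possible instance of the idea rather than the actual metric or training setup.

```python
from typing import Callable, List, Tuple

# Toy stand-in for a video: each frame is (object x-position, object color).
Frame = Tuple[int, str]
Video = List[Frame]
Model = Callable[[Video, str], str]

# How the correct answer to the dynamic question changes under time reversal.
FLIP = {"left": "right", "right": "left"}


def reverse_time(video: Video) -> Video:
    """Assumed perturbation: play the frames in reverse order."""
    return video[::-1]


def tracking_model(video: Video, question: str) -> str:
    """Hypothetical model that compares the first and last frames."""
    if question == "direction":
        return "right" if video[-1][0] > video[0][0] else "left"
    return video[0][1]  # static question: object color


def shortcut_model(video: Video, question: str) -> str:
    """Hypothetical static-shortcut model: a fixed guess plus one frame."""
    if question == "direction":
        return "right"  # language-prior guess, ignores motion
    return video[len(video) // 2][1]  # reads color from a single frame


def correctness_reward(model: Model, video: Video, question: str, gold: str) -> float:
    """Correctness-only reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if model(video, question) == gold else 0.0


def is_sensitive(model: Model, video: Video, question: str, dynamic: bool) -> bool:
    """Equivariance check for dynamic questions, invariance check for static ones."""
    before = model(video, question)
    after = model(reverse_time(video), question)
    if dynamic:
        return after == FLIP[before]  # answer must flip with the motion
    return after == before  # answer must stay the same


video: Video = [(0, "red"), (1, "red"), (2, "red")]  # object moves right

for name, model in [("tracking", tracking_model), ("shortcut", shortcut_model)]:
    reward = correctness_reward(model, video, "direction", gold="right")
    dynamic_ok = is_sensitive(model, video, "direction", dynamic=True)
    static_ok = is_sensitive(model, video, "color", dynamic=False)
    print(name, reward, dynamic_ok, static_ok)
```

On this single clip both toy models earn the same reward of 1.0, so a correctness-only signal gives no reason to prefer the one that tracks motion. Only the tracking model passes the equivariance check on the dynamic question, and both pass the invariance check on the static one.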