Figure: MLLMs know when during prefill but forget during decoding. During prefill (left), attention from query tokens peaks at the ground-truth interval. During decoding (right), attention from the generated answer tokens drifts away to a visually salient but query-irrelevant segment.
Although modern MLLMs describe video content fluently, their timestamp predictions remain unreliable. We probe the cross-modal attention of MLLMs and uncover a striking perception-generation gap: the model often already knows the correct temporal interval during prefill, but loses this signal when generating the final answer.
We hypothesize that this gap arises from an asymmetry between reading and speaking. During prefill, the full query provides focused linguistic intent, allowing TG-Heads to bind discriminative words to the relevant frames. During decoding, by contrast, the numeric timestamp tokens being generated carry little semantic guidance, yet they must attend over hundreds of visual tokens, which dilutes the localized signal.
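One way to make the gap concrete is to measure how much cross-modal attention falls inside the ground-truth interval, once for the query tokens at prefill and once for the generated answer tokens at decoding. The following is a minimal sketch of such a measurement, not the probing procedure used in this work. It assumes the attention weights have already been extracted and averaged over heads and source tokens into a single vector over visual tokens, and that each frame contributes a fixed number of visual tokens. The names `frame_attention` and `interval_mass` are hypothetical, and the numbers are synthetic values chosen only to mimic the pattern in the figure.

```python
def frame_attention(attn, tokens_per_frame):
    """Pool per-token attention into per-frame attention by summing
    the weights of the visual tokens that belong to each frame."""
    n_frames = len(attn) // tokens_per_frame
    return [
        sum(attn[f * tokens_per_frame:(f + 1) * tokens_per_frame])
        for f in range(n_frames)
    ]


def interval_mass(frame_attn, start, end):
    """Fraction of the total frame-level attention that lands on
    frames in the half-open range [start, end)."""
    total = sum(frame_attn)
    if total <= 0:
        return 0.0
    return sum(frame_attn[start:end]) / total


# Synthetic example (assumed values): 8 frames, 4 visual tokens per frame.
tokens_per_frame = 4
gt_start, gt_end = 2, 4  # ground-truth interval covers frames 2 and 3

# Prefill: attention from query tokens is concentrated on the ground-truth frames.
prefill_attn = [0.01] * 32
for i in range(8, 16):
    prefill_attn[i] = 0.1

# Decoding: attention from answer tokens drifts to frame 6, a salient distractor.
decode_attn = [0.01] * 32
for i in range(24, 28):
    decode_attn[i] = 0.2

prefill_frames = frame_attention(prefill_attn, tokens_per_frame)
decode_frames = frame_attention(decode_attn, tokens_per_frame)

prefill_mass = interval_mass(prefill_frames, gt_start, gt_end)
decode_mass = interval_mass(decode_frames, gt_start, gt_end)

print(f"prefill mass inside ground-truth interval: {prefill_mass:.3f}")
print(f"decode  mass inside ground-truth interval: {decode_mass:.3f}")
```

In this toy setting, a large difference between the two masses corresponds to the perception-generation gap described above: the interval is well localized at prefill but receives little attention at decoding.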