Can a large multimodal model learn to natively invoke tool calls in parallel under agentic RL — when the very prior that enables tool use also destabilizes it?
Long-video understanding is increasingly framed as agentic video reasoning: a large multimodal model (LMM) post-trained with reinforcement learning to invoke video-processing tools. Prior native-RL methods, including our earlier LongVT (CVPR 2026), dispatch these tool calls sequentially, one per turn, which is brittle to single mis-localizations, prone to multi-turn context drift, and linear in inference cost.
We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling: a main agent emits multiple temporal-window crops in a single turn, dispatches them to weight-sharing sub-agents, and aggregates the parallel evidence into a final answer. Applying standard GRPO to ParaVT surfaces two coupled failures driven by the same pretrained tool prior — Format Fragility (the SFT-learned structural tags collapse under temperature sampling) and the Tool Necessity Gap (the skip-tool reward shortcut). We name this trade-off the Tool Prior Paradox and tame it with PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO): a targeted format reward applied only at the structural-token positions most prone to collapse, paired with a per-prompt frame-budget randomization that lets calling the tool earn measurable RL credit. Across seven evaluation splits spanning six long-video benchmarks plus a temporal-grounding split, ParaVT sets a new open-source 7-8B SOTA on six of the seven, improving over the Qwen3-VL-8B base by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64.
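One way to see why the skip-tool shortcut can arise under GRPO, and why varying the frame budget per prompt could counter it, is to look at the group-relative advantage. The sketch below is illustrative only. The advantage formula is the standard GRPO group normalization, but `sample_frame_budget`, the candidate budgets `(8, 16, 32, 64)`, and the seeding scheme are assumptions made for this example and are not taken from the PARA-GRPO recipe.

```python
import random
import statistics


def sample_frame_budget(prompt_id, choices=(8, 16, 32, 64), seed=0):
    """Draw one frame budget per prompt (hypothetical candidate set).

    All rollouts in the same GRPO group share the prompt, so they share
    the budget; different prompts can see different budgets.
    """
    rng = random.Random(f"{seed}:{prompt_id}")
    return rng.choice(choices)


def group_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: reward normalized within one group."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# If every rollout is rewarded equally, whether or not it called a tool,
# the tool call earns no credit at all.
flat = group_advantages([1.0, 1.0, 1.0, 1.0])

# If only the tool-calling rollouts (indices 0 and 2) answer correctly,
# for example under a tight frame budget, they receive positive advantage.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

print(sample_frame_budget("prompt-17"), flat, mixed)
```

Under this reading, a tool call only earns credit on prompts where it changes the reward within the group, which is the situation a randomized frame budget is meant to create more often.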
A main agent emits one or more <tool_call> blocks in a single turn; each call is handled by an independent sub-agent that shares weights with the main agent and returns a short summary; the main agent aggregates the summaries into a final answer. PARA-GRPO targets the structural-format collapse and the skip-tool reward shortcut that vanilla GRPO surfaces on a tool-native LMM.
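The dispatch loop described above can be sketched in a few lines. This is a minimal illustration, not the ParaVT implementation: the JSON payload with `start_s` and `end_s` keys is a hypothetical schema, `sub_agent` is a stub standing in for a weight-sharing model call, and a thread pool stands in for whatever batching the real inference backend uses.

```python
import json
import re
from concurrent.futures import ThreadPoolExecutor

# One match per <tool_call>...</tool_call> block with a JSON object inside.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(turn_text):
    """Extract every well-formed tool-call payload from one main-agent turn."""
    calls = []
    for payload in TOOL_CALL_RE.findall(turn_text):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            continue  # malformed payloads are skipped in this sketch
    return calls


def sub_agent(call):
    """Stub sub-agent: would crop the window and return a short summary."""
    return f"summary of window [{call['start_s']}, {call['end_s']}] s"


def run_turn(turn_text, summarize=sub_agent, max_workers=4):
    """Dispatch all calls from a single turn in parallel and collect summaries."""
    calls = parse_tool_calls(turn_text)
    if not calls:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the order the calls were emitted.
        return list(pool.map(summarize, calls))


turn = (
    "<think>Two candidate moments.</think>"
    '<tool_call>{"start_s": 10, "end_s": 20}</tool_call>\n'
    '<tool_call>{"start_s": 95, "end_s": 110}</tool_call>'
)
print(run_turn(turn))
```

The returned summaries would then be appended to the main agent's context for the final answer. A sequential dispatcher would instead issue one call, wait for its summary, and only then decide on the next.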
All open-source rows were re-evaluated under a unified protocol (image_url channel, 64 frames, per-baseline native prompt). Best result in bold; underlined marks ParaVT's value when it is not the best column-wise. * withheld due to benchmark–training-data overlap. † native tool-call schema not reconcilable with Charades-STA grounding output.
| Proprietary LMMs — best-setting numbers from official reports | |||||||
| GPT-4o | 71.9 | 77.2 | 66.7 | 34.7 | 64.6 | 66.7 | — |
| Gemini 1.5 Pro | 75.0 | 81.3 | 64.4 | 33.1 | 74.3 | 65.8 | — |
| Open Instruct LMMs — direct answer | |||||||
| Qwen2.5-VL-7B | 55.7 | 64.5 | 46.4 | 32.2 | 47.8 | 65.4 | 31.6 |
| Open Reasoning LMMs — <think> → <answer> | |||||||
| Video-R1-7B | 57.6 | 66.0 | 57.4 | 36.9 | 61.6 | 61.3 | 25.4 |
| VideoChat-R1-7B | 50.4 | 58.2 | 49.2 | 23.8 | 58.7 | 65.0 | 31.5 |
| VideoRFT-7B | 58.5 | 65.6 | 55.1 | 38.0 | 44.9 | 42.7 | 18.7 |
| Time-R1-7B | 58.9 | 66.2 | 56.0 | 38.2 | 60.5 | 63.4 | 34.7 |
| ReWatch-R1-7B | 58.8 | 65.0 | 53.6 | 38.5 | 60.1 | 59.8 | 20.2 |
| Video-Thinker-7B | 61.9 | 65.3 | 56.0 | * | 65.2 | 64.5 | 29.0 |
| Open Agentic LMMs — <think> → <tool_call> → <answer> | |||||||
| Qwen3-VL-8B | 59.9 | 68.4 | 52.2 | 33.1 | 58.3 | 68.0 | 49.3 |
| Conan-7B | 55.5 | 62.8 | 54.5 | 38.2 | 59.2 | 64.0 | 25.4 |
| LongVT-RFT-7B | 59.5 | 66.0 | 54.7 | 37.9 | 59.4 | 63.4 | 23.4 |
| SAGE-7B | 44.1 | 52.4 | 37.4 | 31.8 | 49.7 | 55.7 | 28.9 |
| VideoZoomer-7B | 45.3 | 48.3 | 39.6 | 22.9 | 46.2 | 61.6 | † |
| ParaVT-8B (Ours) | 62.1 | 69.4 | 60.4 | 39.8 | 65.0 | 68.6 | 50.1 |
Higher is better. ParaVT tops 6 of 7 columns and is within 0.2 pt of the best on MLVU. Avg over the seven columns: 59.3 for ParaVT vs. 55.6 for the Qwen3-VL-8B base. The +7.9% relative gain is the mean of the seven per-column relative gains; the ratio of the two averages is +6.7%.
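The averages quoted above can be rechecked from the ParaVT-8B and Qwen3-VL-8B rows. The sketch below copies the seven scores from the table and computes the relative gain two ways: as the ratio of the two column averages, and as the mean of the per-column relative gains. The second reproduces the quoted +7.9%. Variable names are illustrative.

```python
# Scores copied from the ParaVT-8B and Qwen3-VL-8B rows, in column order.
paravt = [62.1, 69.4, 60.4, 39.8, 65.0, 68.6, 50.1]
base = [59.9, 68.4, 52.2, 33.1, 58.3, 68.0, 49.3]


def mean(xs):
    return sum(xs) / len(xs)


avg_paravt = mean(paravt)
avg_base = mean(base)

# Relative gain of the averages.
ratio_of_means = 100 * (avg_paravt / avg_base - 1)

# Average of the per-column relative gains.
mean_of_ratios = 100 * mean([p / b - 1 for p, b in zip(paravt, base)])

print(f"avg ParaVT = {avg_paravt:.1f}, avg base = {avg_base:.1f}")
print(f"ratio of averages: +{ratio_of_means:.1f}%")
print(f"mean per-column gain: +{mean_of_ratios:.1f}%")
```

The two summaries differ because the per-column version weights large relative gains on low-scoring columns more heavily.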
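Each row reports mean training-time format reward fτ at sampling temperature τ=0.7 and mean training-time tool-call rate per rollout κ (a count, so it can exceed 1), followed by scores on four benchmarks; for Qwen3-VL-8B and the full recipe these match the first, second, fourth, and fifth columns of the main results table. Curves on the right plot the same four runs step-by-step. Best result in bold; rows shaded grey mark the full PARA-GRPO recipe.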
| (A) Training Stage | ||||||
| Qwen3-VL-8B | 0.03 | 0.45 | 59.9 | 68.4 | 33.1 | 58.3 |
| + SFT Cold-Start | 0.13 | 2.50 | 60.7 | 69.0 | 39.1 | 63.7 |
| + SFT + GRPO | 0.13 | 0.02 | 62.0 | 68.6 | 39.3 | 64.5 |
| + SFT + PARA-GRPO | 0.41 | 0.21 | 62.1 | 69.4 | 39.8 | 65.0 |
| (B) Component Effectiveness | ||||||
| SFT + GRPO | 0.13 | 0.02 | 62.0 | 68.6 | 39.3 | 64.5 |
| + Exploration Anchoring | 0.35 | 0.19 | 61.7 | 68.7 | 39.3 | 64.1 |
| + nFrames Gating | 0.10 | 1.36 | 61.3 | 68.7 | 39.1 | 63.6 |
| Full PARA-GRPO | 0.41 | 0.21 | 62.1 | 69.4 | 39.8 | 65.0 |
| − Tool Reward Rtool | 0.33 | 0.04 | 61.9 | 68.5 | 38.7 | 64.3 |
| − Penalty Term γ | 0.36 | 0.27 | 61.6 | 69.0 | 38.8 | 64.5 |
| (C) Dispatch Mode | ||||||
| Sequential Tool Calling | — | — | 61.4 | 68.8 | 37.5 | 64.1 |
| Parallel Tool Calling | — | — | 62.1 | 69.4 | 39.8 | 65.0 |
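The two diagnostics in the first two numeric columns can be made concrete with a small sketch. It assumes a binary parseability check as the format reward (balanced `<think>`, `<tool_call>`, and `<answer>` tags, with exactly one closing answer) and counts complete `<tool_call>` blocks for κ. The actual PARA-GRPO format reward is described as targeting specific structural-token positions, so this is a simplified stand-in, and all function names are hypothetical.

```python
import re

TAGS = ("think", "tool_call", "answer")


def is_parseable(rollout):
    """Binary format check (assumed): balanced tags and one final <answer>."""
    for tag in TAGS:
        if rollout.count(f"<{tag}>") != rollout.count(f"</{tag}>"):
            return False
    return rollout.count("<answer>") == 1 and rollout.rstrip().endswith("</answer>")


def count_tool_calls(rollout):
    """Number of complete <tool_call>...</tool_call> blocks in one rollout."""
    return len(re.findall(r"<tool_call>.*?</tool_call>", rollout, flags=re.DOTALL))


def batch_metrics(rollouts):
    """Return (mean format reward, mean tool calls per rollout) for a batch."""
    n = len(rollouts)
    f_tau = sum(is_parseable(r) for r in rollouts) / n
    kappa = sum(count_tool_calls(r) for r in rollouts) / n
    return f_tau, kappa


rollouts = [
    "<think>a</think><tool_call>{}</tool_call><tool_call>{}</tool_call><answer>B</answer>",
    "<think>a</think><answer>C</answer>",
    "<think>a<tool_call>{}</tool_call><answer>A</answer>",  # unclosed <think>
    "<think>a</think><tool_call>{}<answer>D</answer>",  # unclosed <tool_call>
]
print(batch_metrics(rollouts))
```

Because κ is a per-rollout count rather than a fraction, a model that emits several calls per turn can score above 1, as the SFT cold-start row does.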
Reading the table. Block (A) shows that SFT cold-start mostly transfers tool-calling behavior (κ: 0.45 → 2.50) but leaves format reward low (fτ=0.13), and that vanilla GRPO on top of cold-start collapses κ to 0.02 while fτ stays at 0.13. PARA-GRPO recovers fτ to 0.41 and κ to 0.21, and posts the best score on every benchmark column. Block (B) confirms neither component alone suffices: Exploration Anchoring on its own holds format up (fτ=0.35) but suppresses tool calls (κ=0.19); nFrames Gating on its own keeps tools active (κ=1.36) but tanks format (fτ=0.10). Only their composition wins on both axes. Block (C) shows the same trained checkpoint dispatched in parallel beats sequential calling by +0.6 to +2.3 across benchmarks, with no extra training.
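The Block (C) gap can be recomputed directly from the two dispatch-mode rows. The snippet copies the four benchmark scores from the table and takes the per-column difference; column names are left out because the header row is not part of this text.

```python
# Scores copied from Block (C), in column order.
sequential = [61.4, 68.8, 37.5, 64.1]
parallel = [62.1, 69.4, 39.8, 65.0]

# Per-column gain of parallel over sequential dispatch, rounded to one decimal.
deltas = [round(p - s, 1) for p, s in zip(parallel, sequential)]

print(deltas)  # [0.7, 0.6, 2.3, 0.9]
print(f"range: +{min(deltas)} to +{max(deltas)}")
```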
ParaVT builds on LongVT (CVPR 2026) for native video tool calling, lmms-engine for cold-start SFT, AReaL for RL training, and lmms-eval for evaluation. Page template adapted from WorldReasonBench.