What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom
TL;DR: Vision tool-use RL enhances model performance by reducing tool-induced harm, but does not significantly improve tool-based correction of intrinsic failures.
This repository provides the MED (Measure-Explain-Diagnose) framework for analyzing vision tool-use reinforcement learning. We decompose performance improvements into intrinsic capability changes and tool-induced effects, providing fine-grained insights into what vision RL truly learns.
- Performance gains are primarily driven by intrinsic learning - Models improve their base reasoning capabilities
- Tool-use RL mainly reduces tool-induced harm - Reduces errors from tool invocation and weakens tool pattern interference
- Limited improvement in tool-based correction - Tools don't significantly improve correction of intrinsic failures
- Current vision RL learns to "safely coexist with tools" - Rather than fully mastering their strategic use
The MED framework provides a coarse-to-fine analysis of vision tool-use reinforcement learning through three sequential steps:
- Measure - Quantify tool-induced drift by decomposing tool-available drift into intrinsic and tool-induced components
- Explain - Decompose tool-induced performance gap into Gross Gain and Gross Harm via 4-term analysis
- Diagnose - Factorize each term into Mass, Policy, and Quality to probe root causes of term evolution
This repository contains the core methodology from our paper (Section 3), including:
- 4-term decomposition - Call Gain, Schema Gain, Call Harm, Schema Harm
- Factor analysis - Decompose each term into Mass (domain size), Policy (when to call), Quality (how to use)
- Visualization tools - Generate all figures (Measure, Explain, Diagnose) from the paper
# Install uv package manager (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone the repository
git clone https://github.com/GAIR-NLP/Med.git
cd Med
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -e .
# For training environment (includes torch, transformers, flash-attn, etc.)
uv pip install -e ".[train]"
Requirements: Python 3.11+, uv package manager
The training dataset (~15k samples from 12 data sources) is available on HuggingFace:
hf download Med2026/Med_training_data --repo-type dataset --local-dir data/Med_training_data/
For distributed training, set up a Ray cluster. Here is an example for a 2-node cluster, each with 8 GPUs.
Start the Head Node: Run this command on your designated head node. The dashboard will be accessible at http://<head_node_ip>:8265.
ray start --head --dashboard-host=0.0.0.0
Note down the address provided (e.g., xxxxxx:6379).
Start Worker Node(s): Run this command on each worker node, replacing xxxxxx:6379 with the address from the head node.
ray start --address=xxxxxx:6379
Verify Cluster Status: On the head node, run ray status to confirm that all nodes have joined and all GPUs (16 in this example) are detected.
The reward server is a remote FastAPI service used to calculate reward values during training.
To start the reward server:
bash recipe/med/scripts/reward_server.sh
- PORT: Specifies the network port on which the reward server will listen for incoming requests.
- WORKERS: Sets the number of worker processes for the server.
Upon successful launch, a file named with a unique JOB_ID will be created in the .reward_server/ directory. This file contains the IP address and port of the running server (e.g., your_server_ip:8192).
Note: Take note of this JOB_ID, as it is required for configuring REMOTE_REWARD_JOB_ID in the training script.
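If a script needs the server address directly, the job file can be read programmatically. The sketch below assumes the file is named exactly after the JOB_ID and holds a single `host:port` line, as the example above suggests. `read_reward_server_address` is a hypothetical helper for illustration, not part of this repository.

```python
from pathlib import Path


def read_reward_server_address(job_id, root=".reward_server"):
    """Return (host, port) parsed from the job file for `job_id`.

    Assumption: the file is named after the JOB_ID and contains one
    `host:port` line. Adjust this if the actual file layout differs.
    """
    text = (Path(root) / job_id).read_text().strip()
    # Split on the last colon so the host part is kept intact.
    host, sep, port = text.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"unexpected reward server address: {text!r}")
    return host, int(port)
```

Failing loudly on a malformed file makes a misconfigured reward server visible before training starts rather than midway through a run.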
For a comprehensive list of all configurable parameters and hyperparameters, please refer to recipe/med/scripts/train.sh. Before running experiments, configure the following environment variables to match your setup.
Set Base Directory and Python Path: Point BASE_DIR to your cloned repository root so that all scripts can locate configs and modules correctly.
export BASE_DIR="/path/to/Med"
export PYTHONPATH=${BASE_DIR}/:${PYTHONPATH}
Set Node and GPU Counts: Adjust these values based on your actual cluster configuration (e.g., for 2 nodes with 8 GPUs each):
export NUM_NODES=2
export GPUS_PER_NODE=8
Configure Reward Server Job ID: Set REMOTE_REWARD_JOB_ID to the identifier(s) of your previously launched reward server(s). This enables the training pipeline to locate the reward server's address.
export REMOTE_REWARD_JOB_ID="j-xxxxxxxxxx"
Set Training Data:
export DATA_TRAIN_FILE="[/path/to/your/data/Med_training_data/train-00000-of-00030.parquet]"
Model Loading and Checkpointing: Configure paths for loading initial model weights and saving training states, along with the save frequency.
- ACTOR_LOAD_PATH: Path to the initial model checkpoint to load.
- TRAIN_SAVE_FREQ: Frequency to save the training state (e.g., 5 for every 5 steps, -1 to disable saving).
- TRAIN_SAVE_PATH: Directory where training checkpoints will be stored.
export ACTOR_LOAD_PATH="/path/to/Qwen2.5-VL-7B-Instruct"
export TRAIN_SAVE_FREQ=10
export TRAIN_SAVE_PATH="/path/to/checkpoints"
Set Wandb API Key: Required for logging training metrics to Weights & Biases.
export WANDB_API_KEY="your-wandb-api-key"
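Before launching, it can help to confirm that every variable named in this section is set. The following is a minimal sketch of such a check. The variable names come from this section, but treating all of them as mandatory is an assumption, and `missing_vars` is a hypothetical helper rather than part of the repository.

```python
import os

# Variables named in this section; treating all as required is an assumption.
REQUIRED_VARS = [
    "BASE_DIR",
    "NUM_NODES",
    "GPUS_PER_NODE",
    "REMOTE_REWARD_JOB_ID",
    "DATA_TRAIN_FILE",
    "ACTOR_LOAD_PATH",
    "TRAIN_SAVE_FREQ",
    "TRAIN_SAVE_PATH",
    "WANDB_API_KEY",
]


def missing_vars(env=None):
    """Return the names in REQUIRED_VARS that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All training variables are set.")
```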
Start Training: First serve the vision tool, then launch the training script. The entry point recipe/med/scripts/run.sh handles this sequence automatically:
bash recipe/med/scripts/run.sh
This script will:
- Verify Ray cluster status
- Start the vision tool server (recipe/med/scripts/serve_vision_tool.sh)
- Launch the training pipeline (recipe/med/scripts/train.sh)
The evaluation dataset is available on HuggingFace:
hf download Med2026/Med_eval_data --repo-type dataset --local-dir data/Med_eval_data/
For the full list of evaluation parameters, please refer to recipe/med/scripts/eval.sh.
Evaluation shares the same infrastructure setup as training:
- Reuse the Ray cluster setup from the Training section
- Reuse the same reward server and REMOTE_REWARD_JOB_ID
- Reuse the common environment setup for BASE_DIR, PYTHONPATH
- Set NUM_NODES and GPUS_PER_NODE to match your cluster
Set Model and Output Paths: Specify the trained checkpoint to evaluate, the verifier model path, and the output directory for evaluation results.
export ACTOR_LOAD_PATH="/path/to/your/checkpoint"
export VERIFICATION_LOAD_PATH="/path/to/your/verification_llm/checkpoint"
export OUTPUT_DIR="/path/to/evaluation_results"
export EXP_NAME="your_eval_name"
Set Evaluation Data: DATA_VAL_FILE accepts one or more parquet files. In the example below, tool_agent denotes tool-available evaluation and single_turn denotes tool-free evaluation.
export DATA_VAL_FILE="[/path/to/Med_eval_data/vstar_bench_tool_agent_format_0.0_length_0.0_maxlen_10564_num_191.parquet,/path/to/Med_eval_data/vstar_bench_single_turn_format_0.0_length_0.0_maxlen_10564_num_191.parquet]"
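Writing the bracketed list by hand gets tedious with several benchmarks. As a rough sketch, the value can be assembled from the files on disk. It assumes that the tool_agent and single_turn files for a benchmark share a common filename prefix, as in the example above. `build_val_file_list` is a hypothetical helper, not a script shipped with the repository.

```python
from pathlib import Path


def build_val_file_list(data_dir, benchmark):
    """Return a bracketed, comma-separated list of parquet paths.

    Collects every `<benchmark>_*.parquet` file in `data_dir`, which picks
    up both the tool_agent and single_turn variants when they share the
    same prefix (an assumption based on the example filenames).
    """
    paths = sorted(Path(data_dir).glob(f"{benchmark}_*.parquet"))
    if not paths:
        raise FileNotFoundError(f"no parquet files for {benchmark!r} in {data_dir}")
    return "[" + ",".join(str(p) for p in paths) + "]"
```

The returned string can then be exported as DATA_VAL_FILE before running the evaluation script.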
Start Evaluation: First serve the vision tool, then run the evaluation script.
bash recipe/med/scripts/serve_vision_tool.sh
bash recipe/med/scripts/eval.sh
The evaluation outputs will be written to OUTPUT_DIR (default: ./evaluation_results).
Reproducing Paper Figures
Step 1: Download Evaluation Logs
Download the evaluation logs from HuggingFace:
# Using HuggingFace CLI
hf download Med2026/Med-eval-logs --repo-type dataset --local-dir evals/
# Or using Python API
from huggingface_hub import snapshot_download
snapshot_download(repo_id="Med2026/Med-eval-logs", repo_type="dataset", local_dir="evals/")
This downloads evaluation results for 6 perception benchmarks across 21 training checkpoints:
- VStar
- HRBench (4k)
- HRBench (8k)
- VisualProb (easy)
- VisualProb (medium)
- VisualProb (hard)
Step 2: Generate CSV Data
Extract metrics from evaluation logs:
bash scripts/run_create_csv.sh
This creates CSV files in each evaluation log directory, containing performance metrics, the 4-term decomposition, and the factor analysis across all checkpoints.
Step 3: Generate Paper Figures
Generate all figures using the plotting script:
bash scripts/run_plot_paper_figures.sh
This generates two types of figures in the figures/ directory:
Aggregated figures (averaged across all 6 benchmarks):
- {exp_name}_measure.pdf - MEASURE: Intrinsic vs tool-induced drift over training
- {exp_name}_explain.pdf - EXPLAIN: 4-term decomposition (Call/Schema Gain/Harm)
- {exp_name}_diagnose.pdf - DIAGNOSE: Factor analysis (Mass × Policy × Quality)
Per-benchmark figures (individual benchmark breakdowns):
- {exp_name}_per_bench_exp{N}_measure.pdf - MEASURE for each benchmark
- {exp_name}_per_bench_exp{N}_explain.pdf - EXPLAIN for each benchmark
- {exp_name}_per_bench_exp{N}_diagnose.pdf - DIAGNOSE for each benchmark
Understanding the Results
The MED framework provides three levels of analysis, each visualized in separate figures:
MEASURE: Quantifying Drift Components
The MEASURE figure decomposes tool-available drift $f_{w}(t)$ into two components:
- Grey area: Intrinsic drift $f_{wo}(t)$ - performance change without tool access
- Colored area: Tool-induced drift $\Delta_{tool}(t)$ - change in tool-induced performance gap
  - Green: positive relative gain ($f_{w} > f_{wo}$)
  - Red: negative relative drift ($f_{wo} > f_{w}$)
  - Color intensity: tool call rate
Tool contribution ratio $S_{tool}$ (top progress bar): fraction of total drift magnitude from tool effects
Key finding: Tool-induced effects account for only ~20-30% of total improvement. Most gains come from intrinsic capability improvements.
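To make the decomposition concrete, the sketch below computes the two drift components from per-checkpoint accuracies. Two points are assumptions rather than statements from this README: drift is measured relative to the first checkpoint, and the tool contribution ratio is taken as |tool-induced drift| / (|intrinsic drift| + |tool-induced drift|). The function name and the sample accuracies are illustrative only.

```python
def measure_drift(acc_w, acc_wo):
    """Decompose tool-available drift at each checkpoint.

    acc_w / acc_wo: accuracy per checkpoint with / without tool access.
    Assumptions (not confirmed by this README): drift is relative to the
    first checkpoint, and s_tool = |d_tool| / (|f_wo| + |d_tool|).
    """
    gap0 = acc_w[0] - acc_wo[0]  # initial tool-induced gap G(0)
    rows = []
    for a_w, a_wo in zip(acc_w, acc_wo):
        f_w = a_w - acc_w[0]          # tool-available drift
        f_wo = a_wo - acc_wo[0]       # intrinsic drift
        d_tool = (a_w - a_wo) - gap0  # tool-induced drift, G(t) - G(0)
        total = abs(f_wo) + abs(d_tool)
        s_tool = abs(d_tool) / total if total else 0.0
        rows.append({"f_w": f_w, "f_wo": f_wo, "d_tool": d_tool, "s_tool": s_tool})
    return rows


if __name__ == "__main__":
    # Made-up accuracies for two checkpoints, for illustration only.
    for row in measure_drift([0.50, 0.70], [0.55, 0.70]):
        print(row)
```

By construction, the tool-available drift equals the sum of the intrinsic and tool-induced drifts at every checkpoint, which is what lets the figure stack the two areas.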
EXPLAIN: 4-Term Decomposition
The EXPLAIN figure decomposes the tool-induced performance gap $G(t) = \mathrm{Acc}_{w}(t) - \mathrm{Acc}_{wo}(t)$ into:
Gross Gain (green, positive contributions):
- Call Gain (Term 1): Intrinsic failures corrected by tool execution
- Schema Gain (Term 2): Intrinsic failures recovered under tool schema without invocation
Gross Harm (red, negative contributions):
- Call Harm (Term 3): Intrinsic successes lost due to tool calls
- Schema Harm (Term 4): Intrinsic successes lost under tool schema without invocation
Net gap G(t) (yellow diamonds): Call Gain + Schema Gain - Call Harm - Schema Harm
Key finding: Gross Gain stagnates (Call Gain plateaus) while Gross Harm decreases consistently, indicating RL primarily reduces tool-induced harm rather than maximizing tool-based correction.
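The four terms can be illustrated with per-sample outcomes at a single checkpoint. This is a minimal sketch that assumes each sample has one deterministic outcome per protocol, recorded as a `(correct_wo, correct_w, called)` tuple. That record layout and the function name are hypothetical and do not reflect the format of the released evaluation logs.

```python
from collections import Counter


def four_terms(records):
    """Compute the 4-term decomposition from per-sample outcomes.

    Each record is (correct_wo, correct_w, called):
      correct_wo - correct in the tool-free run
      correct_w  - correct in the tool-available run
      called     - whether the tool was invoked in the tool-available run
    """
    if not records:
        raise ValueError("records must not be empty")
    counts = Counter()
    for correct_wo, correct_w, called in records:
        if correct_wo == correct_w:
            continue  # unchanged outcomes do not contribute to the gap
        kind = "gain" if correct_w else "harm"
        route = "call" if called else "schema"
        counts[f"{route}_{kind}"] += 1
    n = len(records)
    terms = {k: counts[k] / n
             for k in ("call_gain", "schema_gain", "call_harm", "schema_harm")}
    # Net gap: gross gain minus gross harm.
    terms["gap"] = (terms["call_gain"] + terms["schema_gain"]
                    - terms["call_harm"] - terms["schema_harm"])
    return terms
```

Only samples whose correctness flips between the two protocols enter any term, so the net gap returned here matches the difference between tool-available and tool-free accuracy on the same records.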
DIAGNOSE: Factor Analysis
The DIAGNOSE figure factorizes each of the four terms into:
- Mass (grey): Domain size P(D) - capacity for gain/harm
- Policy (blue): Calling probability P(call|D) - when to use the tool
- Quality (orange): Success rate P(✓|call,D) - how well the tool is used
Thick line: Term value (left axis)
Thin lines: Individual factors (right axis)
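As an illustration of the factorization, the sketch below computes Mass, Policy, and Quality for the two call terms from per-sample outcomes. It assumes hypothetical `(correct_wo, correct_w, called)` records with one deterministic outcome per protocol, and it covers only Call Gain and Call Harm, since this README does not spell out the Policy factor for the schema terms.

```python
def factorize_call_term(records, gain=True):
    """Factorize Call Gain (gain=True) or Call Harm (gain=False).

    Each record is (correct_wo, correct_w, called); this layout is an
    assumption made for illustration.
    Mass    = P(D): share of samples in the domain (intrinsic failures
              for gain, intrinsic successes for harm)
    Policy  = P(call | D)
    Quality = P(outcome flips | call, D)
    """
    n = len(records)
    # Gain domain: tool-free failures. Harm domain: tool-free successes.
    domain = [r for r in records if r[0] != gain]
    called = [r for r in domain if r[2]]
    # Gain: tool-available run is correct. Harm: tool-available run is wrong.
    flipped = [r for r in called if r[1] == gain]
    mass = len(domain) / n if n else 0.0
    policy = len(called) / len(domain) if domain else 0.0
    quality = len(flipped) / len(called) if called else 0.0
    return {"mass": mass, "policy": policy, "quality": quality,
            "term": mass * policy * quality}
```

The product of the three factors recovers the term value, so a change in a term over training can be attributed to a shrinking domain, a change in how often the tool is called, or a change in how well it is used.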
Key findings:
- Limited failure correction: Call Gain quality P(✓|call, failures) shows little improvement on current and persistent failure sets
- Reduced breakage: Call Harm quality P(✗|call, successes) decreases, indicating fewer errors on already-solved instances
- Schema interference mitigation: Schema Harm decreases as the model becomes less sensitive to the tool prompt
Current vision tool-use RL learns to safely coexist with tools rather than master them:
- Tool effects account for only a minority of the total improvement (~20-30%); intrinsic improvements account for the rest
- RL primarily reduces harm (fewer tool-induced errors) rather than increasing gain (better failure correction)
- Models improve at not breaking existing capabilities, but show limited progress in using tools to fix hard cases
If you find this work helpful, please cite our paper:
@article{ma2026does,
title={What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom},
author={Ma, Yan and Zhang, Weiyu and Li, Tianle and Du, Linge and Shen, Xuyang and Liu, Pengfei},
journal={arXiv preprint arXiv:2602.01334},
year={2026}
}
We are progressively open-sourcing components of the MED project:
- Evaluation logs - Available at HuggingFace
- Analysis code - MED framework implementation (recipe/med/analysis_plot/)
- Training data - Available at HuggingFace
- Training code - GRPO-based RL training pipeline (recipe/med/)
- Evaluation data - Available at HuggingFace
- Evaluation code - Evaluation pipeline for tool-free and tool-available protocols (recipe/med/eval/, recipe/med/scripts/eval.sh)
Stay tuned for updates!
This project is licensed under the MIT License - see the LICENSE file for details.