View all files | ||||
Official code release for the paper
When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning. Xiaogeng Liu, Xinyan Wang, Yingzi Ma, Yechao Zhang, Chaowei Xiao.
This repository reproduces the four training methods (OPSD, EOPD, REOPOLD, PW-OPSD) and the long-context reasoning evaluation suite (MATH-500, AIME 2024, AIME 2025, HMMT 2025) on three base models: Qwen3-4B, DeepSeek-R1-Distill-Llama-8B, and Olmo-3-7B-Think.
Entries report mean ± across-seed sample standard deviation over three evaluation seeds. The Avg@12 column is the equal-weight mean of the four per-benchmark Avg@12 values.
| OPSD | 95.33 ± 0.08 | 75.19 ± 0.42 | 66.67 ± 1.27 | 43.89 ± 0.73 | 70.27 |
| EOPD | 95.33 ± 0.08 | 73.61 ± 2.17 | 65.65 ± 0.89 | 41.94 ± 2.00 | 69.13 |
| REOPOLD | 95.09 ± 0.09 | 73.98 ± 0.42 | 62.13 ± 1.67 | 41.39 ± 1.44 | 68.15 |
| PW-OPSD Moderate (ours) | 95.34 ± 0.10 | 76.20 ± 0.58 | 67.78 ± 1.27 | 43.33 ± 1.21 | 70.66 |
| PW-OPSD Aggressive (ours) | 95.53 ± 0.04 | 75.19 ± 0.85 | 67.59 ± 0.85 | 45.37 ± 0.58 | 70.92 |
| OPSD | 89.02 ± 0.19 | 40.74 ± 2.58 | 31.57 ± 1.28 | 20.93 ± 0.80 | 45.56 |
| EOPD | 88.84 ± 0.30 | 43.06 ± 1.69 | 30.65 ± 0.58 | 20.93 ± 1.43 | 45.87 |
| REOPOLD | 88.59 ± 0.36 | 43.43 ± 2.74 | 30.65 ± 1.40 | 18.98 ± 0.42 | 45.41 |
| PW-OPSD Moderate (ours) | 88.72 ± 0.25 | 41.20 ± 0.32 | 32.22 ± 0.56 | 21.48 ± 0.42 | 45.91 |
Across both base models, PW-OPSD with the Moderate schedule (w_min, tau, s) = (0.25, 0.30, 0.10) (held fixed across models) delivers the highest Avg@12 among the four compared methods.
| OPSD | opsd | Uniform forward-KL self-distillation (Zhao et al., 2026). |
| EOPD | eopd | Entropy-conditioned RKL/FKL mixture (Jin et al., 2026). |
| REOPOLD | reopold | Relaxed on-policy distillation, policy-gradient form (Ko et al., 2026). |
| PW-OPSD | pwopsd | Ours — position-weighted FKL with per-sequence reduction. |
The launchers resolve REPO_ROOT automatically from the launcher script's own location (one directory above launchers/), so they should be run from the repository top level or via an absolute path:
Each launcher invokes ea_opsd_train.py with the shared hyperparameters (LoRA r=64, alpha=128, lr=5e-6, distillation temperature 1.1, forward-KL clip 0.05, fixed teacher) and dispatches to the corresponding method via the --method flag. The PW-OPSD launcher additionally passes the Moderate schedule --position_w_min 0.25 --position_tau 0.30 --position_s 0.10, and the REOPOLD launcher passes the published reward-floor and phase-mask hyperparameters.
All trainings use effective batch size 32 on 4 GPUs and run for 150 optimizer steps with a save every 25 steps. The launchers merge and evaluate checkpoint-100 — the paper protocol selects the step-100 snapshot for all reported numbers. The per-GPU batch / gradient-accumulation split is chosen per base-model size (Qwen3-4B uses bs=4 x ga=2, DSR1-L8B and Olmo-3-7B-Think use bs=2 x ga=4). The Moderate position schedule itself is identical across the three models.
Each launcher trains the LoRA adapter at outputs/<run_name>/ and then merges it into a single HuggingFace-loadable checkpoint at outputs/<run_name>_merged/, then writes a .ready_for_eval sentinel. <run_name> follows {model_tag}_{method} (e.g. qwen3_4b_base_pwopsd, dsr1_l8b_pwopsd, olmo3_7b_think_pwopsd).
DSR1-Llama-8B and Olmo-3-7B-Think are released as reasoning-distilled models whose chat templates unconditionally append an open <think> tag and ignore enable_thinking=False. To match OPSD's non-thinking-mode prompt format on these bases, the data collator detects an unclosed trailing <think> and appends </think>, restoring the intended enable_thinking=False semantics. The patch is a no-op for templates (e.g. Qwen3-4B base) that already honor the flag.
Accepted <method_name> values are any of the <run_name>s listed above; accepted <dataset> values are math500, aime24, aime25, hmmt25. The eval launcher writes JSONs of the form eval_results/<method_name>_38k.json (or eval_results/<method_name>_<dataset>_38k.json) and per-run logs under eval_results/logs/. Override HOSTLABEL to disambiguate logs across parallel workers.
evaluate_math.py reports Pass@12 / Avg@12 / Maj@12 with math-equivalence clustering (sympy + math_verify). The INVALID cluster participates in the plurality count and is scored incorrect when selected (formal definitions in the paper appendix).
recompute_majn.py re-aggregates Maj@12 over existing eval JSONs using the same fixed clustering rule.
The launchers default HF_HUB_OFFLINE=0 and TRANSFORMERS_OFFLINE=0, so Hugging Face downloads happen on first use; export HF_HUB_OFFLINE=1 / TRANSFORMERS_OFFLINE=1 to force the offline cache once the assets are downloaded. The training launchers pull the base model identified by MODEL=... inside the script (Qwen/Qwen3-4B, deepseek-ai/DeepSeek-R1-Distill-Llama-8B, or allenai/Olmo-3-7B-Think) and the dataset selected by PWOPSD_TRAIN_DATASET. The MATH-500, AIME 2024, and AIME 2025 benchmarks are downloaded automatically by evaluate_math.py from their Hugging Face dataset IDs; only HMMT 2025 needs the local parquet described above.
All paper experiments use 4×H100 80GB.
This code is released under the MIT License; see LICENSE for the full text. Upstream model and dataset assets remain under their original licenses as declared on Hugging Face and GitHub.