Yiqiao Jin¹*, Yiyang Wang¹*, Lucheng Fu¹, Yijia Xiao², Yinyi Luo³, Haoxin Liu¹, B. Aditya Prakash¹, Josiah Hester¹, Jindong Wang⁴†, Srijan Kumar¹†
¹ Georgia Institute of Technology · ² UCLA · ³ Carnegie Mellon University · ⁴ William & Mary
* Equal contribution · † Corresponding authors
Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a Unified framework to systematically study Self-Distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSD*, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 and the strongest baseline by +2.8. Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.
UniSD is built from five complementary mechanisms that can be enabled independently or composed into the integrated UniSD* recipe.
| Component | Mode | Key flags |
| --- | --- | --- |
| 🤝 Multi-Teacher Agreement (sequence-level) | `agreement_seq_{random,retrieval,induction}` | `--num-auxiliary-contexts`, `--gamma_agreement` |
| 🎯 Multi-Teacher Agreement (token-level) | `agreement_tok_{random,retrieval,induction}` | `--num-auxiliary-contexts`, `--gamma_agreement`, `--agreement_stat` |
| 🌊 EMA Teacher Stabilization | `ema` | `--ref_model_sync_steps`, `--ref_model_mixup_beta` |
| ⚖️ Token-Level Contrastive Learning | `contrastive` | `--contrastive_weight`, `--contrastive_margin` |
| 🧠 Feature Matching | `match_joint` / `match_repr` | `--final_layer_distill_weight` |
| ✂️ Divergence Clipping (JSD-Clip) | `clip` | `--alpha`, `--token_clip` |
| ⭐ UniSD* (integrated) | `unisd_star` | combines EMA + matching + contrastive + agreement |
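To make two of these mechanisms concrete, here is a minimal, dependency-free sketch of an EMA teacher update and a clipped per-token Jensen-Shannon divergence. It is an illustration only, not the repository's implementation: the function names are hypothetical, and the way `beta`, `alpha`, and `token_clip` are used here is an assumed reading of `--ref_model_mixup_beta`, `--alpha`, and `--token_clip`, whose exact semantics are not described above.

```python
import math


def ema_update(teacher, student, beta):
    """Blend student weights into the teacher: t <- (1 - beta) * t + beta * s.

    Assumption: `beta` plays the role of --ref_model_mixup_beta.
    """
    return {name: (1.0 - beta) * teacher[name] + beta * student[name]
            for name in teacher}


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def clipped_jsd(student_probs, teacher_probs, alpha=0.5, token_clip=None):
    """Mean per-token generalized JSD, each token's value capped at token_clip.

    Assumption: `alpha` weights the mixture and `token_clip` is a per-token cap.
    """
    assert 0.0 < alpha < 1.0, "alpha must lie strictly between 0 and 1"
    per_token = []
    for p, q in zip(student_probs, teacher_probs):
        # Mixture distribution between the student (p) and the teacher (q).
        m = [alpha * pi + (1.0 - alpha) * qi for pi, qi in zip(p, q)]
        d = alpha * kl(p, m) + (1.0 - alpha) * kl(q, m)
        if token_clip is not None:
            d = min(d, token_clip)  # bound the influence of outlier tokens
        per_token.append(d)
    return sum(per_token) / len(per_token)


# Toy usage: one scalar "weight" and a two-token sequence over a 2-word vocab.
teacher = ema_update({"w": 1.0}, {"w": 0.0}, beta=0.1)
loss = clipped_jsd([[1.0, 0.0], [0.5, 0.5]],
                   [[0.0, 1.0], [0.5, 0.5]],
                   token_clip=0.1)
```

In this toy setup the first token has maximally different distributions, so without the cap it would dominate the average; clipping limits how much a single token can contribute, which is the intuition behind divergence clipping as a stability mechanism.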
UniSD targets Python 3.12 + CUDA 12.8 (cu128 wheels). The install has a few prerequisite steps before the final `pip install -r requirements.txt`, because (a) PyTorch's cu128 build lives on the PyTorch wheel index and (b) flash-attention-2 must be compiled against the installed `torch`.
💡 Don't have `/usr/local/cuda-12.6`? Any CUDA 12.x toolkit (12.4–12.8) works. Run `ls -d /usr/local/cuda-12*` to see what's available and set `CUDA_HOME` to that path.
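If you prefer to script that lookup, the following sketch scans the same `/usr/local/cuda-12*` pattern and keeps only toolkits in the 12.4–12.8 range mentioned above. The helper names are hypothetical, and picking the newest match is an assumption made for illustration; any toolkit in the range should work.

```python
import glob
import re


def cuda_version_key(path):
    """Extract (major, minor) from a path such as /usr/local/cuda-12.6."""
    match = re.search(r"cuda-(\d+)\.(\d+)$", path)
    return (int(match.group(1)), int(match.group(2))) if match else (-1, -1)


def pick_cuda_home(candidates):
    """Return the newest CUDA 12.4-12.8 toolkit path, or None if none fits."""
    usable = [p for p in candidates
              if (12, 4) <= cuda_version_key(p) <= (12, 8)]
    return max(usable, key=cuda_version_key, default=None)


if __name__ == "__main__":
    # Prints a path to export as CUDA_HOME, or None if no toolkit was found.
    print(pick_cuda_home(glob.glob("/usr/local/cuda-12*")))
```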
⚠️ trl ↔ vLLM compatibility: this environment ships `trl==1.4.0` (officially supports vLLM 0.12.0–0.18.0) with `vllm==0.20.2`. The combination works in our smoke tests, but trl prints a warning at import time. If you hit a runtime error from `VLLMClient`, pin `vllm<0.19`.
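A quick way to see which side of that boundary an environment is on is to compare the installed vLLM version against the range. The sketch below uses only the standard library; the helper names are hypothetical, and the upper bound mirrors the suggested `vllm<0.19` pin rather than anything trl checks internally.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.20.2' into (0, 20, 2); trailing non-digit suffixes are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def vllm_in_trl_range(version, low="0.12.0", below="0.19.0"):
    """True if low <= version < below, the range assumed safe for trl here."""
    return parse_version(low) <= parse_version(version) < parse_version(below)


def installed_vllm_version():
    """Return the installed vLLM version string, or None if it is missing."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_vllm_version()
    if found is None:
        print("vllm is not installed")
    elif not vllm_in_trl_range(found):
        print(f"vllm {found} is outside the range trl officially supports")
```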
Optional environment variables: `WANDB_API_KEY` (logging), `HF_TOKEN` (gated models).
UniSD provides two ways to launch training: a high-level orchestrator with sane defaults, and a direct command for full per-flag control.
`scripts/run_experiments.py` handles GPU scheduling, dependency-aware sweeps, and sensible defaults.
💡 Run `python scripts/run_experiments.py --dry-run` to preview every job before launch.
`python -m src.train.train_unisd` exposes every UniSD flag for fine-grained control.
| Placeholder | Accepted values |
| --- | --- |
| `<SUBCOMMAND>` | `agreement`, `ema`, `contrastive`, `match_joint`, `match_repr`, `clip`, `unisd_star` (= UniSD*), `induction` |
| `<MODE>` | `agreement_{seq,tok}_{random,retrieval,induction}`, `ema`, `contrastive`, `match_joint`, `match_repr`, `clip`, `unisd_star` |
| `<DATASET>` | `mbpp`, `tooluse`, `scienceqa`, `cos_e`, `medmcqa` (eval-only: `gpqa`, `humaneval`) |
| `<MODEL>` | Qwen2.5 (0.5B/1.5B/3B/7B-Instruct), Llama-3.1-8B-Instruct, Gemma-3-4B-IT, InternLM3-8B-Instruct |
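The brace notation in the `<MODE>` row is shorthand for several concrete mode names. As a small illustration, this sketch expands it into the full list; the function and variable names are hypothetical and are not part of the UniSD codebase.

```python
from itertools import product


def expand_modes():
    """Expand agreement_{seq,tok}_{random,retrieval,induction} plus the rest."""
    levels = ("seq", "tok")                          # sequence- vs token-level
    sources = ("random", "retrieval", "induction")   # auxiliary-context source
    agreement = [f"agreement_{level}_{source}"
                 for level, source in product(levels, sources)]
    single = ["ema", "contrastive", "match_joint", "match_repr",
              "clip", "unisd_star"]
    return agreement + single


if __name__ == "__main__":
    for mode in expand_modes():
        print(mode)
```

The six agreement variants plus the six single-name modes give twelve values in total.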
A few modes require a one-time cache build:
UniSD is evaluated across six benchmarks spanning four task families.
| Benchmark | Usage | Task family |
| --- | --- | --- |
| 🔬 ScienceQA | train + eval | Scientific reasoning |
| 💻 MBPP | train + eval | Code generation |
| 💭 CoS-E | train + eval | Commonsense reasoning |
| 🛠️ ToolAlpaca | train + eval | Tool usage |
| 🎓 GPQA | OOD eval | Scientific reasoning |
| 🧪 HumanEval | OOD eval | Code generation |
UniSD is validated across three model families:
Evaluation entry points live under `src/eval/`:
If you find UniSD useful in your research, please cite:
UniSD is built on top of excellent open-source work from the community: 🤗 Transformers · 🤗 TRL · vLLM · DeepSpeed · PEFT · Accelerate.
This project is released under the Apache License 2.0.