Code for the experiments in The Distillation Game: Adaptive Attacks & Efficient Defenses.
The repository provides an end-to-end pipeline for teacher-generation and student-distillation experiments. Supported teacher methods include standard decoding, antidistillation, and product-of-experts; supported student modes are passive (naive) and adaptive (strategic_fd).
The default configurations target GPU-backed runs.
Useful configs:
Each run writes a timestamped directory under outputs/ with the config snapshot, run manifest, teacher and student artifacts, and a RESULTS.md summary.
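As a minimal sketch of how the run directories could be inspected after the fact, the snippet below finds the most recent run under outputs/ and reads its RESULTS.md. The helper names `latest_run_dir` and `read_summary` are hypothetical and not part of the repository. The exact timestamp format of the directory names is not assumed, so the sketch orders runs by modification time instead of by name.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(outputs_root: str = "outputs") -> Optional[Path]:
    """Return the most recently modified subdirectory containing RESULTS.md.

    Hypothetical helper: only the presence of RESULTS.md is taken from the
    README; every other file name inside a run directory is left unassumed.
    """
    root = Path(outputs_root)
    if not root.is_dir():
        return None
    # Keep only directories that look like finished runs.
    runs = [
        d for d in root.iterdir()
        if d.is_dir() and (d / "RESULTS.md").is_file()
    ]
    if not runs:
        return None
    # Order by modification time, since the naming scheme is not assumed.
    return max(runs, key=lambda d: d.stat().st_mtime)


def read_summary(run_dir: Path) -> str:
    """Return the text of the run's RESULTS.md summary."""
    return (run_dir / "RESULTS.md").read_text(encoding="utf-8")
```

Calling `latest_run_dir()` from the repository root returns `None` when no finished run exists, so callers can distinguish "no runs yet" from a read error.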
A baseline pipeline that distills traces from frontier LLMs (OpenAI, Gemini, Claude) into the local student. It chains three stages: querying the APIs, running SFT per (provider, dataset, seed), and plotting.
Common flags: --providers, --datasets, --seeds, --num-samples, --skip-{query,sft,plot}, --plot-only, --output-dir. Set API keys at the top of src/frontier-llms/query_trace_frontier.py first.
Writes traces to traces_llms/, SFT runs to outputs/real_*/, and plots to outputs/plots/.
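The way the flags above could select among the three stages can be sketched with `argparse`. This is an illustration only, not the repository's actual parser. The flag names come from the list above, while the defaults, the value types, and the rule that --plot-only overrides the --skip-* flags are assumptions. `build_parser` and `stages_to_run` are hypothetical names.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags listed in the README (defaults assumed)."""
    p = argparse.ArgumentParser(description="Sketch of the three-stage baseline CLI")
    p.add_argument("--providers", nargs="+", default=[])
    p.add_argument("--datasets", nargs="+", default=[])
    p.add_argument("--seeds", nargs="+", type=int, default=[])
    p.add_argument("--num-samples", type=int, default=None)
    p.add_argument("--skip-query", action="store_true")
    p.add_argument("--skip-sft", action="store_true")
    p.add_argument("--skip-plot", action="store_true")
    p.add_argument("--plot-only", action="store_true")
    p.add_argument("--output-dir", default="outputs")
    return p


def stages_to_run(args: argparse.Namespace) -> List[str]:
    """Return the ordered stages implied by the parsed flags."""
    # Assumption: --plot-only takes precedence over every --skip-* flag.
    if args.plot_only:
        return ["plot"]
    stages = []
    if not args.skip_query:
        stages.append("query")
    if not args.skip_sft:
        stages.append("sft")
    if not args.skip_plot:
        stages.append("plot")
    return stages
```

Under these assumptions, parsing `--skip-query` leaves the SFT and plot stages, which matches the case where traces already exist in traces_llms/ and only training and plotting need to be rerun.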
Scores teacher traces on a 1–5 auditability rubric via the Claude API and plots a per-method probability mass function (PMF) over the scores for Standard, PoE, and ADS.
Common flags: --datasets, --model, --max-examples, --plot-only. Set API_KEY at the top of the script first.
Reads plot-quality/<dataset>/train_*.json and writes scored JSONs plus trace_quality_pmf.pdf next to them.
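The per-method PMF described above is the fraction of traces that received each rubric score. A minimal sketch of that calculation follows. It works on a plain list of integer scores, because the layout of the scored JSON files is not documented here, and `score_pmf` is a hypothetical name rather than a function from the script.

```python
from collections import Counter
from typing import Dict, Iterable

# The rubric range stated in the README.
RUBRIC_SCORES = (1, 2, 3, 4, 5)


def score_pmf(scores: Iterable[int]) -> Dict[int, float]:
    """Return the empirical PMF over rubric scores 1 to 5.

    Scores that never occur get probability 0.0, so every method's PMF has
    the same support and can be plotted on a shared axis.
    """
    scores = list(scores)
    if not scores:
        raise ValueError("at least one score is required")
    for s in scores:
        if s not in RUBRIC_SCORES:
            raise ValueError(f"score {s!r} is outside the 1-5 rubric")
    counts = Counter(scores)
    total = len(scores)
    return {s: counts[s] / total for s in RUBRIC_SCORES}
```

Applying `score_pmf` once per method (Standard, PoE, ADS) gives the three distributions that a grouped bar chart would compare.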
If you use this repository, please cite the accompanying paper:
The Distillation Game: Adaptive Attacks & Efficient Defenses.
Typical run artifacts include: