Hölder Policy Optimisation (HölderPO) replaces the fixed aggregation of token-level importance sampling ratios in GRPO with the adaptable Hölder mean (p-norm). Varying p ∈ ℝ over the course of training interpolates between sequence-level stability and token-level exploration.
Weights & Biases is opt-in: export WANDB_API_KEY=... and set USE_WB=1 when launching the script.
Knobs:
| Knob | Meaning |
|---|---|
| HOLDER_P | constant $p$ (used when schedule is constant) |
| HOLDER_P_SCHEDULE | one of the schedules in The loss |
| HOLDER_P_MIN / _MAX | schedule bounds |
| HOLDER_P_SCHEDULE_STEPS | steps the schedule spans |
| HOLDER_P_POWER | exponent for quad / quad_dec |
| CLIPRANGE | PPO-style clip |
| MODEL_NAME | base policy |
| N_GPU / N_SAMPLE | parallelism / responses per prompt |
Edit the model=... path in scripts/eval.sh, then run the script.
Default suite (under datasets/evaluation_suite_v2/): AIME24, AIME25, AMC, MATH500, Minerva, OlympiadBench.
HölderPO aggregates per-token importance ratios with a Hölder p-mean along the sequence and feeds the result into a PPO-style clipped surrogate. The variant controls whether aggregation is sequence-level or token-level; the schedule controls how p evolves over training.
Variants (--critic_type_modify):
| Variant | Aggregation |
|---|---|
| holder | sequence-level Hölder p-mean over token ratios |
| holder_token | token-level Hölder p-mean |
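
As an illustration of the sequence-level aggregation described above, here is a minimal sketch in plain Python. It is not the repository's implementation: the function names `holder_mean`, `clipped_surrogate` and `sequence_objective` are hypothetical, the ratios are assumed to be strictly positive, and only the sequence-level case is shown because the exact form of the token-level variant is not specified here.

```python
import math


def holder_mean(ratios, p, eps=1e-8):
    """Hölder (power) mean of strictly positive values.

    p = 1 gives the arithmetic mean; p -> 0 is handled as the
    geometric mean, which is the limit of the power mean.
    """
    if not ratios:
        raise ValueError("ratios must be non-empty")
    n = len(ratios)
    if abs(p) < eps:
        # Limit case: exp of the mean log-ratio (geometric mean).
        return math.exp(sum(math.log(r) for r in ratios) / n)
    return (sum(r ** p for r in ratios) / n) ** (1.0 / p)


def clipped_surrogate(ratio, advantage, cliprange=0.2):
    """Standard PPO-style clipped objective for a single ratio."""
    clipped = min(max(ratio, 1.0 - cliprange), 1.0 + cliprange)
    return min(ratio * advantage, clipped * advantage)


def sequence_objective(token_ratios, advantage, p, cliprange=0.2):
    """Aggregate token ratios with a p-mean, then clip (assumed ordering)."""
    return clipped_surrogate(holder_mean(token_ratios, p), advantage, cliprange)


if __name__ == "__main__":
    token_ratios = [0.9, 1.1, 1.4]  # hypothetical per-token importance ratios
    for p in (-1.0, 0.0, 1.0, 2.0):
        print(p, holder_mean(token_ratios, p),
              sequence_objective(token_ratios, advantage=1.0, p=p))
```

Because the power mean is non-decreasing in p, larger p lets the largest token ratios dominate the aggregate, while smaller p pulls it towards the smaller ratios.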
Schedules (--holder_p_schedule):
| Schedule | Behaviour of p |
|---|---|
| constant | fixed p |
| linear / linear_dec | linear ramp up / down |
| sin / cos | sinusoidal ramp |
| quad / quad_dec | polynomial of order holder_p_power |
| cubic / cubic_dec | cubic ramp |
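
To make the schedule table concrete, the sketch below shows one plausible way to map a training step to a value of p. The exact formulas used by the codebase are not given here, so everything in it is an assumption: the function name `holder_p_at` is hypothetical, the ramps are taken to run from the lower to the upper bound as a power of normalised progress, and the `_dec` variants are taken to be the mirrored ramp. The sin and cos schedules are left out because their precise shape is not specified.

```python
def holder_p_at(step, schedule="constant", p=1.0, p_min=0.0, p_max=1.0,
                schedule_steps=100, power=2.0):
    """Return an illustrative value of p at a given training step.

    Assumed form: p_min + (p_max - p_min) * shape(t), where t is the
    training progress clamped to [0, 1].
    """
    if schedule == "constant":
        return p

    # Normalised progress through the schedule, clamped to [0, 1].
    t = min(max(step / schedule_steps, 0.0), 1.0)

    # Split e.g. "quad_dec" into ("quad", "dec"); plain names give suffix "".
    base, _, suffix = schedule.partition("_")
    exponents = {"linear": 1.0, "quad": power, "cubic": 3.0}
    if base not in exponents or suffix not in ("", "dec"):
        raise ValueError(f"unsupported schedule in this sketch: {schedule!r}")

    shape = t ** exponents[base]
    if suffix == "dec":
        shape = 1.0 - shape  # assumed: decreasing variants mirror the ramp
    return p_min + (p_max - p_min) * shape


if __name__ == "__main__":
    for name in ("linear", "linear_dec", "quad", "quad_dec", "cubic"):
        values = [holder_p_at(s, name, p_min=0.0, p_max=2.0) for s in (0, 50, 100)]
        print(name, values)
```

In this sketch the arguments mirror the HOLDER_P, HOLDER_P_MIN / _MAX, HOLDER_P_SCHEDULE_STEPS and HOLDER_P_POWER knobs, but the mapping between those knobs and the function parameters is itself an assumption.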
The agent-side experiments live on the agentic branch under alfworld/ and use the same Hölder loss, layered on top of verl-agent. See alfworld/HOLDER.md on that branch for setup and entry points.
This codebase builds on oat (maths RL stack) and the understand-r1-zero maths grader / data pipeline. The agent variant on the agentic branch forks verl-agent.
Apache-2.0 (see LICENSE). Vendored understand_r1_zero_main/ retains its upstream Apache-2.0 licence.