Towards Direct Evaluation of Harness Optimizers via Priority Ranking
This work takes a first step towards the direct evaluation of harness optimizers by quantifying their step-level optimization ability in a cost- and time-efficient manner.
- Direct Optimizer Evaluation: Evaluates harness optimizers directly rather than using the target agents' task improvement as a proxy
- Priority Ranking: Quantifies an optimizer's ability to prioritize the harness components (i.e., prompt, tool, workflow, and memory) that are expected to bring more improvement to the target agent
- Human-Verified Dataset, SHOR: Includes 182 curated optimization scenarios collected from real optimization trajectories
- Multi-Domain Coverage: Supports SWE-bench Verified, GAIA, Spider 2.0-lite, and τ²-Bench
- Cost-Efficient Evaluation: With SHOR, evaluating a harness optimizer via priority ranking is on average 8× cheaper and 17× faster than the conventional approach of observing end improvement from full harness optimization.
- SHOR
- 182 human-verified harnesses
- SHOR-Flaw
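The priority-ranking idea from the feature list can be illustrated with a small scoring sketch. This is not SHOR's actual metric, which this README does not specify. The function names, the `reference_gain` dictionary, the numbers in it, and the two scores (top-1 hit and pairwise agreement) are assumptions chosen only to show what "prioritizing components expected to bring more improvement" could look like in code.

```python
from itertools import combinations


def top1_hit(predicted, reference_gain):
    """Return True if the optimizer's first choice has the highest reference gain."""
    best = max(reference_gain.values())
    return reference_gain[predicted[0]] == best


def pairwise_agreement(predicted, reference_gain):
    """Fraction of component pairs ordered consistently with the reference gains.

    Pairs with equal reference gain are skipped because neither order is wrong.
    """
    position = {name: i for i, name in enumerate(predicted)}
    agree = total = 0
    for a, b in combinations(predicted, 2):
        if reference_gain[a] == reference_gain[b]:
            continue
        total += 1
        # The pair agrees when the earlier-ranked component has the larger gain.
        if (position[a] < position[b]) == (reference_gain[a] > reference_gain[b]):
            agree += 1
    return agree / total if total else 1.0


# Hypothetical scenario: the gains below are made up for illustration only.
reference_gain = {"prompt": 0.02, "tool": 0.10, "workflow": 0.05, "memory": 0.0}
predicted = ["tool", "prompt", "workflow", "memory"]

print(top1_hit(predicted, reference_gain))
print(pairwise_agreement(predicted, reference_gain))
```

In this made-up scenario the optimizer picks the best component first but places `prompt` ahead of `workflow`, so one of the six pairs disagrees with the reference.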
conda create -n shor python=3.10
conda activate shor
pip install -r requirements.txt
Set the API keys for the providers you plan to use:
export OPENAI_API_KEY=your_openai_api_key
export ANTHROPIC_API_KEY=your_anthropic_api_key
export GEMINI_API_KEY=your_gemini_api_key
export OPENROUTER_API_KEY=your_openrouter_api_key
export SERPER_API_KEY=your_serper_api_key # Serper API key for web search (only needed for the GAIA domain)
export LLM_API_KEY=your_api_key # optional, for the OpenHands CLI
export LLM_BASE_URL=https://your-proxy.example.com/v1 # optional, for the OpenHands CLI
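Before launching a long run, it can help to check which of these variables are actually visible to Python. The following is a minimal sketch, not part of the repository: the variable names come from the export lines above, while the helper `missing_keys` and the split into provider and optional keys are assumptions made for illustration.

```python
import os

# Variable names are taken from the export lines above; the grouping into
# "provider" and "optional" keys is an assumption made for this sketch.
PROVIDER_KEYS = [
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GEMINI_API_KEY",
    "OPENROUTER_API_KEY",
]
OPTIONAL_KEYS = ["SERPER_API_KEY", "LLM_API_KEY", "LLM_BASE_URL"]


def missing_keys(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    for name in missing_keys(PROVIDER_KEYS):
        print(f"not set: {name} (fine if you do not use this provider)")
    for name in missing_keys(OPTIONAL_KEYS):
        print(f"not set: {name} (optional)")
```

Passing a plain dictionary as `env` keeps the helper easy to test without touching the real environment.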
bash scripts/download_shor_data.sh
To implement your own coding agent as a harness optimizer, follow build_harness_optimizer.md.
Built-in optimizers for reference:
- openhands_cli: OpenHands CLI adapter
- claude_code_cli: Claude Code CLI adapter
- codex_cli: Codex CLI adapter
If you want to use the built-in optimizer adapters, install the corresponding CLI tools first.
OpenHands CLI
Install via uv (recommended):
uv tool install openhands --python 3.12
Claude Code
npm install -g @anthropic-ai/claude-code
Codex CLI
npm install -g @openai/codex
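After installing, a quick check that each CLI is on your `PATH` can save a failed run. The sketch below uses `shutil.which` from the Python standard library. The executable names (`openhands`, `claude`, `codex`) are assumptions about what each package installs and are not confirmed by this README, so verify them against the documentation of each tool.

```python
import shutil

# Assumed executable names for each built-in adapter; verify them against the
# documentation of the CLI you installed.
ADAPTER_EXECUTABLES = {
    "openhands_cli": "openhands",
    "claude_code_cli": "claude",
    "codex_cli": "codex",
}


def find_cli(adapter, which=shutil.which):
    """Return the resolved path of the adapter's CLI, or None if it is not on PATH."""
    return which(ADAPTER_EXECUTABLES[adapter])


if __name__ == "__main__":
    for adapter in ADAPTER_EXECUTABLES:
        path = find_cli(adapter)
        print(f"{adapter}: {path or 'not found on PATH'}")
```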
python src/shor/run_shor.py --optimizer your_optimizer_name
# Run only the first 10
python src/shor/run_shor.py --optimizer your_optimizer_name --limit 10
# Run in parallel
python src/shor/run_shor.py --optimizer your_optimizer_name --parallel 4
python src/shor/eval/evaluate_shor_results.py result/your_optimizer_name
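To compare several optimizers, the run and evaluation commands above can be chained from a small driver script. This is a hedged sketch rather than a tool shipped with the repository: it uses only the flags shown above (`--optimizer`, `--limit`, `--parallel`) and assumes that results for an optimizer are written to `result/<optimizer_name>`, as the evaluation command suggests. The function names are hypothetical.

```python
import subprocess
import sys


def run_command(optimizer, limit=None, parallel=None):
    """Build the run_shor.py command using only the flags shown above."""
    cmd = [sys.executable, "src/shor/run_shor.py", "--optimizer", optimizer]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    if parallel is not None:
        cmd += ["--parallel", str(parallel)]
    return cmd


def eval_command(optimizer):
    """Build the evaluation command; assumes results land in result/<optimizer>."""
    return [
        sys.executable,
        "src/shor/eval/evaluate_shor_results.py",
        f"result/{optimizer}",
    ]


def run_and_evaluate(optimizers, limit=None, parallel=None, runner=subprocess.run):
    """Run each optimizer, then evaluate it; stop on the first failing command."""
    for name in optimizers:
        runner(run_command(name, limit, parallel), check=True)
        runner(eval_command(name), check=True)


if __name__ == "__main__":
    # Start from the repository root so the relative script paths resolve.
    # Optimizer names are read from the command line.
    run_and_evaluate(sys.argv[1:], limit=10)
```

Saved under a name of your choice and given optimizer names as arguments from the repository root, it runs the first 10 scenarios for each optimizer and then evaluates the results.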
If you use this repository in your research, please cite:
TBD
.
├── data/                       # Agent assets, including SHOR and SHOR-Flaw
│   ├── gaia/                   # GAIA-domain agent assets
│   └── ...                     # Other domain data
├── scripts/
│   └── download_shor_data.sh   # Script for downloading data
└── src/
    ├── harness_optimizer/      # Harness optimizer interfaces and built-in adapters
    └── shor/                   # SHOR execution, configuration, and evaluation pipeline