Navigation errors dominate 27–52% of trials while tool-use errors stay below 17%. The bottleneck is reaching the right information, not using tools once found.
Moving from linear to DAG drops navigation scores (PVR) by 14–18 pp while tool-use scores (RCR) stay stable or even improve.
Claude Code + Sonnet 4 matches Codex CLI + GPT-5.4 (36.6% vs. 34.8% accuracy) while using 6× fewer tokens. The framework gap is larger than the model-scale gap.
A 120B reasoning model achieves only 3.1% accuracy—barely above the 10% random baseline—spending its budget on internal reasoning instead of tool calls.
Each puzzle (a leg) presents the agent with a seed Wikipedia URL, a cryptic riddle, and 19 tools. The agent must navigate pages, call APIs, and compute a single-digit passcode (0–9).
| Configuration | FA | PVR (Nav.) | RCR (Tool) |
|---|---|---|---|
| Codex CLI + GPT-5.4 | 34.8% | 52.9% | 66.7% |
| Codex CLI + GPT-5.4-mini | 32.1% | 48.0% | 55.3% |
| mini-swe-agent + GPT-5.4 | 30.5% | 51.4% | 43.7% |
| mini-swe-agent + GPT-5.4-mini | 27.2% | — | — |
| Claude Code + Sonnet 4 | 36.6% | 46.8% | 67.4% |
| Variant | Legs | Structure | Avg Stops | Avg Tools |
|---|---|---|---|---|
| AAR-Linear | 800 | Sequential chains | 15.0 | 4.0 |
| AAR-DAG | 600 | Fork-merge diamonds | 22.1 | 12.0 |
AAR ships with four difficulty levels, but the generation pipeline is fully adjustable: you can define custom levels by varying pit-stop count, roadblock density, detour frequency, diamond count, extraction type, and crawl depth.
| Level | Pit Stops | Roadblocks | Detours | Diamonds | Extraction |
|---|---|---|---|---|---|
| Easy | 3–6 | 1–2 | 1–2 | 1 | infobox, prose |
| Medium | 7–12 | 2–4 | 2–3 | 1–2 | + cross-section |
| Hard | 13–16 | 4–5 | 3–4 | 2–3 | + cross-section |
| Extreme | 17–21 | 5–7 | 4–6 | 3–5 | + cross-section |
The generation pipeline is fully open: you can create new puzzles from any Wikipedia seed. Generation requires `OPENAI_API_KEY` (planning/verbalization) and `GOOGLE_API_KEY` (tool-chain validation).
```bash
# Generate 10 random-seed puzzles per difficulty level
./scripts/batch_generate.sh

# Generate from a specific Wikipedia article
uv run python src/trail/generate.py \
    --seed-url "https://en.wikipedia.org/wiki/Mount_Everest" \
    --difficulty hard --num-samples 5

# Generate DAG puzzles (with diamond fork-merge patterns)
uv run python src/trail/generate.py \
    --random-seeds --difficulty extreme --num-samples 10 \
    --compositional

# Use curated seed URLs
uv run python src/trail/generate.py \
    --seed-urls-file seeds/finance_seeds.txt \
    --difficulty medium --num-samples 20
```
Generated puzzles are saved as JSON in `data/trail_puzzles/{difficulty}/`. Convert them to Harbor tasks using the adapter (see Evaluation via Harbor).
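To give a feel for what a generated puzzle file contains, here is a rough sketch of one after deserialization. Every field name and value below is an illustrative assumption; the pipeline's actual schema may differ:

```python
import json

# Illustrative sketch of a generated puzzle; field names are assumptions,
# not the pipeline's actual schema.
puzzle = {
    "seed_url": "https://en.wikipedia.org/wiki/Mount_Everest",
    "difficulty": "hard",
    "riddle": "...",  # the verbalized riddle text shown to the agent
    "stops": [
        {"id": "s1", "type": "wiki_extract", "depends_on": []},
        {"id": "s2", "type": "tool_chain", "depends_on": ["s1"]},
        {"id": "s3", "type": "aggregate", "depends_on": ["s2"]},
    ],
    "answer": 7,  # single-digit passcode in {0, ..., 9}
}

# A puzzle like this would round-trip losslessly through JSON on disk.
assert json.loads(json.dumps(puzzle)) == puzzle
```

The `stops` list encodes the leg's DAG via `depends_on` edges, matching the pit-stop structure described below.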
A leg is a directed acyclic graph (DAG) of pit stops, each producing a typed value:
- Navigate to a Wikipedia page and extract a fact (e.g., a numeric infobox field, a date from prose).
- Execute a multi-step tool chain, e.g., geocode a location and then query the elevation API.
- Apply an analytical transform to a prior value, e.g., `next_prime(v)`, `digit_sum(v)`.
- Aggregate values from earlier stops via arithmetic to produce the final answer y* ∈ {0, ..., 9}.
Diamond patterns (DAG only): A source stop forks into two independent tool-chain branches (e.g., elevation and POI count), which merge into a combining stop. Diamond count scales with difficulty (1 for easy, up to 3–5 for extreme).
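To make the value flow concrete, here is a minimal Python sketch of a diamond-style leg. The input value, branch choices, and helper implementations are illustrative assumptions, not taken from any actual puzzle:

```python
def digit_sum(v: int) -> int:
    """Sum of decimal digits: digit_sum(8848) = 8+8+4+8 = 28."""
    return sum(int(d) for d in str(abs(v)))

def next_prime(v: int) -> int:
    """Smallest prime strictly greater than v (trial division)."""
    n = v + 1
    while any(n % p == 0 for p in range(2, int(n**0.5) + 1)):
        n += 1
    return n

# Source stop: a fact extracted from a Wikipedia page (illustrative value).
elevation_m = 8848

# Fork: two independent branches transform the source value.
a = digit_sum(elevation_m)         # branch A: 28
b = next_prime(elevation_m % 100)  # branch B: next prime after 48 -> 53

# Merge stop: aggregate the branch values into a single-digit passcode.
passcode = (a + b) % 10
print(passcode)  # -> 1
```

The fork-merge shape is what makes DAG legs harder to navigate: both branches must be resolved correctly before the merge stop can be computed.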
AAR provides 19 tools across eight categories:
- **Web:** `fetch_webpage`, `web_search`
- **Maps:** `maps_geocode`, `maps_reverse_geocode`, `maps_elevation`, `maps_distance_matrix`, `maps_directions`, `maps_search_places`, `maps_place_details`
- **Weather:** `weather_historical`, `weather_forecast`
- **Python:** `python_execute_code`, `python_generate_code`
- **Countries:** `countries_population`, `countries_area`
- **Stocks:** `stock_historical_price`, `stock_volume`
- **Crypto:** `crypto_historical_price`, `crypto_volume`
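These tools compose into the chains described earlier (geocode a location, then query its elevation). The sketch below uses hypothetical wrappers with made-up return values, since the benchmark's actual tool-call interface is not shown here:

```python
# Hypothetical stand-ins for two AAR maps tools; real signatures and
# return formats may differ.
def maps_geocode(query: str) -> dict:
    # Would call a geocoding API; returns illustrative coordinates.
    return {"lat": 27.9881, "lng": 86.9250}

def maps_elevation(lat: float, lng: float) -> float:
    # Would call an elevation API; returns an illustrative elevation in meters.
    return 8848.0

# Two-step chain: geocode first, then feed the coordinates to elevation.
loc = maps_elevation.__name__ and maps_geocode("Mount Everest")
elevation = maps_elevation(loc["lat"], loc["lng"])
print(round(elevation))  # -> 8848
```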
AAR evaluations run through Harbor, an open-source agent evaluation framework. The dataset is published on the Harbor registry:
```bash
# Install Harbor
uv tool install harbor

# Run AAR with any agent
harbor run -d minnesotanlp/aar@1.0 -a claude-code -m anthropic/claude-sonnet-4-6

# Run with API keys
harbor run -d minnesotanlp/aar@1.0 -a claude-code -m anthropic/claude-sonnet-4-6 \
    --ae GOOGLE_API_KEY=$GOOGLE_API_KEY
```
| Variable | Purpose | Required |
|---|---|---|
| `GOOGLE_API_KEY` | Maps, elevation, directions, places | Yes |
| `OPENAI_API_KEY` | If using OpenAI-based agents | Depends on agent |
| `SERPER_API_KEY` | Web search tool | Optional |
```bash
# Clone and generate Harbor tasks
git clone https://github.com/minnesotanlp/the-amazing-agent-race.git
cd the-amazing-agent-race

python harbor-adapter/run_adapter.py \
    --data-dir data/aar-linear --variant linear \
    --output-dir /path/to/harbor/datasets/aar

python harbor-adapter/run_adapter.py \
    --data-dir data/aar-dag --variant dag \
    --output-dir /path/to/harbor/datasets/aar

# Run locally
harbor run -p /path/to/harbor/datasets/aar -a claude-code -m anthropic/claude-sonnet-4-6
```
Every leg satisfies six invariants:
```bibtex
@inproceedings{aar2026,
  title={The Amazing Agent Race: Strong Tool Users, Weak Navigators},
  author={Anonymous},
  year={2026},
  note={Under review}
}
```