The Amazing Agent Race

Strong Tool Users, Weak Navigators

A benchmark of 1,400 DAG-structured scavenger-hunt puzzles for evaluating LLM agents on multi-step tool use, web navigation, and arithmetic reasoning.

1,400 puzzle legs · 19 tools · 4+ difficulty levels · 2 variants · 36.6% best accuracy

Why AAR?

Compositionality gap, agent accuracy, and failure analysis
(a) Existing benchmarks are 55–100% linear; AAR is 0% linear (all DAGs). (b) Best agent accuracy is 36.6%. (c) Navigation errors dominate (27–52%) while tool-use errors stay below 17%.

Key Findings

Strong tool users, weak navigators

Navigation errors dominate, occurring in 27–52% of trials, while tool-use errors stay below 17%. The bottleneck is reaching the right information, not using tools once found.

DAG structure amplifies the gap

Moving from linear to DAG drops navigation scores (PVR) by 14–18 pp while tool-use scores (RCR) stay stable or even improve.

Architecture ≥ model scale

Claude Code matches Codex CLI at ~37% accuracy with 6× fewer tokens. The framework gap is larger than the model-scale gap.

Reasoning models struggle

A 120B reasoning model achieves only 3.1% accuracy, well below even the 10% random-guess baseline, spending its budget on internal reasoning instead of tool calls.

Example Puzzle

Each puzzle (a leg) presents the agent with a seed Wikipedia URL, a cryptic riddle, and 19 tools. The agent must navigate pages, call APIs, and compute a single-digit passcode (0–9).

Example DAG trail puzzle
An example clue envelope: a DAG trail themed "NYSE to Global Finance" with 14 stops, showing route info, detour, roadblock, diamond fork, merge, and finish-line stops.

Results

Aggregate Performance (1,400 Legs)

Aggregate results and FA by difficulty
(a) FA, PVR, RCR across all configurations. PVR is consistently the weakest metric. (b) FA degrades monotonically with difficulty (−14 pp best to −19 pp worst).
| Configuration | FA | PVR (Nav.) | RCR (Tool) |
|---|---|---|---|
| Codex CLI + GPT-5.4 | 34.8% | 52.9% | 66.7% |
| Codex CLI + GPT-5.4-mini | 32.1% | 48.0% | 55.3% |
| mini-swe-agent + GPT-5.4 | 30.5% | 51.4% | 43.7% |
| mini-swe-agent + GPT-5.4-mini | 27.2% | | |
| Claude Code + Sonnet 4 | 36.6% | 46.8% | 67.4% |

DAG Structure Penalizes Navigation, Not Tool Use

Linear vs DAG delta
Percentage-point change from AAR-Linear to AAR-DAG. Navigation (PVR) drops 14–18 pp while tool use (RCR) remains stable or increases.

Benchmark Overview

| Variant | Legs | Structure | Avg Stops | Avg Tools |
|---|---|---|---|---|
| AAR-Linear | 800 | Sequential chains | 15.0 | 4.0 |
| AAR-DAG | 600 | Fork-merge diamonds | 22.1 | 12.0 |

Difficulty Levels

AAR ships with 4 difficulty levels, but the generation pipeline is fully adjustable—you can define custom levels by varying pit-stop count, roadblock density, detour frequency, diamond count, extraction type, and crawl depth.

| Level | Pit Stops | Roadblocks | Detours | Diamonds | Extraction |
|---|---|---|---|---|---|
| Easy | 3–6 | 1–2 | 1–2 | 1 | infobox, prose |
| Medium | 7–12 | 2–4 | 2–3 | 1–2 | + cross-section |
| Hard | 13–16 | 4–5 | 3–4 | 2–3 | + cross-section |
| Extreme | 17–21 | 5–7 | 4–6 | 3–5 | + cross-section |
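Since the pipeline's knobs are adjustable, a custom level could be expressed as a plain config. This is an illustrative sketch only: the key names below are assumptions about the knobs described above, not the generator's actual configuration API.

```python
# Hypothetical custom difficulty level, one notch past Extreme.
# Key names are illustrative assumptions, not the generator's real schema.
custom_level = {
    "name": "custom-brutal",
    "pit_stops": (22, 26),        # min/max pit-stop count
    "roadblocks": (6, 8),         # multi-step tool-chain stops
    "detours": (5, 7),            # analytical-transform stops
    "diamonds": (4, 6),           # fork-merge patterns (DAG variant)
    "extraction": ["infobox", "prose", "cross-section"],
    "crawl_depth": 4,             # how far from the seed page to crawl
}
```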

Generation Pipeline

Eight-step generation pipeline
Automated eight-step pipeline: Crawl, Plan, Build, Validate, Link, Augment, Execute, Verbalize—with validation gates producing three complementary metrics.

Generating New Puzzles

The generation pipeline is fully open—create new puzzles from any Wikipedia seed. Requires OPENAI_API_KEY (planning/verbalization) and GOOGLE_API_KEY (tool-chain validation).

# Generate 10 random-seed puzzles per difficulty level
./scripts/batch_generate.sh

# Generate from a specific Wikipedia article
uv run python src/trail/generate.py \
  --seed-url "https://en.wikipedia.org/wiki/Mount_Everest" \
  --difficulty hard --num-samples 5

# Generate DAG puzzles (with diamond fork-merge patterns)
uv run python src/trail/generate.py \
  --random-seeds --difficulty extreme --num-samples 10 \
  --compositional

# Use curated seed URLs
uv run python src/trail/generate.py \
  --seed-urls-file seeds/finance_seeds.txt \
  --difficulty medium --num-samples 20

Generated puzzles are saved as JSON in data/trail_puzzles/{difficulty}/. Convert them to Harbor tasks using the adapter (see Evaluation via Harbor).
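For a quick sanity check of a generation run, the output directory can be enumerated like this. Only the `data/trail_puzzles/{difficulty}/` layout comes from the docs above; the helper itself is a sketch and makes no assumptions about the JSON contents.

```python
# Sketch: count generated puzzle JSON files per difficulty directory.
# Assumes only the documented layout data/trail_puzzles/{difficulty}/.
from pathlib import Path

def list_puzzles(root: str = "data/trail_puzzles") -> dict[str, int]:
    """Return {difficulty: number of .json puzzle files} for each subdirectory."""
    counts: dict[str, int] = {}
    for level_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts[level_dir.name] = len(list(level_dir.glob("*.json")))
    return counts
```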

Leg Structure

A leg is a directed acyclic graph (DAG) of pit stops, each producing a typed value:

Route Info

Navigate to a Wikipedia page and extract a fact (e.g., a numeric infobox field, a date from prose).

Roadblock

Execute a multi-step tool chain, e.g., geocode a location then query the elevation API.

Detour

Apply an analytical transform to a prior value, e.g., next_prime(v), digit_sum(v).

Finish Line

Aggregate values from earlier stops via arithmetic to produce the final answer y* ∈ {0,...,9}.

Diamond patterns (DAG only): A source stop forks into two independent tool-chain branches (e.g., elevation and POI count), which merge into a combining stop. Diamond count scales with difficulty (1 for easy, up to 3–5 for extreme).
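The stop types above can be sketched as a tiny DAG evaluator. This is a toy illustration, not the benchmark's actual representation: the detour transforms follow the examples given (`digit_sum`, `next_prime`), while the mod-10 finish line and the stop values are assumptions.

```python
# Toy sketch of a leg: a DAG of pit stops, each producing a typed value.
# Illustrative only; the mod-10 finish line and example values are assumed.

def digit_sum(v: int) -> int:
    """Detour transform: sum of decimal digits."""
    return sum(int(d) for d in str(abs(v)))

def next_prime(v: int) -> int:
    """Detour transform: smallest prime strictly greater than v."""
    n = v + 1
    while any(n % p == 0 for p in range(2, int(n ** 0.5) + 1)):
        n += 1
    return n

# Each stop names its dependencies and how to compute its value.
# s1 forks into s2 and s3, which merge at the finish line (a diamond).
leg = {
    "s1": {"deps": [], "fn": lambda: 8848},       # route info (e.g. an infobox number)
    "s2": {"deps": ["s1"], "fn": digit_sum},      # detour branch
    "s3": {"deps": ["s1"], "fn": next_prime},     # detour branch
    "finish": {"deps": ["s2", "s3"], "fn": lambda a, b: (a + b) % 10},
}

def run_leg(leg: dict) -> int:
    """Evaluate stops in dependency order via memoized recursion."""
    values: dict[str, int] = {}
    def visit(name: str) -> int:
        if name not in values:
            stop = leg[name]
            values[name] = stop["fn"](*(visit(d) for d in stop["deps"]))
        return values[name]
    return visit("finish")
```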

Metrics

FA (Finish-line Accuracy): Does the agent's single-digit answer match the golden code?
PVR (Pit-stop Visit Rate): Fraction of required Wikipedia pages the agent actually visited.
RCR (Roadblock Completion Rate): Fraction of required tool chains the agent fully executed.
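As a rough illustration, the three metrics could be computed per leg from a trajectory log like this. The field names (`visited_pages`, `completed_chains`, and so on) are assumptions for the sketch, not the benchmark's actual log schema.

```python
# Hedged sketch: FA, PVR, RCR for a single leg.
# Field names are illustrative assumptions, not the benchmark's real schema.

def score_leg(gold: dict, trajectory: dict) -> dict:
    required_pages = set(gold["required_pages"])
    required_chains = set(gold["required_chains"])
    return {
        # FA: exact match on the single-digit answer
        "FA": float(trajectory["answer"] == gold["answer"]),
        # PVR: fraction of required Wikipedia pages actually visited
        "PVR": len(required_pages & set(trajectory["visited_pages"])) / len(required_pages),
        # RCR: fraction of required tool chains fully executed
        "RCR": len(required_chains & set(trajectory["completed_chains"])) / len(required_chains),
    }

gold = {
    "answer": 7,
    "required_pages": ["Mount_Everest", "Nepal", "Himalayas"],
    "required_chains": ["geocode->elevation", "geocode->weather"],
}
trajectory = {
    "answer": 7,
    "visited_pages": ["Mount_Everest", "Nepal"],
    "completed_chains": ["geocode->elevation"],
}
scores = score_leg(gold, trajectory)
```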

Tool Set

AAR provides 19 tools across seven categories:

Fetch & Search: fetch_webpage, web_search
Google Maps: maps_geocode, maps_reverse_geocode, maps_elevation, maps_distance_matrix, maps_directions, maps_search_places, maps_place_details
Weather: weather_historical, weather_forecast
Code: python_execute_code, python_generate_code
Countries: countries_population, countries_area
Stocks: stock_historical_price, stock_volume
Crypto: crypto_historical_price, crypto_volume

Evaluation via Harbor

AAR evaluations run through Harbor, an open-source agent evaluation framework. The dataset is published on the Harbor registry:

# Install Harbor
uv tool install harbor

# Run AAR with any agent
harbor run -d minnesotanlp/aar@1.0 -a claude-code -m anthropic/claude-sonnet-4-6

# Run with API keys
harbor run -d minnesotanlp/aar@1.0 -a claude-code -m anthropic/claude-sonnet-4-6 \
  --ae GOOGLE_API_KEY=$GOOGLE_API_KEY

Required API Keys

| Variable | Purpose | Required |
|---|---|---|
| GOOGLE_API_KEY | Maps, elevation, directions, places | Yes |
| OPENAI_API_KEY | If using OpenAI-based agents | Depends on agent |
| SERPER_API_KEY | Web search tool | Optional |

Local Evaluation

# Clone and generate Harbor tasks
git clone https://github.com/minnesotanlp/the-amazing-agent-race.git
cd the-amazing-agent-race

python harbor-adapter/run_adapter.py \
  --data-dir data/aar-linear --variant linear \
  --output-dir /path/to/harbor/datasets/aar

python harbor-adapter/run_adapter.py \
  --data-dir data/aar-dag --variant dag \
  --output-dir /path/to/harbor/datasets/aar

# Run locally
harbor run -p /path/to/harbor/datasets/aar -a claude-code -m anthropic/claude-sonnet-4-6

Quality Assurance

Every leg satisfies six invariants, enforced by the validation gates in the generation pipeline.

Citation

@inproceedings{aar2026,
  title={The Amazing Agent Race: Strong Tool Users, Weak Navigators},
  author={Anonymous},
  year={2026},
  note={Under review}
}