The Amazing Agent Race

Strong Tool Users, Weak Navigators

A benchmark of 1,400 DAG-structured scavenger-hunt puzzles for evaluating LLM agents on multi-step tool use, web navigation, and arithmetic reasoning.

1,400 puzzle legs · 19 tools · 4+ difficulty levels · 2 variants · 36.6% best accuracy

Why AAR?

Compositionality gap, agent accuracy, and failure analysis
(a) Existing benchmarks are 55–100% linear; AAR is 0% linear (all DAGs). (b) Best agent accuracy is 36.6%. (c) Navigation errors dominate (27–52%) while tool-use errors stay below 17%.

Key Findings

Strong tool users, weak navigators

Navigation errors dominate, occurring in 27–52% of trials, while tool-use errors stay below 17%. The bottleneck is reaching the right information, not using tools once found.

DAG structure amplifies the gap

Moving from linear to DAG drops navigation scores (PVR) by 14–18 pp while tool-use scores (RCR) stay stable or even improve.

Architecture ≥ model scale

Claude Code matches Codex CLI at ~37% accuracy with 6× fewer tokens. The framework gap is larger than the model-scale gap.

Reasoning models struggle

A 120B reasoning model achieves only 3.1% accuracy, well below even the 10% random-guess baseline, spending its budget on internal reasoning instead of tool calls.

Example Puzzle

Each puzzle (a leg) presents the agent with a seed Wikipedia URL, a cryptic riddle, and 19 tools. The agent must navigate pages, call APIs, and compute a single-digit passcode (0–9).

Example DAG trail puzzle
An example clue envelope: a DAG trail themed "NYSE to Global Finance" with 14 stops, showing route info, detour, roadblock, diamond fork, merge, and finish-line stops.

Results

Aggregate Performance (1,400 Legs)

Aggregate results and FA by difficulty
(a) FA, PVR, RCR across all configurations. PVR is consistently the weakest metric. (b) FA degrades monotonically with difficulty (−14 pp best to −19 pp worst).
| Configuration | FA | PVR (Nav.) | RCR (Tool) |
|---|---|---|---|
| Codex CLI + GPT-5.4 | 34.8% | 52.9% | 66.7% |
| Codex CLI + GPT-5.4-mini | 32.1% | 48.0% | 55.3% |
| mini-swe-agent + GPT-5.4 | 30.5% | 51.4% | 43.7% |
| mini-swe-agent + GPT-5.4-mini | 27.2% | | |
| Claude Code + Sonnet 4 | 36.6% | 46.8% | 67.4% |

DAG Structure Penalizes Navigation, Not Tool Use

Linear vs DAG delta
Percentage-point change from AAR-Linear to AAR-DAG. Navigation (PVR) drops 14–18 pp while tool use (RCR) remains stable or increases.

Benchmark Overview

| Variant | Legs | Structure | Avg Stops | Avg Tools |
|---|---|---|---|---|
| AAR-Linear | 800 | Sequential chains | 15.0 | 4.0 |
| AAR-DAG | 600 | Fork-merge diamonds | 22.1 | 12.0 |

Difficulty Levels

AAR ships with 4 difficulty levels, but the generation pipeline is fully adjustable—you can define custom levels by varying pit-stop count, roadblock density, detour frequency, diamond count, extraction type, and crawl depth.

| Level | Pit Stops | Roadblocks | Detours | Diamonds | Extraction |
|---|---|---|---|---|---|
| Easy | 3–6 | 1–2 | 1–2 | 1 | infobox, prose |
| Medium | 7–12 | 2–4 | 2–3 | 1–2 | + cross-section |
| Hard | 13–16 | 4–5 | 3–4 | 2–3 | + cross-section |
| Extreme | 17–21 | 5–7 | 4–6 | 3–5 | + cross-section |
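Since the pipeline's knobs are adjustable, a custom level could be expressed as a plain config. This is an illustrative sketch only: the key names below are assumptions about the knobs described above, not the generator's actual configuration API.

```python
# Hypothetical custom difficulty level, one notch past Extreme.
# Key names are illustrative assumptions, not the generator's real schema.
custom_level = {
    "name": "custom-brutal",
    "pit_stops": (22, 26),        # min/max pit-stop count
    "roadblocks": (6, 8),         # multi-step tool-chain stops
    "detours": (5, 7),            # analytical-transform stops
    "diamonds": (4, 6),           # fork-merge patterns (DAG variant)
    "extraction": ["infobox", "prose", "cross-section"],
    "crawl_depth": 4,             # how far from the seed page to crawl
}
```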

Generation Pipeline

Eight-step generation pipeline
Automated eight-step pipeline: Crawl, Plan, Build, Validate, Link, Augment, Execute, Verbalize—with validation gates producing three complementary metrics.

Generating New Puzzles

The generation pipeline is fully open—create new puzzles from any Wikipedia seed. Requires OPENAI_API_KEY (planning/verbalization) and GOOGLE_API_KEY (tool-chain validation).

# Generate 10 random-seed puzzles per difficulty level
./scripts/batch_generate.sh

# Generate from a specific Wikipedia article
uv run python src/trail/generate.py \
  --seed-url "https://en.wikipedia.org/wiki/Mount_Everest" \
  --difficulty hard --num-samples 5

# Generate DAG puzzles (with diamond fork-merge patterns)
uv run python src/trail/generate.py \
  --random-seeds --difficulty extreme --num-samples 10 \
  --compositional

# Use curated seed URLs
uv run python src/trail/generate.py \
  --seed-urls-file seeds/finance_seeds.txt \
  --difficulty medium --num-samples 20

Generated puzzles are saved as JSON in data/trail_puzzles/{difficulty}/. Convert them to Harbor tasks using the adapter (see Evaluation via Harbor).
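For a quick sanity check of a generation run, the output directory can be enumerated like this. Only the `data/trail_puzzles/{difficulty}/` layout comes from the docs above; the helper itself is a sketch and makes no assumptions about the JSON contents.

```python
# Sketch: count generated puzzle JSON files per difficulty directory.
# Assumes only the documented layout data/trail_puzzles/{difficulty}/.
from pathlib import Path

def list_puzzles(root: str = "data/trail_puzzles") -> dict[str, int]:
    """Return {difficulty: number of .json puzzle files} for each subdirectory."""
    counts: dict[str, int] = {}
    for level_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts[level_dir.name] = len(list(level_dir.glob("*.json")))
    return counts
```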

Leg Structure

A leg is a directed acyclic graph (DAG) of pit stops, each producing a typed value:

Route Info

Navigate to a Wikipedia page and extract a fact (e.g., a numeric infobox field, a date from prose).

Roadblock

Execute a multi-step tool chain, e.g., geocode a location then query the elevation API.

Detour

Apply an analytical transform to a prior value, e.g., next_prime(v), digit_sum(v).

Finish Line

Aggregate values from earlier stops via arithmetic to produce the final answer y* ∈ {0,...,9}.

Diamond patterns (DAG only): A source stop forks into two independent tool-chain branches (e.g., elevation and POI count), which merge into a combining stop. Diamond count scales with difficulty (1 for easy, up to 3–5 for extreme).
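The stop types above can be sketched as a tiny DAG evaluator. This is a toy illustration, not the benchmark's actual representation: the detour transforms follow the examples given (`digit_sum`, `next_prime`), while the mod-10 finish line and the stop values are assumptions.

```python
# Toy sketch of a leg: a DAG of pit stops, each producing a typed value.
# Illustrative only; the mod-10 finish line and example values are assumed.

def digit_sum(v: int) -> int:
    """Detour transform: sum of decimal digits."""
    return sum(int(d) for d in str(abs(v)))

def next_prime(v: int) -> int:
    """Detour transform: smallest prime strictly greater than v."""
    n = v + 1
    while any(n % p == 0 for p in range(2, int(n ** 0.5) + 1)):
        n += 1
    return n

# Each stop names its dependencies and how to compute its value.
# s1 forks into s2 and s3, which merge at the finish line (a diamond).
leg = {
    "s1": {"deps": [], "fn": lambda: 8848},       # route info (e.g. an infobox number)
    "s2": {"deps": ["s1"], "fn": digit_sum},      # detour branch
    "s3": {"deps": ["s1"], "fn": next_prime},     # detour branch
    "finish": {"deps": ["s2", "s3"], "fn": lambda a, b: (a + b) % 10},
}

def run_leg(leg: dict) -> int:
    """Evaluate stops in dependency order via memoized recursion."""
    values: dict[str, int] = {}
    def visit(name: str) -> int:
        if name not in values:
            stop = leg[name]
            values[name] = stop["fn"](*(visit(d) for d in stop["deps"]))
        return values[name]
    return visit("finish")
```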

Metrics

FA (Finish-line Accuracy): Does the agent's single-digit answer match the golden code?
PVR (Pit-stop Visit Rate): Fraction of required Wikipedia pages the agent actually visited.
RCR (Roadblock Completion Rate): Fraction of required tool chains the agent fully executed.
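As a rough illustration, the three metrics could be computed per leg from a trajectory log like this. The field names (`visited_pages`, `completed_chains`, and so on) are assumptions for the sketch, not the benchmark's actual log schema.

```python
# Hedged sketch: FA, PVR, RCR for a single leg.
# Field names are illustrative assumptions, not the benchmark's real schema.

def score_leg(gold: dict, trajectory: dict) -> dict:
    required_pages = set(gold["required_pages"])
    required_chains = set(gold["required_chains"])
    return {
        # FA: exact match on the single-digit answer
        "FA": float(trajectory["answer"] == gold["answer"]),
        # PVR: fraction of required Wikipedia pages actually visited
        "PVR": len(required_pages & set(trajectory["visited_pages"])) / len(required_pages),
        # RCR: fraction of required tool chains fully executed
        "RCR": len(required_chains & set(trajectory["completed_chains"])) / len(required_chains),
    }

gold = {
    "answer": 7,
    "required_pages": ["Mount_Everest", "Nepal", "Himalayas"],
    "required_chains": ["geocode->elevation", "geocode->weather"],
}
trajectory = {
    "answer": 7,
    "visited_pages": ["Mount_Everest", "Nepal"],
    "completed_chains": ["geocode->elevation"],
}
scores = score_leg(gold, trajectory)
```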

Tool Set

AAR provides 19 tools across seven categories:

Fetch & Search: fetch_webpage, web_search
Google Maps: maps_geocode, maps_reverse_geocode, maps_elevation, maps_distance_matrix, maps_directions, maps_search_places, maps_place_details
Weather: weather_historical, weather_forecast
Code: python_execute_code, python_generate_code
Countries: countries_population, countries_area
Stocks: stock_historical_price, stock_volume
Crypto: crypto_historical_price, crypto_volume

Evaluation via Harbor

AAR evaluations run through Harbor, an open-source agent evaluation framework. The dataset is published on the Harbor registry:

# Install Harbor
uv tool install harbor

# Run AAR with any agent
harbor run -d minnesotanlp/aar@1.0 -a claude-code -m anthropic/claude-sonnet-4-6

# Run with API keys
harbor run -d minnesotanlp/aar@1.0 -a claude-code -m anthropic/claude-sonnet-4-6 \
  --ae GOOGLE_API_KEY=$GOOGLE_API_KEY

Required API Keys

| Variable | Purpose | Required |
|---|---|---|
| GOOGLE_API_KEY | Maps, elevation, directions, places | Yes |
| OPENAI_API_KEY | If using OpenAI-based agents | Depends on agent |
| SERPER_API_KEY | Web search tool | Optional |

Local Evaluation

# Clone and generate Harbor tasks
git clone https://github.com/minnesotanlp/the-amazing-agent-race.git
cd the-amazing-agent-race

python harbor-adapter/run_adapter.py \
  --data-dir data/aar-linear --variant linear \
  --output-dir /path/to/harbor/datasets/aar

python harbor-adapter/run_adapter.py \
  --data-dir data/aar-dag --variant dag \
  --output-dir /path/to/harbor/datasets/aar

# Run locally
harbor run -p /path/to/harbor/datasets/aar -a claude-code -m anthropic/claude-sonnet-4-6

Quality Assurance

Every leg satisfies six invariants, enforced by the validation gates in the generation pipeline.

Citation

@inproceedings{aar2026,
  title={The Amazing Agent Race: Strong Tool Users, Weak Navigators},
  author={Anonymous},
  year={2026},
  note={Under review}
}