agent-bench

reference device = 100 per pillar · comparisons valid only within a suite version · suite

↑ higher is better · ↓ lower is better · green = best in column · hover a heading for details · click a heading to sort · click a device for history · 🔋 battery · 🛡 Defender on

What do these columns mean?

Inference / Agent Env ↑: Composite scores. Inference = local LLM capability; Agent Env = speed of the local work a cloud agent does between model calls. Geometric mean of key metrics, reference device = 100. The grey multiplier is the same number as a ratio (160 = 1.60×).
Mem BW ↑: CPU-side memory bandwidth (STREAM triad, GB/s). Informational — never scored — it explains decode results: decode speed tracks the bandwidth of whichever memory the model sits in.
8B / 32B decode ↑: Tokens generated per second. The 32B column exposes the memory cliff: a model spilling out of VRAM collapses here.
8B prefill ↑: Prompt-reading tokens/sec — raw compute; where GPUs shine.
TTFT@32k ↓: Measured wait before the first token on a 32,768-token prompt (real request via llama-server).
Max model ↑: Largest test model that runs usably (≥3 tok/s decode). Completing at a crawl after spilling out of fast memory doesn't count.
Sustained ↑: Throughput after 20 min continuous generation ÷ start. Below 1.00 = thermal throttling.
Conc× ↑: Aggregate throughput of 4 simultaneous generation streams vs 1. How well the machine serves multi-agent load.
tok/W ↑: Decode tokens per watt of measured power during the 8B model run. Higher = more efficient. Power source varies by platform (macOS: full CPU+GPU+ANE package via powermetrics, requires sudo; Windows: GPU only via nvidia-smi; Linux: CPU RAPL + GPU where available). Never scored — informational context for the inference numbers.
Spawns / File ops ↑, Git / Build ↓: The agent-environment tests: process launches/sec, small-file ops/sec, git status+log+diff time, and cold compile of a fixed project with a pinned Go toolchain.
Task suitability verdicts: Local 8B chat: 8B decode ✓≥30 ~≥10 t/s. Local 32B models: must fit usably, then 32B decode ✓≥10 ~≥4 t/s. Long-context local (32k): TTFT ✓≤15 ~≤60 s. Cloud coding agent (Claude Code / Codex / Cowork — local tool-execution speed): worst of file ops ✓≥15k ~≥5k, spawns ✓≥100 ~≥40, build ✓≤5 ~≤12 s. Multi-agent local: 4-stream aggregate (8B decode × Conc×) ✓≥150 ~≥50 t/s. Marathon sessions: sustained ✓≥0.95 ~≥0.80.