JobBench: Aligning Agent Work with Human Desire

Measuring agents by GDP alone asks how much of a human's job can be taken away.

JobBench asks how much of that job can be given back — built on the work that experts across 35 real-world professions actually want delegated to AI.

2,066 fact-anchored criteria, scored only when the entire reasoning chain is sound.

Current leader: GPT-5.4 (OpenAI · via Codex CLI)
Weighted score: 37.2%

In collaboration with

University of Washington
UC Santa Barbara
Stanford University
Carnegie Mellon University
University of Notre Dame
IBM Research
BakeAI
Michigan State University
UC Berkeley
§ Why Human Desire

Economics alone is not enough.

The conversation about AI in the workplace has been framed almost entirely in economic terms: What fraction of working hours can agents absorb? How much of GDP is exposed to automation? Benchmarks like OpenAI's GDPval inherit this framing by design — they select tasks that represent economic value and score agents on whether they can deliver the professional knowledge output.

We believe this framing, on its own, is not enough.

If agents are going to share the professional workplace with humans, the question is not only what work is most economically valuable to automate, but what work do the humans in that role actually want automated? This is a humanist problem. It treats the professional not as labor to be displaced, but as a collaborator whose judgment about their own craft matters — and it is the premise JobBench is built on.

The economic question

GDPval

OpenAI

“What fraction of a human's job is economically valuable to automate?”

  • Task selection: by economic exposure
  • What it measures: knowledge delivery from clean inputs
  • The professional: labor to be displaced
The humanist question

JobBench

Ours

“What work do the humans in that role actually want automated?”

  • Task selection: from Workbank — 1,500+ workers' automation preferences
  • What it measures: professional reasoning across messy, contradictory sources
  • The professional: a craft to be enhanced, not replaced
§ 01 — Rankings

Model leaderboard

Overall weighted score across evaluated tasks, measured by rubric-based, fact-anchored assessment.

1 · GPT-5.4 · 37.2
2 · Claude Sonnet 4.6 · 36.3
3 · Claude Opus 4.6 · 35.4
4 · GPT-5.2 · 33.6
5 · GPT-5.3 Codex · 33.1
6 · Claude Opus 4.5 · 31.0
7 · Claude Sonnet 4.5 · 26.8
8 · GPT-5.1 Codex · 26.2
9 · GPT-5.2 Codex · 24.8
10 · Claude Opus 4 · 20.9
11 · Claude Sonnet 4 · 17.9
12 · Qwen 3.5 Plus · 17.6
13 · Claude Haiku 4.5 · 15.2
14 · MiniMax M2.5 · 14.2
15 · Gemini 3 Pro · 10.9
16 · Gemini 3 Flash · 10.8
17 · Kimi K2.5 · 8.6
18 · Grok 4.2 Fast · 4.2

Score = weighted rubric score across all evaluated tasks.
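To make that definition concrete, here is a minimal Python sketch of weighted rubric aggregation. The data shapes (a per-task weight plus a list of binary criterion verdicts) and the weighting scheme are illustrative assumptions, not the benchmark's actual harness:

```python
def task_score(criteria: list[bool]) -> float:
    """Fraction of binary rubric criteria passed for one task."""
    return sum(criteria) / len(criteria)

def weighted_score(tasks: list[tuple[float, list[bool]]]) -> float:
    """Weighted average of per-task rubric scores, as a percentage."""
    total_weight = sum(w for w, _ in tasks)
    return 100 * sum(w * task_score(c) for w, c in tasks) / total_weight

# Two hypothetical tasks: weight 2.0 with 3/4 criteria passed, weight 1.0 with 1/2.
print(weighted_score([(2.0, [True, True, True, False]), (1.0, [True, False])]))
```

With equal weights this reduces to a plain mean of per-task rubric scores; the published leaderboard numbers may weight tasks differently.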

§ 02 — Headroom

GDPval is saturating. JobBench isn’t.

Top-model score (GPT-5.4): GDPval (saturating) 83.0 · JobBench 37.2 (63 pts of headroom)

GDPval / JobBench, by model:
  • GPT-5.2 Codex: 70.9 / 24.8
  • GPT-5.3 Codex: 70.9 / 33.1
  • GPT-5.4: 83.0 / 37.2

Per-task workload, JobBench over GDPval:
  • Wall-clock per task: 1.64×
  • Tool calls per task: 1.33×
  • Trajectory lines: 1.30×
§ 03 — Breakdown

Per-profession heatmap

[Heatmap omitted: weighted score (%) per occupation × model for all 18 evaluated models. n = tasks per occupation · header value = model's overall score · scale buckets: 0–10% · 10–20% · 20–30% · 30–40% · 40%+.]

The 35 occupations, by group:

  • Business / Financial Ops: Bookkeeping & Accounting Clerks · HR Specialists · Licensing Examiners / Inspectors · Management Analysts · Personal Financial Advisors · Purchasing Agents · Training & Development Specialists
  • Office / Admin Support: Court Clerks · Customer Service Reps · Data Entry Keyers · Medical Secretaries · Police / Fire Dispatchers · Secretaries & Admin Assistants
  • Computer / Mathematical: Biostatisticians · CS Researchers · Statisticians · User Support Specialists · Web Administrators
  • Architecture / Engineering: Civil Engineers · Mechanical Eng. Technicians · Mechanical Engineers · Petroleum Engineers
  • Management: Financial Managers · Health Services Managers · IT / IS Managers · Supply Chain Managers
  • Arts / Media: Producers · Reporters & Correspondents · Technical Writers
  • Other (Legal · Sales · Science · Edu.): Lawyers · Online Merchants · Securities Sales Agents · Soc. Sci. Research Assistants · Sociology Teachers (Postsec.) · Tech & Sci. Sales Reps
§ 04 — Methodology

From knowledge delivery to professional reasoning

What workers want delegated, held in one head, reasoned about with integrity, anchored in real-world data — the four principles behind JobBench.

i.

Human-desire grounded

JobBench is grounded in Workbank — a worker-centered survey in which 1,500+ professionals rated each O*NET task description in their occupation, indicating which ones they would want an AI agent to take over. Every benchmark task is designed around the work these experts most want delegated.

ii.

Professional reasoning, not knowledge delivery

Agents must hold messy, contradictory streams in one head and triangulate — databases, PDFs, regulations — then produce the reasoning chain behind the answer, not just the answer.

iii.

Fact-anchored rubrics

Every rubric resolves to binary criteria anchored to verifiable numbers, reasoning steps, or professional judgments. Credit is awarded only when the full reasoning chain is sound — no partial credit for surfacing the right fact via a wrong inference.
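The no-partial-credit rule can be sketched precisely. The class and field names below are hypothetical, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    """One binary rubric item: a fact check gated on its reasoning chain."""
    fact_correct: bool  # did the agent surface the right fact?
    chain_sound: list[bool] = field(default_factory=list)  # verdict per reasoning step

    def credit(self) -> int:
        # No partial credit: the fact must be right AND every step must be sound.
        return int(self.fact_correct and all(self.chain_sound))

# Hypothetical examples: a right fact reached via a wrong inference scores zero.
lucky_guess = Criterion(fact_correct=True, chain_sound=[True, False])
earned = Criterion(fact_correct=True, chain_sound=[True, True])
print(lucky_guess.credit(), earned.credit())  # prints "0 1"
```

The gate is deliberately conjunctive: one unsound step anywhere in the chain zeroes out the criterion.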

iv.

Heterogeneous real-world data

SQLite partitions, multi-year CSVs, regulatory PDFs, contradictory disclosures — the kind of pre-processing experts most want offloaded.
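To make the triangulation concrete, here is a hedged, stdlib-only sketch of the kind of cross-source check such a task demands. The table, column names, and sample values (a toy `samples` table, a `water_hazard_flag` column, numbers applied against the 15 ppb action level) are invented for illustration:

```python
import csv
import io
import sqlite3

# Hypothetical FOIA lead-sampling results delivered as a SQLite partition.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE samples (system TEXT, year INT, lead_ppb REAL)")
db.executemany("INSERT INTO samples VALUES (?, ?, ?)",
               [("Hartford North", 2023, 18.2), ("Hartford North", 2024, 16.4),
                ("Waterbury", 2024, 9.1)])

# Hypothetical state surveillance summary shipped as a multi-year CSV.
state_csv = io.StringIO("system,year,water_hazard_flag\n"
                        "Hartford North,2024,0\nWaterbury,2024,0\n")
state = {(row["system"], int(row["year"])): int(row["water_hazard_flag"])
         for row in csv.DictReader(state_csv)}

# Triangulate: systems over the 15 ppb action level that the state marks hazard-free.
ACTION_LEVEL_PPB = 15.0
contradictions = [
    (system, year, ppb)
    for system, year, ppb in db.execute("SELECT system, year, lead_ppb FROM samples")
    if ppb > ACTION_LEVEL_PPB and state.get((system, year)) == 0
]
print(contradictions)  # → [('Hartford North', 2024, 16.4)]
```

The point is not the query itself but the join across formats: the contradiction only surfaces once both sources are held in one place and keyed consistently.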

§ 05 — Inside a task

What the agent is actually up against

Every JobBench task is a small dossier — heterogeneous files, buried parameters, and pairs of sources that quietly contradict each other. Pick a role to see what one looks like from the inside.

Role

Reporter — Connecticut investigative desk

Automation desire
4.00/5
Lead in Connecticut drinking water. The state says zero water hazards. The FOIA data says otherwise.

Why desired: Investigative beat reporting is gated by source-verification time — PDFs, FOIA CSVs, and interview cross-checks eat the day.

O*NET: Check reference materials, such as books, news files, or public records, to obtain relevant facts.

6 sources · 4 types · 3 contradictions · 4 reasoning hops
Source flow
Heterogeneous inputs

  • FOIA_water_data: Multiple Hartford-area systems exceed the 15 ppb federal action level.
      conflicts with CT_2024_Surveillance_Report (FOIA exceedances vs. 0% home-hazard finding)
      conflicts with EPA_LCRI_Factsheet (rule finalized vs. current enforcement cycle)
  • CT_2024_Surveillance_Report: 0% of investigated homes identified water as a lead hazard.
      conflicts with FOIA_water_data (FOIA exceedances vs. 0% home-hazard finding)
  • CDC_2017_2022_Blood_Lead: CT rows only for 2017–2019; 2020–2022 are dagger-marked non-submissions.
      conflicts with martinez_interview (CDC n=1,666 vs. Martinez 30% clinic-specific)
  • EPA_LCRI_Factsheet: 10 ppb action level finalized Oct 2024 — not yet enforceable.
      conflicts with FOIA_water_data (rule finalized vs. current enforcement cycle)
  • martinez_interview: Pediatric referrals up 30% post-threshold change (Dr. Martinez).
      conflicts with CDC_2017_2022_Blood_Lead (CDC n=1,666 vs. Martinez 30% clinic-specific)
  • Waterbury 16.1 ppb vs. Newark 47.9 ppb — trajectory, not point-in-time.

Agent: reasoning over reporter sources
Deliverables
  • Thesis-driven pitch memo
  • 3-sheet data workbook
  • 15+ entry source verification log

Reasoning challenges by design
