JobBench: Aligning Agent Work with Human Desire
Measuring agents by GDP alone asks how much of a human's job can be taken away.
JobBench asks how much of that job can be given back — built on the work that experts across 35 real-world professions actually want delegated to AI.
2,066 fact-anchored criteria, scored only when the entire reasoning chain is sound.
Economics alone is not enough.
The conversation about AI in the workplace has been framed almost entirely in economic terms: What fraction of working hours can agents absorb? How much of GDP is exposed to automation? Benchmarks like OpenAI's GDPval inherit this framing by design: they select tasks that represent economic value and score agents on whether they can deliver the corresponding professional knowledge output.
We believe this framing, on its own, is not enough.
If agents are going to share the professional workplace with humans, the question is not only what work is most economically valuable to automate, but also what work the humans in that role actually want automated. This is a humanist problem. It treats the professional not as labor to be displaced, but as a collaborator whose judgment about their own craft matters, and it is the premise JobBench is built on.
GDPval
OpenAI: “What fraction of a human's job is economically valuable to automate?”
- Task selection: By economic exposure
- What it measures: Knowledge delivery from clean inputs
- The professional: Labor to be displaced
JobBench
Ours: “What work do the humans in that role actually want automated?”
- Task selection: From Workbank (1,500+ workers' automation preferences)
- What it measures: Professional reasoning across messy, contradictory sources
- The professional: A craft to be enhanced, not replaced
Model leaderboard
Score = overall weighted rubric score across all evaluated tasks, measured by rubric-based, fact-anchored assessment.
GDPval is saturating. JobBench isn’t.
Per-profession heatmap
| Occupation | n | GPT-5.4 (37.2) | Sonnet 4.6 (36.3) | Opus 4.6 (35.4) | GPT-5.2 (33.6) | GPT-5.3 Codex (33.1) | Opus 4.5 (31.0) | Sonnet 4.5 (26.8) | GPT-5.1 Codex (26.2) | GPT-5.2 Codex (24.8) | Opus 4 (20.9) | Sonnet 4 (17.9) | Qwen3.5 Plus (17.6) | Haiku 4.5 (15.2) | MiniMax M2.5 (14.2) | Gemini 3 Pro (10.9) | Gemini 3 Flash (10.8) | Kimi K2.5 (8.6) | Grok 4.2 Fast (4.2) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Business / Financial Ops | |||||||||||||||||||
| Bookkeeping & Accounting Clerks | 2 | 19 | 23 | 51 | 17 | 43 | 13 | 0 | 19 | 17 | 14 | 4 | 4 | 9 | 4 | 14 | 9 | 0 | 0 |
| HR Specialists | 1 | 56 | 31 | 47 | 88 | 34 | 19 | 19 | 19 | 41 | 0 | 9 | 0 | 0 | 0 | 0 | 0 | 9 | 0 |
| Licensing Examiners / Inspectors | 1 | 50 | 33 | 33 | 17 | 17 | 42 | 33 | 17 | 8 | 33 | 33 | 33 | 17 | 33 | 17 | 25 | 42 | 0 |
| Management Analysts | 3 | 26 | 30 | 18 | 27 | 24 | 13 | 16 | 6 | 0 | 13 | 0 | 0 | 0 | 3 | 10 | 3 | 3 | 0 |
| Personal Financial Advisors | 1 | 33 | 41 | 8 | 23 | 36 | 18 | 21 | 10 | 10 | 31 | 10 | 0 | 8 | 0 | 23 | 10 | 0 | 0 |
| Purchasing Agents | 3 | 25 | 43 | 47 | 24 | 34 | 39 | 27 | 21 | 18 | 33 | 16 | 16 | 18 | 8 | 7 | 11 | 2 | 2 |
| Training & Development Specialists | 3 | 38 | 41 | 34 | 20 | 30 | 42 | 30 | 16 | 30 | 36 | 30 | 22 | 18 | 18 | 16 | 14 | 0 | 4 |
| Office / Admin Support | |||||||||||||||||||
| Court Clerks | 1 | 37 | 32 | 37 | 45 | 37 | 47 | 0 | 24 | 21 | 11 | 13 | 11 | 0 | 0 | 0 | 13 | 0 | 0 |
| Customer Service Reps | 1 | 21 | 50 | 29 | 29 | 16 | 50 | 8 | 16 | 29 | 8 | 8 | 16 | 0 | 21 | 0 | 21 | 16 | 0 |
| Data Entry Keyers | 2 | 59 | 66 | 55 | 58 | 61 | 54 | 39 | 47 | 51 | 20 | 36 | 28 | 26 | 32 | 22 | 17 | 7 | 9 |
| Medical Secretaries | 1 | 51 | 23 | 41 | 38 | 15 | 15 | 8 | 15 | 41 | 8 | 0 | 15 | 15 | 8 | 8 | 8 | 8 | 0 |
| Police / Fire Dispatchers | 1 | 36 | 47 | 36 | 36 | 36 | 15 | 47 | 47 | 26 | 15 | 57 | 47 | 19 | 30 | 11 | 11 | 15 | 0 |
| Secretaries & Admin Assistants | 2 | 72 | 30 | 46 | 46 | 48 | 37 | 20 | 20 | 11 | 20 | 41 | 22 | 30 | 6 | 0 | 11 | 20 | 6 |
| Computer / Mathematical | |||||||||||||||||||
| Biostatisticians | 2 | 29 | 25 | 12 | 20 | 46 | 18 | 37 | 57 | 28 | 12 | 28 | 22 | 25 | 28 | 15 | 11 | 11 | 9 |
| CS Researchers | 2 | 16 | 38 | 19 | 11 | 22 | 12 | 20 | 8 | 9 | 8 | 14 | 15 | 14 | 4 | 0 | 0 | 4 | 11 |
| Statisticians | 3 | 36 | 18 | 44 | 36 | 34 | 37 | 36 | 26 | 22 | 30 | 15 | 14 | 14 | 14 | 14 | 8 | 7 | 4 |
| User Support Specialists | 2 | 39 | 36 | 57 | 48 | 33 | 45 | 38 | 28 | 32 | 19 | 29 | 39 | 22 | 26 | 12 | 12 | 25 | 0 |
| Web Administrators | 1 | 52 | 48 | 36 | 24 | 24 | 24 | 40 | 12 | 12 | 24 | 12 | 12 | 12 | 12 | 12 | 24 | 12 | 0 |
| Architecture / Engineering | |||||||||||||||||||
| Civil Engineers | 3 | 53 | 55 | 52 | 51 | 35 | 43 | 36 | 49 | 42 | 30 | 18 | 22 | 26 | 24 | 18 | 25 | 3 | 6 |
| Mechanical Eng. Technicians | 3 | 24 | 32 | 20 | 20 | 19 | 29 | 27 | 25 | 15 | 5 | 12 | 15 | 12 | 9 | 14 | 6 | 15 | 3 |
| Mechanical Engineers | 1 | 36 | 27 | 52 | 27 | 0 | 52 | 18 | 18 | 9 | 0 | 0 | 0 | 0 | 0 | 9 | 9 | 9 | 0 |
| Petroleum Engineers | 1 | 12 | 28 | 36 | 0 | 16 | 12 | 28 | 32 | 20 | 12 | 0 | 12 | 12 | 20 | 0 | 12 | 0 | 0 |
| Management | |||||||||||||||||||
| Financial Managers | 2 | 14 | 59 | 44 | 24 | 33 | 14 | 26 | 24 | 32 | 9 | 18 | 10 | 18 | 4 | 9 | 15 | 0 | 4 |
| Health Services Managers | 2 | 20 | 33 | 20 | 26 | 8 | 19 | 20 | 8 | 8 | 19 | 14 | 14 | 8 | 8 | 14 | 14 | 8 | 4 |
| IT / IS Managers | 2 | 41 | 17 | 36 | 49 | 27 | 24 | 12 | 17 | 15 | 17 | 15 | 10 | 10 | 15 | 8 | 12 | 0 | 0 |
| Supply Chain Managers | 2 | 17 | 12 | 17 | 6 | 12 | 12 | 12 | 0 | 6 | 17 | 0 | 6 | 0 | 0 | 6 | 0 | 6 | 6 |
| Arts / Media | |||||||||||||||||||
| Producers | 1 | 53 | 64 | 42 | 64 | 53 | 39 | 39 | 72 | 64 | 28 | 31 | 31 | 22 | 22 | 8 | 0 | 0 | 14 |
| Reporters & Correspondents | 1 | 47 | 20 | 20 | 37 | 47 | 33 | 23 | 33 | 20 | 43 | 10 | 10 | 13 | 0 | 10 | 0 | 0 | 10 |
| Technical Writers | 3 | 55 | 64 | 50 | 45 | 49 | 45 | 54 | 37 | 34 | 35 | 42 | 41 | 41 | 30 | 12 | 11 | 27 | 9 |
| Other (Legal · Sales · Science · Edu.) | |||||||||||||||||||
| Lawyers | 1 | 50 | 25 | 38 | 25 | 25 | 25 | 50 | 25 | 38 | 0 | 13 | 25 | 13 | 13 | 0 | 0 | 25 | 0 |
| Online Merchants | 2 | 63 | 32 | 45 | 43 | 43 | 30 | 21 | 59 | 30 | 38 | 20 | 30 | 14 | 25 | 14 | 20 | 14 | 9 |
| Securities Sales Agents | 1 | 35 | 35 | 14 | 27 | 59 | 27 | 27 | 0 | 41 | 24 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Soc. Sci. Research Assistants | 3 | 58 | 52 | 47 | 59 | 59 | 55 | 37 | 49 | 52 | 27 | 29 | 20 | 24 | 29 | 18 | 20 | 14 | 13 |
| Sociology Teachers (Postsec.) | 3 | 57 | 29 | 28 | 36 | 36 | 45 | 34 | 33 | 41 | 35 | 11 | 17 | 14 | 21 | 17 | 15 | 14 | 7 |
| Tech & Sci. Sales Reps | 2 | 12 | 10 | 13 | 30 | 18 | 11 | 6 | 11 | 10 | 13 | 6 | 19 | 10 | 6 | 3 | 0 | 0 | 0 |
From knowledge delivery to professional reasoning
What workers want delegated, held in one head, reasoned about with integrity — the four principles behind JobBench.
Human-desire grounded
JobBench is grounded in Workbank — a worker-centered survey in which 1,500+ professionals rated each O*NET task description in their occupation, indicating which ones they would want an AI agent to take over. Every benchmark task is designed around the work these experts most want delegated.
Professional reasoning, not knowledge delivery
Agents must hold messy, contradictory streams of evidence (databases, PDFs, regulations) in one head, triangulate across them, and then produce the reasoning chain behind the answer, not just the answer.
Fact-anchored rubrics
Every rubric resolves to binary criteria anchored to verifiable numbers, reasoning steps, or professional judgments. Credit is awarded only when the full reasoning chain is sound — no partial credit for surfacing the right fact via a wrong inference.
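The all-or-nothing rule above can be sketched in a few lines. This is a minimal illustration, not JobBench's actual scorer: the `Criterion` schema, the per-criterion weights, and the pooled 0-100 normalization are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One binary rubric criterion (hypothetical schema, not JobBench's)."""
    description: str
    weight: float        # assumed per-criterion weight
    fact_correct: bool   # the anchored number or judgment matches
    chain_sound: bool    # every reasoning step leading to it is valid


def criterion_credit(c: Criterion) -> float:
    # Binary credit: a right fact reached through a wrong inference earns nothing.
    return c.weight if (c.fact_correct and c.chain_sound) else 0.0


def weighted_score(criteria: list[Criterion]) -> float:
    """Return a 0-100 weighted score over a pool of criteria."""
    total = sum(c.weight for c in criteria)
    if total == 0:
        return 0.0
    return 100.0 * sum(criterion_credit(c) for c in criteria) / total


rubric = [
    Criterion("reports the anchored figure", 2.0, True, True),
    Criterion("right figure, wrong inference", 1.0, True, False),
    Criterion("flags the contradictory source", 1.0, False, False),
]
print(weighted_score(rubric))  # 50.0
```

In this toy rubric the second criterion surfaces the correct fact but fails the chain check, so it contributes zero, and only the first criterion (weight 2 of 4) counts.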
Heterogeneous real-world data
SQLite partitions, multi-year CSVs, regulatory PDFs, contradictory disclosures — the kind of pre-processing experts most want offloaded.
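To make the pre-processing burden concrete, here is a small sketch of cross-checking two sources of the kind listed above: a SQLite table and a CSV disclosure. The table name, columns, and values are toy data invented for illustration and do not come from any JobBench task.

```python
import csv
import io
import sqlite3

# Toy stand-ins for two sources in a task dossier (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filings (entity TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO filings VALUES (?, ?, ?)",
    [("A", 2021, 100.0), ("B", 2021, 250.0), ("C", 2021, 75.0)],
)

csv_text = "entity,year,amount\nA,2021,100.0\nB,2021,205.0\nD,2021,40.0\n"


def find_conflicts(conn, csv_text, tolerance=0.01):
    """Compare the two sources key by key and report every disagreement."""
    db = {
        (entity, year): amount
        for entity, year, amount in conn.execute(
            "SELECT entity, year, amount FROM filings"
        )
    }
    disclosed = {
        (row["entity"], int(row["year"])): float(row["amount"])
        for row in csv.DictReader(io.StringIO(csv_text))
    }
    conflicts = []
    # Walk the union of keys so records present in only one source are caught.
    for key in sorted(db.keys() | disclosed.keys()):
        a, b = db.get(key), disclosed.get(key)
        if a is None or b is None:
            conflicts.append((key, a, b, "missing in one source"))
        elif abs(a - b) > tolerance:
            conflicts.append((key, a, b, "values disagree"))
    return conflicts


for conflict in find_conflicts(conn, csv_text):
    print(conflict)
```

The sketch only surfaces the disagreements. Deciding which source to trust, and why, is the professional reasoning step that the rubric is meant to assess.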
What the agent is actually up against
Every JobBench task is a small dossier — heterogeneous files, buried parameters, and pairs of sources that quietly contradict each other. Pick a role to see what one looks like from the inside.
Reporter — Connecticut investigative desk
Why desired: Investigative beat reporting is gated by source-verification time. PDFs, FOIA CSVs, and interview cross-checks eat the day.
O*NET: Check reference materials, such as books, news files, or public records, to obtain relevant facts.
Multiple Hartford-area systems exceed the 15 ppb federal action level.
0% of investigated homes identified water as a lead hazard.
CT rows only for 2017–2019; 2020–2022 are dagger-marked non-submissions.
10 ppb action level finalized Oct 2024 — not yet enforceable.
Pediatric referrals up 30% post-threshold change (Dr. Martinez).
Waterbury 16.1 ppb vs. Newark 47.9 ppb — trajectory, not point-in-time.
- Thesis-driven pitch memo
- 3-sheet data workbook
- 15+ entry source verification log