JobBench: Aligning Agent Work with Human Desire

Measuring agents by GDP alone asks how much of a human's job can be taken away.

JobBench asks how much of that job can be given back — built on the work that experts across 35 real-world professions actually want delegated to AI.

2,066 fact-anchored criteria, scored only when the entire reasoning chain is sound.

Current leader: GPT-5.4 (OpenAI · via Codex CLI)
Weighted score: 37.2%

In collaboration with

University of Washington
UC Santa Barbara
Stanford University
Carnegie Mellon University
University of Notre Dame
IBM Research
BakeAI
Michigan State University
UC Berkeley
§ Why Human Desire

Economics alone is not enough.

The conversation about AI in the workplace has been framed almost entirely in economic terms: What fraction of working hours can agents absorb? How much of GDP is exposed to automation? Benchmarks like OpenAI's GDPval inherit this framing by design — they select tasks that represent economic value and score agents on whether they can deliver the professional knowledge output.

We believe this framing, on its own, is not enough.

If agents are going to share the professional workplace with humans, the question is not only what work is most economically valuable to automate, but what work do the humans in that role actually want automated? This is a humanist problem. It treats the professional not as labor to be displaced, but as a collaborator whose judgment about their own craft matters — and it is the premise JobBench is built on.

The economic question

GDPval

OpenAI

“What fraction of a human's job is economically valuable to automate?”

  • Task selection: by economic exposure
  • What it measures: knowledge delivery from clean inputs
  • The professional: labor to be displaced
The humanist question

JobBench

Ours

“What work do the humans in that role actually want automated?”

  • Task selection: from Workbank — 1,500+ workers' automation preferences
  • What it measures: professional reasoning across messy, contradictory sources
  • The professional: a craft to be enhanced, not replaced
§ 01 — Rankings

Model leaderboard

Overall weighted score across evaluated tasks, measured by rubric-based, fact-anchored assessment.

1 · GPT-5.4 · 37.2
2 · Claude Sonnet 4.6 · 36.3
3 · Claude Opus 4.6 · 35.4
4 · GPT-5.2 · 33.6
5 · GPT-5.3 Codex · 33.1
6 · Claude Opus 4.5 · 31.0
7 · Claude Sonnet 4.5 · 26.8
8 · GPT-5.1 Codex · 26.2
9 · GPT-5.2 Codex · 24.8
10 · Claude Opus 4 · 20.9
11 · Claude Sonnet 4 · 17.9
12 · Qwen 3.5 Plus · 17.6
13 · Claude Haiku 4.5 · 15.2
14 · MiniMax M2.5 · 14.2
15 · Gemini 3 Pro · 10.9
16 · Gemini 3 Flash · 10.8
17 · Kimi K2.5 · 8.6
18 · Grok 4.2 Fast · 4.2

Score = weighted rubric score across all evaluated tasks.
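To make that definition concrete, here is a minimal Python sketch of weighted rubric aggregation. The data shapes (a per-task weight plus a list of binary criterion verdicts) and the weighting scheme are illustrative assumptions, not the benchmark's actual harness:

```python
def task_score(criteria: list[bool]) -> float:
    """Fraction of binary rubric criteria passed for one task."""
    return sum(criteria) / len(criteria)

def weighted_score(tasks: list[tuple[float, list[bool]]]) -> float:
    """Weighted average of per-task rubric scores, as a percentage."""
    total_weight = sum(w for w, _ in tasks)
    return 100 * sum(w * task_score(c) for w, c in tasks) / total_weight

# Two hypothetical tasks: weight 2.0 with 3/4 criteria passed, weight 1.0 with 1/2.
print(weighted_score([(2.0, [True, True, True, False]), (1.0, [True, False])]))
```

With equal weights this reduces to a plain mean of per-task rubric scores; the published leaderboard numbers may weight tasks differently.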

§ 02 — Headroom

GDPval is saturating. JobBench isn’t.

Top-model score (GPT-5.4): GDPval (saturating) 83.0 · JobBench 37.2 (63 pts of headroom)

GDPval / JobBench, by model:
  • GPT-5.2 Codex: 70.9 / 24.8
  • GPT-5.3 Codex: 70.9 / 33.1
  • GPT-5.4: 83.0 / 37.2

Per-task workload, JobBench over GDPval:
  • Wall-clock per task: 1.64×
  • Tool calls per task: 1.33×
  • Trajectory lines: 1.30×
§ 03 — Breakdown

Per-profession heatmap

[Heatmap omitted: weighted score (%) per occupation × model for all 18 evaluated models. n = tasks per occupation · header value = model's overall score · scale buckets: 0–10% · 10–20% · 20–30% · 30–40% · 40%+.]

The 35 occupations, by group:

  • Business / Financial Ops: Bookkeeping & Accounting Clerks · HR Specialists · Licensing Examiners / Inspectors · Management Analysts · Personal Financial Advisors · Purchasing Agents · Training & Development Specialists
  • Office / Admin Support: Court Clerks · Customer Service Reps · Data Entry Keyers · Medical Secretaries · Police / Fire Dispatchers · Secretaries & Admin Assistants
  • Computer / Mathematical: Biostatisticians · CS Researchers · Statisticians · User Support Specialists · Web Administrators
  • Architecture / Engineering: Civil Engineers · Mechanical Eng. Technicians · Mechanical Engineers · Petroleum Engineers
  • Management: Financial Managers · Health Services Managers · IT / IS Managers · Supply Chain Managers
  • Arts / Media: Producers · Reporters & Correspondents · Technical Writers
  • Other (Legal · Sales · Science · Edu.): Lawyers · Online Merchants · Securities Sales Agents · Soc. Sci. Research Assistants · Sociology Teachers (Postsec.) · Tech & Sci. Sales Reps
§ 04 — Methodology

From knowledge delivery to professional reasoning

What workers want delegated, held in one head, reasoned about with integrity, anchored in real-world data — the four principles behind JobBench.

i.

Human-desire grounded

JobBench is grounded in Workbank — a worker-centered survey in which 1,500+ professionals rated each O*NET task description in their occupation, indicating which ones they would want an AI agent to take over. Every benchmark task is designed around the work these experts most want delegated.

ii.

Professional reasoning, not knowledge delivery

Agents must hold messy, contradictory streams in one head and triangulate — databases, PDFs, regulations — then produce the reasoning chain behind the answer, not just the answer.

iii.

Fact-anchored rubrics

Every rubric resolves to binary criteria anchored to verifiable numbers, reasoning steps, or professional judgments. Credit is awarded only when the full reasoning chain is sound — no partial credit for surfacing the right fact via a wrong inference.
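The no-partial-credit rule can be sketched precisely. The class and field names below are hypothetical, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    """One binary rubric item: a fact check gated on its reasoning chain."""
    fact_correct: bool  # did the agent surface the right fact?
    chain_sound: list[bool] = field(default_factory=list)  # verdict per reasoning step

    def credit(self) -> int:
        # No partial credit: the fact must be right AND every step must be sound.
        return int(self.fact_correct and all(self.chain_sound))

# Hypothetical examples: a right fact reached via a wrong inference scores zero.
lucky_guess = Criterion(fact_correct=True, chain_sound=[True, False])
earned = Criterion(fact_correct=True, chain_sound=[True, True])
print(lucky_guess.credit(), earned.credit())  # prints "0 1"
```

The gate is deliberately conjunctive: one unsound step anywhere in the chain zeroes out the criterion.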

iv.

Heterogeneous real-world data

SQLite partitions, multi-year CSVs, regulatory PDFs, contradictory disclosures — the kind of pre-processing experts most want offloaded.
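To make the triangulation concrete, here is a hedged, stdlib-only sketch of the kind of cross-source check such a task demands. The table, column names, and sample values (a toy `samples` table, a `water_hazard_flag` column, numbers applied against the 15 ppb action level) are invented for illustration:

```python
import csv
import io
import sqlite3

# Hypothetical FOIA lead-sampling results delivered as a SQLite partition.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE samples (system TEXT, year INT, lead_ppb REAL)")
db.executemany("INSERT INTO samples VALUES (?, ?, ?)",
               [("Hartford North", 2023, 18.2), ("Hartford North", 2024, 16.4),
                ("Waterbury", 2024, 9.1)])

# Hypothetical state surveillance summary shipped as a multi-year CSV.
state_csv = io.StringIO("system,year,water_hazard_flag\n"
                        "Hartford North,2024,0\nWaterbury,2024,0\n")
state = {(row["system"], int(row["year"])): int(row["water_hazard_flag"])
         for row in csv.DictReader(state_csv)}

# Triangulate: systems over the 15 ppb action level that the state marks hazard-free.
ACTION_LEVEL_PPB = 15.0
contradictions = [
    (system, year, ppb)
    for system, year, ppb in db.execute("SELECT system, year, lead_ppb FROM samples")
    if ppb > ACTION_LEVEL_PPB and state.get((system, year)) == 0
]
print(contradictions)  # → [('Hartford North', 2024, 16.4)]
```

The point is not the query itself but the join across formats: the contradiction only surfaces once both sources are held in one place and keyed consistently.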

§ 05 — Inside a task

What the agent is actually up against

Every JobBench task is a small dossier — heterogeneous files, buried parameters, and pairs of sources that quietly contradict each other. Pick a role to see what one looks like from the inside.

Role

Reporter — Connecticut investigative desk

Automation desire
4.00/5
Lead in Connecticut drinking water. The state says zero water hazards. The FOIA data says otherwise.

Why desired: Investigative beat reporting is gated by source-verification time — PDFs, FOIA CSVs, and interview cross-checks eat the day.

O*NET: Check reference materials, such as books, news files, or public records, to obtain relevant facts.

6 sources · 4 types · 3 contradictions · 4 reasoning hops
Source flow
Heterogeneous inputs

  • FOIA_water_data: Multiple Hartford-area systems exceed the 15 ppb federal action level.
      conflicts with CT_2024_Surveillance_Report (FOIA exceedances vs. 0% home-hazard finding)
      conflicts with EPA_LCRI_Factsheet (rule finalized vs. current enforcement cycle)
  • CT_2024_Surveillance_Report: 0% of investigated homes identified water as a lead hazard.
      conflicts with FOIA_water_data (FOIA exceedances vs. 0% home-hazard finding)
  • CDC_2017_2022_Blood_Lead: CT rows only for 2017–2019; 2020–2022 are dagger-marked non-submissions.
      conflicts with martinez_interview (CDC n=1,666 vs. Martinez 30% clinic-specific)
  • EPA_LCRI_Factsheet: 10 ppb action level finalized Oct 2024 — not yet enforceable.
      conflicts with FOIA_water_data (rule finalized vs. current enforcement cycle)
  • martinez_interview: Pediatric referrals up 30% post-threshold change (Dr. Martinez).
      conflicts with CDC_2017_2022_Blood_Lead (CDC n=1,666 vs. Martinez 30% clinic-specific)
  • Waterbury 16.1 ppb vs. Newark 47.9 ppb — trajectory, not point-in-time.

Agent: reasoning over reporter sources
Deliverables
  • Thesis-driven pitch memo
  • 3-sheet data workbook
  • 15+ entry source verification log

Reasoning challenges by design
