Blog · Apr 17, 2026 · JobBench Team

JobBench: Aligning Agent Work with Human Desire

Measuring agents by GDP alone asks how much of a human's job can be taken away. Measuring agents by human desire asks how much of that job can be given back.

agent_01 says
"Let me reconcile. You decide."

The conversation about AI in the workplace has been framed almost entirely in economic terms: What fraction of working hours can agents absorb? How much of GDP is exposed to automation? Benchmarks like OpenAI's GDPval inherit this framing by design, selecting tasks that represent economic value and scoring agents on whether they can deliver the finished professional knowledge output.

We believe this framing, on its own, is not enough.

If agents are going to share the professional workplace with humans, the question is not only which work is most economically valuable to automate, but which work the humans in that role actually want automated. This is a humanist problem: it treats the professional not as labor to be displaced, but as a collaborator whose judgment about their own craft matters.

What work do the humans in that role actually want automated?

Ask a lawyer, a reporter, or a biostatistician what they want an agent to take off their plate, and the answer is remarkably consistent. What they want offloaded is the tedious, high-volume, error-prone pre-processing that stands between them and the work they actually value: reconciling contradictory data sources, cross-referencing claims against raw records, pulling facts out of messy document dumps, tracing citations through cases.

JobBench is a benchmark built on that principle. Every one of its 60+ tasks across 30+ professions (lawyers, reporters, biostatisticians, civil engineers, financial managers, petroleum engineers, court clerks, supply chain managers, and more) is constructed from the work experts in that field say they most want a capable agent to handle.

We designed the tasks on top of Workbank, a worker-centered survey in which more than 1,500 workers indicated, for each O*NET task summary of their own occupation, whether they would want an AI agent to take that work over. From Workbank we selected the 30+ occupations at the intersection of high average worker desire for automation and high economic impact. Within each occupation, we sampled the task summaries rated highest in automation desire and, with annotators, domain experts, and AI assistance, developed them into full benchmark evaluations.
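
To make that selection step concrete, here is a minimal sketch of the filtering logic in Python, assuming a Workbank-style table of per-task desire ratings and per-occupation economic-impact scores. The field names, thresholds, and example rows are illustrative assumptions, not the actual Workbank schema or cutoffs.

```python
from collections import defaultdict

# Illustrative records only: each row pairs an O*NET task summary with the
# average worker desire for automation and the occupation's economic impact.
workbank_rows = [
    {"occupation": "Lawyers", "task": "Study statutes and analyze likely outcomes",
     "desire": 0.82, "economic_impact": 0.91},
    {"occupation": "Reporters", "task": "Check reference materials to obtain relevant facts",
     "desire": 0.78, "economic_impact": 0.74},
    # ... more survey rows ...
]

DESIRE_CUTOFF = 0.70      # "high average worker desire" (illustrative threshold)
IMPACT_CUTOFF = 0.70      # "high economic impact" (illustrative threshold)
TASKS_PER_OCCUPATION = 2  # how many top-rated task summaries to sample

# Group task summaries by occupation.
by_occupation = defaultdict(list)
for row in workbank_rows:
    by_occupation[row["occupation"]].append(row)

selected = {}
for occupation, rows in by_occupation.items():
    avg_desire = sum(r["desire"] for r in rows) / len(rows)
    avg_impact = sum(r["economic_impact"] for r in rows) / len(rows)
    # Keep occupations at the intersection of high desire and high impact.
    if avg_desire < DESIRE_CUTOFF or avg_impact < IMPACT_CUTOFF:
        continue
    # Within the occupation, take the task summaries rated highest in desire;
    # these seeds are then developed into full tasks by annotators and experts.
    top_tasks = sorted(rows, key=lambda r: r["desire"], reverse=True)
    selected[occupation] = [r["task"] for r in top_tasks[:TASKS_PER_OCCUPATION]]

print(selected)
```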

From knowledge delivery to professional reasoning

GDPval tasks test whether an agent can produce a knowledge deliverable from clean inputs: a legal memo from a statute summary, a news report from a press release. JobBench tests whether it can hold messy, contradictory information streams in one head and analyze them with professional reasoning integrity: bringing in domain knowledge, triangulating across multiple heterogeneous sources, and following thin clues out to external documents. The deliverable is not only the memo or the article but the reasoning chain behind it, the part that experienced professionals spend most of their hours on and most want to hand off.

Case study

JobBench aligns better with human desire.

Reporters · Journalists
Real expert-reported desire: fact-checking. Check reference materials (books, news files, public records) to obtain relevant facts.
GDPval: article edit. Edit a story from a source packet and return one publishable article.
JobBench: cross-year evidence synthesis. Cross-reference water-quality CSVs, EPA guidance, and surveillance data across years; verify threshold exceedances, identify high-risk communities, and assemble a multi-part editorial package.
Why JobBench aligns: it matches the real reporting burden by requiring cross-dataset verification before publication; GDPval captures only article editing after the source packet has already been assembled.

Technical Sales Reps
Real expert-reported desire: proposal explanation. Prepare sales presentations or proposals that explain product specifications or applications.
GDPval: quote revision. Revise a quotation from pricing and freight references.
JobBench: bid-response package assembly. Integrate an RFQ, site survey, internal pricing, product catalog, and competitor quote; verify certifications and build a compliance matrix.
Why JobBench aligns: it matches the real pre-sale burden by requiring proposal assembly across specs, pricing, compliance, and competitor context; GDPval captures only isolated quote revision.

Lawyers
Real expert-reported desire: statute study and outcome analysis. Study statutes, regulations, and ordinances, and analyze likely case outcomes using legal precedents.
GDPval: closed-world memo. Draft a memo from a self-contained fact pattern.
JobBench: settlement evaluation and takings analysis. Query a multi-table STR property database for per-property fine exposure and lost-income projections, apply Penn Central and Hignell-Stark to the Town's offer, and draft a comparative ordinance table plus a case-law-grounded counter-proposal.
Why JobBench aligns: it matches the real legal-preparation burden by requiring quantitative fine-exposure calculations, precedent application (Penn Central, Hignell-Stark), and an actionable counter-proposal; GDPval captures only closed-world memo analysis from a self-contained fact pattern.

Evaluation that rewards reasoning integrity

JobBench scores each task against binary evaluation criteria: 2,066 in total, an average of 32 per task. Every criterion is anchored to deterministic numbers, specific reasoning steps, or documented professional judgments. Credit is awarded only if the entire reasoning chain behind an answer is sound; there is no partial credit for surfacing the right fact via a wrong inference. This mirrors how senior professionals actually evaluate junior work, and it exposes a reasoning gap that inclusion-based knowledge checklists conceal.
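
As a sketch of that scoring rule (not the actual JobBench harness; the criterion structure and names below are assumptions for illustration), a criterion earns its point only when every reasoning step it depends on has also been verified:

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    """One binary check, gated on the reasoning steps that must support it."""
    name: str
    satisfied: bool                                  # did the deliverable state the right answer?
    depends_on: list = field(default_factory=list)   # names of the reasoning-step criteria it relies on

def earns_credit(criterion, verified):
    """Award the point only if the criterion and its whole reasoning chain hold."""
    if not criterion.satisfied:
        return False
    return all(earns_credit(verified[name], verified) for name in criterion.depends_on)

# Illustrative example: a correct-looking fine-exposure total only counts if the
# database query and the aggregation step behind it were each done correctly.
steps = {
    "queried_str_database": Criterion("queried_str_database", satisfied=True),
    "summed_per_property_fines": Criterion("summed_per_property_fines", satisfied=False),
}
final = Criterion("reported_total_fine_exposure", satisfied=True,
                  depends_on=["queried_str_database", "summed_per_property_fines"])
verified = {**steps, final.name: final}

score = sum(earns_credit(c, verified) for c in verified.values())
print(score, "of", len(verified), "criteria earned credit")  # the final number gets no partial credit
```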

Enhancement, not replacement

The better future of human–AI collaboration is not the one in which agents simply replace human professionals. It is one in which agents take on the tedious, high-volume, error-prone work that experts have long wanted to be free of (the data reconciliation, the cross-referencing, the fact-checking against contradictory sources) so that humans can spend more of their time on what their training and judgment are uniquely for: deciding, advocating, creating, and caring.