JobBench: Aligning Agent Work with Human Desire
Measuring agents by GDP alone asks how much of a human's job can be taken away. Measuring agents by human desire asks how much of that job can be given back.
The conversation about AI in the workplace has been framed almost entirely in economic terms: What fraction of working hours can agents absorb? How much of GDP is exposed to automation? Benchmarks like OpenAI's GDPval inherit this framing by design: they select tasks that represent economic value, and score agents on whether they can deliver the professional knowledge output.
We believe this framing, on its own, is not enough.
If agents are going to share the professional workplace with humans, the question is not only what work is most economically valuable to automate, but what work do the humans in that role actually want automated? This is a humanist problem: it treats the professional not as labor to be displaced, but as a collaborator whose judgment about their own craft matters.
Ask a lawyer, a reporter, a biostatistician what they want an agent to take off their plate, and the answer is remarkably consistent. What they want offloaded is the tedious, high-volume, error-prone pre-processing that stands between them and the work they actually value: reconciling contradictory data sources, cross-referencing claims against raw records, pulling facts out of messy document dumps, tracing citations through cases.
JobBench is a benchmark built on that principle. Every one of its 60+ tasks across 30+ professions (lawyers, reporters, biostatisticians, civil engineers, financial managers, petroleum engineers, court clerks, supply chain managers, and more) is constructed from the work experts in that field say they most want a capable agent to handle.
We design tasks on top of Workbank, a worker-centered survey in which more than 1,500 workers, for each O*NET task summary of their own occupation, indicate whether they would want an AI agent to take that work over. From Workbank we selected the 30+ occupations at the intersection of high average worker desire for automation and high economic impact. Within each occupation, we sampled the task summaries rated highest in automation desire and, through annotators, experts, and AI assistance, developed them into full benchmark evaluations.
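The selection logic above can be sketched in a few lines. This is an illustrative reconstruction, not JobBench's actual pipeline: the field names (`occupation`, `summary`, `desire`), the economic-impact proxy, and the thresholds are all assumptions made for the example.

```python
# Hypothetical sketch of the Workbank-based occupation/task selection.
# Schema and thresholds are illustrative, not JobBench's real ones.

def select_tasks(survey, econ_impact, desire_cut=0.7, impact_cut=1e9, top_k=2):
    """Keep occupations with high mean automation desire AND high economic
    impact, then take each occupation's top-k most-desired task summaries."""
    by_occ = {}
    for rec in survey:  # one record per (worker rating, task summary)
        by_occ.setdefault(rec["occupation"], []).append(rec)

    selected = {}
    for occ, recs in by_occ.items():
        mean_desire = sum(r["desire"] for r in recs) / len(recs)
        if mean_desire >= desire_cut and econ_impact.get(occ, 0) >= impact_cut:
            # Rank this occupation's tasks by automation desire, keep top-k
            ranked = sorted(recs, key=lambda r: r["desire"], reverse=True)
            selected[occ] = [r["summary"] for r in ranked[:top_k]]
    return selected
```

Both filters are conjunctive: an occupation that workers badly want automated but that carries little economic weight is dropped, and vice versa, mirroring the "intersection" framing in the text.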
From knowledge delivery to professional reasoning
GDPval tasks test whether an agent can deliver a knowledge product from clean inputs: a legal memo from a statute summary, a news report from a press release. JobBench tests whether it can hold messy, contradictory information streams in one head and analyze them with professional reasoning integrity: bringing in domain knowledge, triangulating across multiple heterogeneous sources, and following thin clues out to external documents. The deliverable is not only the memo or the article, but the reasoning chain behind it, the part that experienced professionals spend most of their hours on and most want to hand off.
JobBench aligns better with human desire.
Evaluation that rewards reasoning integrity
JobBench grades against 2,066 binary evaluation criteria, an average of 32 per task. Every criterion is anchored to deterministic numbers, specific reasoning steps, or documented professional judgments. Credit is awarded only when the entire reasoning chain behind an answer is sound: there is no partial credit for surfacing the right fact via a wrong inference. This mirrors how senior professionals actually evaluate junior work, and it exposes a reasoning gap that inclusion-based knowledge checklists conceal.
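The all-or-nothing grading rule can be made concrete with a small sketch. The real JobBench grader interface is not described in this post; the function names and the criterion representation below are invented for illustration.

```python
# Hypothetical illustration of chain-gated binary criteria.
# A criterion is (answer_correct, [soundness of each reasoning step]).

def grade_criterion(answer_correct, chain_steps):
    """Score 1 only when the final answer is right AND every step in the
    supporting reasoning chain is sound; otherwise 0 (no partial credit)."""
    return int(answer_correct and all(chain_steps))

def grade_task(criteria):
    """Task score = fraction of criteria fully satisfied."""
    points = [grade_criterion(ans, steps) for ans, steps in criteria]
    return sum(points) / len(points)
```

The key property is the second case below: a correct fact reached through a flawed inference earns nothing, which is exactly the gap an inclusion-based checklist (which would score it as a hit) cannot see.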
Enhancement, not replacement
The better future of human–AI collaboration is not one in which agents simply replace human professionals. It is one in which agents take on the tedious, high-volume, error-prone work that experts have long wanted to be free of (the data reconciliation, the cross-referencing, the fact-checking against contradictory sources) so that humans can spend more of their time on what their training and judgment are uniquely for: deciding, advocating, creating, and caring.