OpenAI Launches GDPval to Measure AI on Real-World, Economically Valuable Tasks

OpenAI has introduced GDPval, a new evaluation benchmark designed to measure how well AI models perform on real-world, economically valuable tasks across 44 occupations.

Unlike traditional AI benchmarks that focus on academic-style exams or synthetic challenges, GDPval evaluates models on work products that professionals actually create—from legal briefs and engineering blueprints to customer support conversations and nursing care plans.

Why GDPval Matters

Previous benchmarks like MMLU (multi-subject exam-style questions), SWE-Bench (software bug fixing), and SWE-Lancer (freelance software projects) pushed AI forward but measured only narrow slices of knowledge work. GDPval goes further by pulling tasks directly from the industries that contribute most to U.S. GDP.

The first release includes:

  • 44 occupations across 9 major industries

  • 1,320 specialized tasks, with an open-sourced “gold” set of 220

  • Tasks designed and vetted by professionals averaging 14+ years of experience

Instead of a simple text prompt, each GDPval task comes with reference files and asks for a real-world deliverable such as a slide deck, spreadsheet, diagram, or report. This makes it a closer simulation of how AI could assist knowledge workers in practice.
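
To make that task format concrete, here is a minimal Python sketch of what a GDPval-style task record might look like. The field names and example values are illustrative assumptions, not OpenAI's published schema; the open-sourced gold set defines its own format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GDPvalTask:
    """One GDPval-style task: a realistic work request plus supporting files.

    Hypothetical schema for illustration only.
    """
    occupation: str          # e.g. "Registered Nurse"
    industry: str            # one of the 9 covered industries
    prompt: str              # the work request, written by a professional
    reference_files: List[str] = field(default_factory=list)  # supporting docs
    deliverable_format: str = "report"  # e.g. "slides", "spreadsheet", "diagram"

# A toy example in the spirit of the benchmark's nursing tasks.
task = GDPvalTask(
    occupation="Registered Nurse",
    industry="Health Care and Social Assistance",
    prompt="Draft a care plan for the patient described in the attached intake notes.",
    reference_files=["intake_notes.pdf", "medication_history.xlsx"],
    deliverable_format="report",
)

print(f"{task.occupation}: produce a {task.deliverable_format} "
      f"from {len(task.reference_files)} reference file(s)")
```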

What’s Next

GDPval is still an early step. Current evaluations are one-shot only, meaning they don’t yet capture iterative workflows where models refine outputs across drafts. OpenAI plans to expand GDPval toward interactive, context-rich tasks in future releases.
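
That distinction matters for how an evaluation harness is written. The sketch below contrasts the two modes; `model` and `critique` are hypothetical callables standing in for an LLM call and a reviewer, since the article does not describe OpenAI's actual harness.

```python
from typing import Callable, List

def run_one_shot(model: Callable[[str], str], prompt: str) -> str:
    """Current GDPval setup: a single prompt in, a single deliverable out."""
    return model(prompt)

def run_iterative(model: Callable[[str], str], prompt: str,
                  critique: Callable[[str], str], rounds: int = 3) -> List[str]:
    """Hypothetical multi-draft workflow GDPval does not yet measure:
    each draft is revised against feedback before the next attempt."""
    drafts = [model(prompt)]
    for _ in range(rounds - 1):
        feedback = critique(drafts[-1])
        drafts.append(model(
            f"{prompt}\n\nPrevious draft:\n{drafts[-1]}\n\nFeedback:\n{feedback}"
        ))
    return drafts
```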

By grounding evaluation in real-world economic value, GDPval offers a more practical lens for tracking AI progress—and a clearer signal of how AI can contribute to work that matters.

You can read the full paper or explore the benchmark at evals.openai.com.