scan-v5.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

scan-v5.json (22568B)
{
  "scan_version": 5,
  "paper_type": "benchmark-creation",
  "paper": {
    "title": "GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks",
    "authors": [
      "Tejal Patwardhan",
      "Rachel Dias",
      "Elizabeth Proehl",
      "Grace Kim",
      "Michele Wang",
      "Olivia Watkins",
      "Simón Posada Fishman",
      "Marwan Aljubeh",
      "Phoebe Thacker",
      "Laurance Fauconnet",
      "Natalie S. Kim",
      "Patrick Chao",
      "Samuel Miserendino",
      "Gildas Chabot",
      "David Li",
      "Michael Sharman",
      "Alexandra Barr",
      "Amelia Glaese",
      "Jerry Tworek"
    ],
    "year": 2025,
    "venue": "arXiv",
    "arxiv_id": "2510.04374",
    "doi": "10.48550/arXiv.2510.04374"
  },
  "checklist": {
    "claims_and_evidence": {
      "abstract_claims_supported": {
        "applies": true,
        "answer": true,
        "justification": "Core abstract claims—task coverage, expert credentials, linear improvement over time (Fig 6), models approaching expert parity (47.6% win rate), speed/cost savings (Table 2), and effects of reasoning/context/scaffolding—are each backed by specific results in the paper, though some are bounded to OpenAI models only.",
        "source": "haiku"
      },
      "causal_claims_justified": {
        "applies": true,
        "answer": false,
        "justification": "Ablation experiments on reasoning effort and prompt tuning provide reasonable support for those specific causal claims, but the headline claim that 'frontier model performance is improving roughly linearly over time' is a time-series observation of OpenAI model releases with no controls for confounds like task selection or expert pool changes.",
        "source": "haiku"
      },
      "generalization_bounded": {
        "applies": true,
        "answer": false,
        "justification": "The abstract frames findings about 'frontier models' broadly, but the linear improvement time series (Fig 6) covers only OpenAI models; the 'approaching industry experts' claim is based on 5 tasks per occupation in a 220-task subset, and the cost savings analysis excludes Claude, Gemini, and Grok entirely.",
        "source": "haiku"
      },
      "alternative_explanations_discussed": {
        "applies": true,
        "answer": false,
        "justification": "The paper does not discuss alternative explanations for key findings—e.g., that graders' ability to stylistically identify models (acknowledged in footnote 2) could systematically bias comparisons, or that OpenAI model improvement may reflect task exposure rather than genuine capability gains.",
        "source": "haiku"
      },
      "proxy_outcome_distinction": {
        "applies": true,
        "answer": false,
        "justification": "The paper derives specific dollar cost savings from pairwise win rates without adequately distinguishing between benchmark win rate in controlled conditions and actual productivity in real workflows; assumptions embedded in the Try-nx model (e.g., constant win rate across resamples) are acknowledged only in a footnote.",
        "source": "haiku"
      }
    },
    "limitations_and_scope": {
      "limitations_section_present": {
        "applies": true,
        "answer": true,
        "justification": "Section 5 is a dedicated Limitations section with five distinct subsections covering dataset size, knowledge work focus, task specification style, grader performance, and cost—well beyond a single concluding sentence.",
        "source": "haiku"
      },
      "threats_to_validity_specific": {
        "applies": true,
        "answer": true,
        "justification": "Specific threats are named: only 30 tasks per occupation limits subgroup analysis; self-reported task completion times may be under- or over-estimated; automated grader shows self-preference bias toward capable OpenAI models (Section A.6.2); expert graders could infer model identity from stylistic features (footnote 2).",
        "source": "haiku"
      },
      "scope_boundaries_stated": {
        "applies": true,
        "answer": true,
        "justification": "The paper explicitly states what GDPval does NOT cover: manual labor, physical tasks, tasks requiring extensive tacit knowledge, PII access, proprietary software, or communication between individuals, and calls the current version 'a limited, initial cut of knowledge work tasks.'",
        "source": "haiku"
      }
    },
    "conflicts_of_interest": {
      "funding_disclosed": {
        "applies": true,
        "answer": false,
        "justification": "No funding disclosure statement appears in the paper; it is presumably funded by OpenAI as an internal research project, but this is never explicitly stated.",
        "source": "haiku"
      },
      "affiliations_disclosed": {
        "applies": true,
        "answer": true,
        "justification": "All authors list OpenAI as their affiliation on the title page, clearly disclosed.",
        "source": "haiku"
      },
      "funder_independent_of_outcome": {
        "applies": true,
        "answer": false,
        "justification": "OpenAI employees created the benchmark, evaluated OpenAI models (GPT-4o through GPT-5), and the public automated grader is GPT-5-high—creating direct non-independence between funder and outcome; Section A.6.2 confirms the grader shows lower agreement with humans specifically when grading capable OpenAI models.",
        "source": "haiku"
      },
      "financial_interests_declared": {
        "applies": true,
        "answer": false,
        "justification": "No competing interests statement, patent disclosures, or financial interests declaration appears anywhere in the paper.",
        "source": "haiku"
      }
    },
    "scope_and_framing": {
      "key_terms_defined": {
        "applies": true,
        "answer": true,
        "justification": "'Economically valuable tasks' is operationalized via O*NET occupational data and GDP sector selection; 'win rate' is formally defined in Section A.6.1; 'digital occupation' is defined with a 60% digital-task threshold explained in Section A.7; 'frontier model' is implicitly defined by the specific models evaluated.",
        "source": "haiku"
      },
      "intended_contribution_clear": {
        "applies": true,
        "answer": true,
        "justification": "The conclusion explicitly enumerates five contributions: a new evaluation dataset, capability benchmarking analysis, experiments on reasoning/scaffolding, open-sourcing of 220 tasks, and release of an automated grader at evals.openai.com.",
        "source": "haiku"
      },
      "engagement_with_prior_work": {
        "applies": true,
        "answer": true,
        "justification": "The introduction explicitly contrasts GDPval with academic-difficulty benchmarks (MMLU, GPQA, HLE), domain-specific evals (SWE-Lancer), and lagging economic impact studies (Tamkin et al., Chatterji et al.), explaining how GDPval fills gaps in breadth, realism, and multi-modality.",
        "source": "haiku"
      }
    }
  },
  "type_checklist": {
    "benchmark-creation": {
      "construct_design": {
        "construct_validity_argued": {
          "applies": true,
          "answer": true,
          "justification": "The paper argues construct validity systematically: occupation selection is tied to GDP contribution and O*NET digital task classification, tasks are mapped to O*NET work activities for representativeness, and the digital-task measure is validated against Acemoglu & Autor (2011)'s established task-content framework.",
          "source": "haiku"
        },
        "difficulty_distribution_characterized": {
          "applies": true,
          "answer": true,
          "justification": "Tables 3 and 4 report difficulty distribution statistics (mean 3.32/3.20, std 0.95/0.92, range 1–5) based on expert self-ratings, and task time-to-complete is also distributed (median 5 hrs, 75th percentile 10 hrs, max 100 hrs for gold subset).",
          "source": "haiku"
        },
        "ceiling_floor_effects_checked": {
          "applies": true,
          "answer": false,
          "justification": "Ceiling/floor effects are not formally checked; GPT-4o's 12.5% win rate suggests potential floor effects for weaker models, but the paper cites the win-rate metric's theoretical non-saturation as sufficient rather than empirically verifying discrimination across the full ability range.",
          "source": "haiku"
        },
        "human_baseline_included": {
          "applies": true,
          "answer": true,
          "justification": "All model evaluations are conducted as pairwise comparisons against human expert deliverables, making the human expert the explicit and primary baseline throughout; Section 2.5 describes the blinded expert pairwise comparison methodology in detail.",
          "source": "haiku"
        },
        "scoring_rubric_justified": {
          "applies": true,
          "answer": true,
          "justification": "The paper justifies pairwise expert comparison as the primary metric because tasks involve subjective quality dimensions (aesthetics, style, relevance) that numeric rubrics cannot capture; expert graders are required to provide written justifications enabling the failure clustering analysis in Section 3.3.",
          "source": "haiku"
        }
      },
      "robustness": {
        "contamination_resistance_designed": {
          "applies": true,
          "answer": false,
          "justification": "There is no discussion of whether tasks or analogous work products appeared in frontier model training data; no temporal splits, canary strings, or anti-contamination measures are described, despite the benchmark being used to evaluate models likely trained on 2025 web data containing professional work outputs.",
          "source": "haiku"
        },
        "temporal_robustness_discussed": {
          "applies": true,
          "answer": true,
          "justification": "The paper explicitly frames GDPval as a 'first version' with planned expansions, and the win-rate metric is designed to remain useful as models improve by replacing the human baseline with stronger models or other reference points over time.",
          "source": "haiku"
        },
        "failure_modes_discussed": {
          "applies": true,
          "answer": true,
          "justification": "Failure modes of the benchmark itself are discussed: over-specified tasks (Section 5), coverage limited to digital knowledge work, grader self-preference bias for capable OpenAI models (A.6.2), model stylistic identifiability compromising blinding (footnote 2), and automated grader failures for specific task types (A.6.3).",
          "source": "haiku"
        },
        "baseline_implementations_provided": {
          "applies": true,
          "answer": true,
          "justification": "The 220 gold subset tasks are open-sourced with a public automated grader at evals.openai.com; model evaluation configurations are documented including tools enabled, sampling parameters, and the full list of pre-installed packages (Section A.6.4).",
          "source": "haiku"
        }
      },
      "documentation": {
        "dataset_documentation_complete": {
          "applies": true,
          "answer": true,
          "justification": "Documentation is extensive: expert recruitment criteria (min 4 years, background checks, training/quiz), task creation methodology, three-stage quality control pipeline with minimum 5 reviews, O*NET coverage statistics (Tables 5–7), and detailed task statistics covering difficulty, time, cost, and file types.",
          "source": "haiku"
        },
        "licensing_and_access_clear": {
          "applies": true,
          "answer": false,
          "justification": "The paper states tasks are 'open-sourced' at evals.openai.com but specifies no license terms; Section A.1.3 notes third-party brand references and privacy scrubbing but does not state under what terms researchers may use, redistribute, or build on the dataset.",
          "source": "haiku"
        },
        "intended_use_specified": {
          "applies": true,
          "answer": false,
          "justification": "While limitations are discussed and the automated grader is labeled 'experimental,' the paper does not explicitly state what conclusions should or should not be drawn from GDPval results, nor does it caution against treating benchmark win rates as direct proxies for labor market impact or deployment productivity.",
          "source": "haiku"
        }
      }
    }
  },
  "claims": [
    {
      "claim": "Frontier model performance on GDPval is improving roughly linearly over time",
      "evidence": "Figure 6 shows OpenAI model win rates rising from GPT-4o (12.5%) through o4-mini (29.1%), o3 (35.2%), to GPT-5 (39.0%) in a roughly linear progression",
      "supported": "weak"
    },
    {
      "claim": "The current best frontier models are approaching industry experts in deliverable quality",
      "evidence": "47.6% of Claude Opus 4.1 deliverables were graded as better than or equal to the human expert deliverable on the 220-task gold subset",
      "supported": "moderate"
    },
    {
      "claim": "Increased reasoning effort predictably improves model performance on GDPval",
      "evidence": "Controlled experiment varying o3 and GPT-5 reasoning effort at low/medium/high levels shows monotonically increasing win rates (Figure 9a)",
      "supported": "strong"
    },
    {
      "claim": "Prompt tuning and scaffolding improvements significantly increase GPT-5 performance",
      "evidence": "Prompting eliminated black-square PDF artifacts (previously >50%), reduced PPTX formatting errors from 86% to 64%, and increased win rates by 5 percentage points; modal use of multimodal inspection rose from 15% to 97%",
      "supported": "strong"
    },
    {
      "claim": "Incorporating frontier AI models with human oversight can save time and money compared to unaided experts",
      "evidence": "Under the 'Try nx' scenario, GPT-5 provides 1.39x speed improvement and 1.63x cost improvement over unaided experts; GPT-4o does NOT save time (0.46x) under the same scenario (Table 2)",
      "supported": "moderate"
    },
    {
      "claim": "The GPT-5-based automated grader achieves agreement with human experts within 5% of human inter-rater agreement",
      "evidence": "Section A.6.2 reports human-automated grader agreement of 65.7% vs. human inter-rater agreement of 70.8% across three grader sweeps, with 95% confidence intervals from bootstrapping",
      "supported": "strong"
    },
    {
      "claim": "Claude Opus 4.1 excels at aesthetic/visual tasks while GPT-5 high excels at accuracy/instruction-following",
      "evidence": "Expert failure clustering in Section 3.3 and Figure 8 shows Claude and Grok most often lose to instruction-following failures while GPT-5 loses mainly to formatting errors; Figure 12 shows Claude leads on all non-text file types while GPT-5 leads on pure text",
      "supported": "moderate"
    }
  ],
  "methodology_tags": [
    "benchmark-eval",
    "observational"
  ],
  "key_findings": "GDPval is a new benchmark of 1,320 tasks across 44 occupations in 9 US GDP sectors created by industry professionals averaging 14 years of experience; the open-sourced gold subset contains 220 tasks. The best frontier model tested (Claude Opus 4.1) achieves a 47.6% win-or-tie rate against human experts, interpreted as 'approaching parity.' OpenAI model win rates have increased roughly linearly from GPT-4o (12.5%) to GPT-5 (39.0%). Reasoning effort, increased task context, and prompt engineering each independently improve performance, with the most capable models losing primarily due to instruction-following failures rather than factual errors.",
  "red_flags": [
    {
      "flag": "Self-evaluation by benchmark creator",
      "detail": "OpenAI created the benchmark, evaluated its own models (GPT-4o through GPT-5), and the public automated grader is GPT-5-high—the same company whose models are being evaluated controls the primary automated scoring infrastructure."
    },
    {
      "flag": "Compromised blinding",
      "detail": "Expert graders could identify model outputs by stylistic features: OpenAI models used em dashes, Claude used first-person phrasing, Grok occasionally identified itself. The paper acknowledges this in footnote 2 but did not alter style to preserve blinding, undermining the validity of head-to-head comparisons."
    },
    {
      "flag": "Linear improvement claim limited to OpenAI models",
      "detail": "The abstract claims 'frontier model performance on GDPval is improving roughly linearly over time,' but Fig 6 is explicitly titled 'Performance of OpenAI frontier models'—the time-series data covers only one lab's model progression."
    },
    {
      "flag": "Cost analysis excludes non-OpenAI models",
      "detail": "Table 2 cost/speed analysis covers only OpenAI models; the paper explicitly states 'We were not able to obtain cost estimates for Claude, Gemini, and Grok,' making the economic impact analysis structurally incomplete for cross-lab comparisons."
    },
    {
      "flag": "No contamination analysis",
      "detail": "Tasks were created in 2025 from real professional work products; frontier models trained on 2025 web data could have been exposed to analogous tasks, reference materials, or expert outputs. No contamination analysis is performed."
    },
    {
      "flag": "Automated grader self-preference bias",
      "detail": "Section A.6.2 confirms the GPT-5-based automated grader shows systematically lower agreement with human experts when grading capable OpenAI models, consistent with self-preference bias (Panickssery et al. 2024)—yet this grader is the sole public evaluation interface."
    },
    {
      "flag": "Thin gold subset for occupation-level inference",
      "detail": "Only 5 tasks per occupation in the open-sourced set makes occupation-level win rates (Figs 11, 12) statistically unreliable; win rate differences between occupations are presented without confidence intervals."
    }
  ],
  "cited_papers": [
    {
      "title": "GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models (Eloundou et al., 2023)",
      "relevance": "Foundational occupational AI exposure framework that GDPval's digital-task classification methodology directly builds upon"
    },
    {
      "title": "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? (Miserendino et al., 2025)",
      "relevance": "Domain-specific real-world task benchmark GDPval explicitly generalizes beyond with multi-sector coverage"
    },
    {
      "title": "Clio: Privacy-Preserving Insights into Real-World AI Use (Tamkin et al., 2024)",
      "relevance": "Production AI usage analysis used to inform which occupational areas have emerging but uneven model adoption in GDPval scope"
    },
    {
      "title": "Skills, Tasks and Technologies: Implications for Employment and Earnings (Acemoglu & Autor, 2011)",
      "relevance": "Task-content framework used to validate GDPval's digital-task classification against established economic measures of cognitive vs. manual work"
    },
    {
      "title": "LLM Evaluators Recognize and Favor Their Own Generations (Panickssery et al., 2024)",
      "relevance": "Directly cited to explain the automated grader self-preference bias observed in GDPval's Section A.6.2"
    },
    {
      "title": "Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020)",
      "relevance": "Canonical academic-style benchmark GDPval contrasts against to justify its realism-over-difficulty design philosophy"
    },
    {
      "title": "Generative AI at Work (Brynjolfsson, Li & Raymond, 2025)",
      "relevance": "Empirical RCT on AI productivity gains cited as context for the economic impact GDPval aims to anticipate ahead of widespread adoption"
    },
    {
      "title": "Humanity's Last Exam (Phan et al., 2025)",
      "relevance": "Hard academic benchmark GDPval explicitly distinguishes from by targeting economically representative rather than difficulty-maximizing tasks"
    }
  ],
  "engagement_factors": {
    "practical_relevance": {
      "score": 3,
      "justification": "220 tasks open-sourced with public automated grader at evals.openai.com; cost/time savings analysis gives practitioners direct guidance on AI-assisted workflow ROI."
    },
    "surprise_contrarian": {
      "score": 1,
      "justification": "The core finding that frontier models are 'approaching' but not yet matching human experts is broadly expected; no finding materially challenges prevailing views."
    },
    "fear_safety": {
      "score": 2,
      "justification": "Directly quantifies economic value at risk ($391 average task value across $3T in annual compensation) and models labor displacement scenarios, making AI labor market threat concrete and quantified."
    },
    "drama_conflict": {
      "score": 2,
      "justification": "Cross-model horse race (GPT-5 vs Claude Opus 4.1 vs Gemini 2.5 Pro vs Grok 4) creates competitive comparison; OpenAI self-evaluation using its own models as the automated grader invites methodological controversy."
    },
    "demo_ability": {
      "score": 3,
      "justification": "Public automated grader at evals.openai.com plus open-sourced 220 tasks make this immediately runnable by anyone wanting to evaluate a new model."
    },
    "brand_recognition": {
      "score": 3,
      "justification": "OpenAI paper with GPT-5 results plus direct comparisons against Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4—maximum brand recognition across all major frontier model providers."
    }
  },
  "hn_data": {
    "threads": [
      {
        "hn_id": "33314496",
        "title": "A study of malicious CVE proof of concept exploits in GitHub",
        "points": 3,
        "comments": 0,
        "url": "https://news.ycombinator.com/item?id=33314496",
        "created_at": "2022-10-24T09:30:54Z"
      },
      {
        "hn_id": "45836230",
        "title": "The Distribution of Earth-Impacting Interstellar Objects",
        "points": 1,
        "comments": 0,
        "url": "https://news.ycombinator.com/item?id=45836230",
        "created_at": "2025-11-06T15:21:24Z"
      }
    ],
    "top_points": 3,
    "total_points": 4,
    "total_comments": 0
  }
}
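
The checklist objects above share one shape: nested groups whose leaves carry "applies" and "answer" booleans. A minimal Python sketch of how such a scan could be tallied follows. The helper name tally_answers and the abridged sample object are illustrative assumptions, not part of the repository; the sample reuses the four "conflicts_of_interest" answers from the file.

```python
import json
from collections import Counter


def tally_answers(node, counts=None):
    """Recursively count checklist leaves by their "applies"/"answer" fields.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        if "applies" in node and "answer" in node:
            # Leaf: a single checklist question.
            if not node["applies"]:
                counts["n/a"] += 1
            elif node["answer"]:
                counts["yes"] += 1
            else:
                counts["no"] += 1
        else:
            # Group: descend into each sub-section.
            for child in node.values():
                tally_answers(child, counts)
    return counts


# Abridged stand-in with the same shape as the "checklist" object
# (justification and source fields omitted).
sample = json.loads("""
{
  "conflicts_of_interest": {
    "funding_disclosed": {"applies": true, "answer": false},
    "affiliations_disclosed": {"applies": true, "answer": true},
    "funder_independent_of_outcome": {"applies": true, "answer": false},
    "financial_interests_declared": {"applies": true, "answer": false}
  }
}
""")

counts = tally_answers(sample)
print(dict(counts))  # {'no': 3, 'yes': 1}
```

Against the real file, the same function could be applied to json.load(...)["checklist"] or ["type_checklist"], assuming the file on disk is plain JSON.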
	git clone https://git.shiptheloop.com/ai-research-survey.git