scan-v4.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

      1 {
      2   "scan_version": 4,
      3   "paper_type": "empirical",
      4   "paper": {
      5     "title": "Generative AI at Work",
      6     "authors": [
      7       "Erik Brynjolfsson",
      8       "Danielle Li",
      9       "Lindsey Raymond"
     10     ],
     11     "year": 2023,
     12     "venue": "Social Science Research Network",
     13     "arxiv_id": "2304.11771",
     14     "doi": "10.3386/w31161"
     15   },
     16   "checklist": {
     17     "claims_and_evidence": {
     18       "abstract_claims_supported": {
     19         "applies": true,
     20         "answer": true,
     21         "justification": "All abstract claims are supported: 15% productivity increase (Table 2, Col 3: 0.301/1.97 = 15.2%), heterogeneity by skill (Figure 3), worker learning from outage analysis (Figure 6), gains largest for rare problems (Figure 7), customer sentiment improvement and reduced escalation (Figure 10, Table 4).",
     22         "source": "opus"
     23       },
     24       "causal_claims_justified": {
     25         "applies": true,
     26         "answer": true,
     27         "justification": "The paper uses difference-in-differences with staggered rollout, includes a small RCT pilot (Section 4.1.1, Appendix Table A.1), instruments individual adoption with team-level adoption timing (Appendix Table A.2), tests parallel trends via event studies, and uses multiple robust DiD estimators. The causal identification strategy is well-justified for the main productivity claims.",
     28         "source": "opus"
     29       },
     30       "generalization_bounded": {
     31         "applies": true,
     32         "answer": true,
     33         "justification": "The paper explicitly bounds generalization: 'these findings apply for a particular AI tool, used in a single firm, within a single occupation, and should not be generalized across all occupations and AI systems' (Section 7). It also notes inability to observe wages, overall labor demand, or skill composition changes.",
     34         "source": "opus"
     35       },
     36       "alternative_explanations_discussed": {
     37         "applies": true,
     38         "answer": true,
     39         "justification": "Multiple alternative explanations are discussed: mean reversion (Section 4.2.1, Figure A.9), selection into treatment (addressed via agent FE and IV in Appendix Table A.2), selection on adherence vs. causal effect of following recommendations (Section 5.1), Hawthorne effects (implicitly via outage analysis), and confounds in attrition estimates (Section 6.3).",
     40         "source": "opus"
     41       },
     42       "proxy_outcome_distinction": {
     43         "applies": true,
     44         "answer": true,
     45         "justification": "The paper carefully distinguishes its proxy measures from broader outcomes. RPH is presented as a specific industry metric, not a general productivity measure. The paper explicitly discusses what it cannot observe: 'we do not have access to pay data,' 'our paper is not designed to shed light on the aggregate employment or wage effects,' and discusses equilibrium effects it cannot measure (Section 7).",
     46         "source": "opus"
     47       }
     48     },
     49     "limitations_and_scope": {
     50       "limitations_section_present": {
     51         "applies": true,
     52         "answer": true,
     53         "justification": "The Conclusion (Section 7) contains extensive limitations discussion spanning multiple paragraphs covering single-firm generalization, equilibrium effects, wage data absence, potential ratchet effects on performance targets, and longer-run incentive challenges.",
     54         "source": "opus"
     55       },
     56       "threats_to_validity_specific": {
     57         "applies": true,
     58         "answer": true,
     59         "justification": "Specific threats are discussed throughout: manager selection bias in onboarding (addressed via IV, Section 4.1.2), mean reversion concern (Section 4.2.1, Figure A.9), inability to distinguish voluntary from involuntary attrition (Appendix A.3.5), lack of agent fixed effects for attrition (Section 6.3), and potential over-reliance on AI by top performers degrading future model quality.",
     60         "source": "opus"
     61       },
     62       "scope_boundaries_stated": {
     63         "applies": true,
     64         "answer": true,
     65         "justification": "Explicitly stated: 'our paper is not designed to shed light on the aggregate employment or wage effects' (Section 1), 'these findings apply for a particular AI tool, used in a single firm, within a single occupation' (Section 7), 'we do not have access to pay data' and cannot observe 'longer run equilibrium responses in worker demand or job design' (Section 7).",
     66         "source": "opus"
     67       }
     68     },
     69     "conflicts_of_interest": {
     70       "funding_disclosed": {
     71         "applies": true,
     72         "answer": true,
     73         "justification": "Funding is disclosed in the footnote: 'We thank... the Stanford Digital Economy Lab for funding. The content is solely the responsibility of the authors and does not necessarily represent the official views of Stanford University, MIT, or the NBER.'",
     74         "source": "opus"
     75       },
     76       "affiliations_disclosed": {
     77         "applies": true,
     78         "answer": true,
     79         "justification": "Author affiliations are clearly stated: Brynjolfsson (Stanford & NBER), Li (MIT & NBER), Raymond (MIT). The paper refers to the AI firm and data firm anonymously but the authors are academic researchers, not employees of either firm.",
     80         "source": "opus"
     81       },
     82       "funder_independent_of_outcome": {
     83         "applies": true,
     84         "answer": true,
     85         "justification": "The Stanford Digital Economy Lab is an academic research center with no apparent financial stake in whether AI tools increase or decrease productivity. The funder appears independent of the outcome.",
     86         "source": "opus"
     87       },
     88       "financial_interests_declared": {
     89         "applies": true,
     90         "answer": false,
     91         "justification": "No competing interests or financial interests statement is included in the paper. While the authors appear to be independent academics, the absence of an explicit declaration is scored NO under the schema criteria.",
     92         "source": "opus"
     93       }
     94     },
     95     "scope_and_framing": {
     96       "key_terms_defined": {
     97         "applies": true,
     98         "answer": true,
     99         "justification": "Footnote 1 defines AI, ML, LLMs, and generative AI precisely; 'productivity' is operationalized through explicit metrics (RPH, AHT, CPH, resolution rate); 'adherence' is defined quantitatively in Section 5.1; the 'skill index' construction is detailed in Appendix A.2.",
    100         "source": "haiku"
    101       },
    102       "intended_contribution_clear": {
    103         "applies": true,
    104         "answer": true,
    105         "justification": "The paper identifies itself as 'the first study of the impact of generative AI deployed at scale in the workplace,' with three explicit sets of findings on productivity, mechanisms, and worker experience of work.",
    106         "source": "haiku"
    107       },
    108       "engagement_with_prior_work": {
    109         "applies": true,
    110         "answer": true,
    111         "justification": "The introduction engages substantively with the skill-biased technical change literature, prior AI adoption studies, and contemporaneous lab studies (Peng et al., Noy & Zhang, Dell'Acqua et al., Choi & Schwarcz), explicitly situating how this paper differs in real-world scope and duration.",
    112         "source": "haiku"
    113       }
    114     }
    115   },
    116   "type_checklist": {
    117     "empirical": {
    118       "artifacts": {
    119         "code_released": {
    120           "applies": true,
    121           "answer": false,
    122           "justification": "No code repository or analysis scripts are provided or linked in the paper.",
    123           "source": "opus"
    124         },
    125         "data_released": {
    126           "applies": true,
    127           "answer": false,
    128           "justification": "The data comes from a Fortune 500 firm's proprietary customer service records. No data is released; this is understandable for confidentiality but still a NO.",
    129           "source": "opus"
    130         },
    131         "environment_specified": {
    132           "applies": true,
    133           "answer": false,
    134           "justification": "No environment specifications, software versions, or computational setup details are provided for the econometric analysis.",
    135           "source": "opus"
    136         },
    137         "reproduction_instructions": {
    138           "applies": true,
    139           "answer": false,
    140           "justification": "No step-by-step reproduction instructions are included. The econometric specifications are described (Equations 1-7, Appendix A.3) but no runnable code or scripts are provided.",
    141           "source": "opus"
    142         }
    143       },
    144       "statistical_methodology": {
    145         "confidence_intervals_or_error_bars": {
    146           "applies": true,
    147           "answer": true,
    148           "justification": "95% confidence intervals are shown on all event study figures (Figures 2, 3, 6, 8, 10, etc.) and robust standard errors are reported in parentheses in all regression tables.",
    149           "source": "opus"
    150         },
    151         "significance_tests": {
    152           "applies": true,
    153           "answer": true,
    154           "justification": "Statistical significance is reported throughout with asterisks (*** p<0.01, ** p<0.05, * p<0.10) and clustered robust standard errors in all tables (Tables 2-4, A.1-A.11).",
    155           "source": "opus"
    156         },
    157         "effect_sizes_reported": {
    158           "applies": true,
    159           "answer": true,
    160           "justification": "Effect sizes are reported with baseline context throughout: e.g., '0.30 chats or 15.2%' off pre-treatment mean of 1.97 RPH (Table 2), '3.7 minute decrease... an 8.5% decline from the baseline mean of 43 minutes' (Table 3), customer sentiment improves by 0.18 points 'equivalent to half of a standard deviation' (Section 6.1).",
    161           "source": "opus"
    162         },
    163         "sample_size_justified": {
    164           "applies": true,
    165           "answer": false,
    166           "justification": "The sample size of 5,172 agents and 3 million chats is large and naturally determined by the firm's workforce, but no power analysis or formal sample size justification is provided.",
    167           "source": "opus"
    168         },
    169         "variance_reported": {
    170           "applies": true,
    171           "answer": true,
    172           "justification": "Standard deviations are reported in summary statistics (Table 1, e.g., 'St. Average Handle Time 23-24 min'). Robust standard errors clustered at agent level are reported for all regressions. Multiple estimators are compared (Appendix Table A.9, Figure A.4).",
    173           "source": "opus"
    174         }
    175       },
    176       "evaluation_design": {
    177         "baselines_included": {
    178           "applies": true,
    179           "answer": true,
    180           "justification": "The difference-in-differences design uses pre-treatment observations and never-treated agents as baselines. The paper also compares against multiple alternative DiD estimators (Sun-Abraham, Callaway-Sant'Anna, Borusyak et al., de Chaisemartin-D'Haultfœuille).",
    181           "source": "opus"
    182         },
    183         "baselines_contemporary": {
    184           "applies": true,
    185           "answer": true,
    186           "justification": "The robust DiD estimators used (Sun and Abraham 2021, Borusyak et al. 2022, Callaway and Sant'Anna 2021) represent the state of the art in causal inference for staggered adoption designs.",
    187           "source": "opus"
    188         },
    189         "ablation_study": {
    190           "applies": true,
    191           "answer": true,
    192           "justification": "The paper systematically decomposes productivity into components (AHT, CPH, resolution rate, NPS in Table 3), examines heterogeneity by skill quintile (Figure 3A), tenure (Figure 3B), adherence (Figure 5B), topic frequency (Figure 7), and examines outage periods to isolate learning from real-time assistance (Figure 6).",
    193           "source": "opus"
    194         },
    195         "multiple_metrics": {
    196           "applies": true,
    197           "answer": true,
    198           "justification": "Five main outcome metrics: resolutions per hour, average handle time, chats per hour, resolution rate, and net promoter score (Tables 2-3). Additional metrics include customer sentiment, agent sentiment, manager escalation rate, attrition, language fluency, and textual similarity.",
    199           "source": "opus"
    200         },
    201         "human_evaluation": {
    202           "applies": true,
    203           "answer": true,
    204           "justification": "Human evaluation is used to validate LLM-generated topic classifications (3 independent human evaluators on 100 conversations, Appendix A.2.6) and language fluency scores (2 independent human reviewers on 100 conversations, Appendix A.2.5).",
    205           "source": "opus"
    206         },
    207         "held_out_test_set": {
    208           "applies": false,
    209           "answer": false,
    210           "justification": "This is an observational/quasi-experimental study, not a prediction task. The concept of held-out test sets does not apply to DiD causal inference designs.",
    211           "source": "opus"
    212         },
    213         "per_category_breakdown": {
    214           "applies": true,
    215           "answer": true,
    216           "justification": "Extensive breakdowns by worker skill quintile (Figure 3A, A.6), tenure group (Figure 3B, A.7), adherence quintile (Figure 5B, A.12), topic frequency (Figure 7), agent location (Figure 8C-D), and adoption cohort (Figure A.10).",
    217           "source": "opus"
    218         },
    219         "failure_cases_discussed": {
    220           "applies": true,
    221           "answer": true,
    222           "justification": "The paper identifies where AI assistance fails or has negative effects: highest-skilled workers see 'small declines in quality' (Section 4.2.1, Figure A.6 Panels C-D showing negative effects on resolution rate and NPS for Q5 workers). The paper also discusses potential over-reliance by top performers.",
    223           "source": "opus"
    224         },
    225         "negative_results_reported": {
    226           "applies": true,
    227           "answer": true,
    228           "justification": "Several negative results: no significant impact on customer satisfaction (NPS) overall (Table 3 Col 4); negative quality effects for highest-skilled workers (Panels C-D of Figure A.6); top performers' over-reliance on AI may reduce future model quality (Section 5.1). The paper also notes that AI recommendations may 'distract top performers' (Section 4.2.1).",
    229           "source": "opus"
    230         }
    231       },
    232       "setup_transparency": {
    233         "model_versions_specified": {
    234           "applies": true,
    235           "answer": false,
    236           "justification": "The paper says the tool 'is built on a recent version of the Generative Pre-trained Transformer (GPT) family of large language models developed by OpenAI' (Section 2.2) but does not specify which GPT version (GPT-3, GPT-3.5, etc.) or any version identifier.",
    237           "source": "opus"
    238         },
    239         "prompts_provided": {
    240           "applies": false,
    241           "answer": false,
    242           "justification": "The AI tool is a deployed commercial product that generates real-time suggestions; the authors are studying its workplace impact, not designing or controlling prompts. The prompts/fine-tuning are internal to the AI firm.",
    243           "source": "opus"
    244         },
    245         "hyperparameters_reported": {
    246           "applies": true,
    247           "answer": false,
    248           "justification": "No hyperparameters for the AI model (temperature, sampling settings) or fine-tuning process are reported. The econometric specifications are well-documented but the AI system's technical parameters are not.",
    249           "source": "opus"
    250         },
    251         "scaffolding_described": {
    252           "applies": false,
    253           "answer": false,
    254           "justification": "The paper evaluates a third-party commercial AI tool as deployed. The authors describe the tool's two main outputs (suggested responses and documentation links, Section 2.3, Appendix Figure A.1) but cannot be expected to describe internal scaffolding they have no access to.",
    255           "source": "opus"
    256         },
    257         "data_preprocessing_documented": {
    258           "applies": true,
    259           "answer": true,
    260           "justification": "Appendix A.1 describes sample construction in detail: starting with 3,006,395 chats, dropping single-message chats, merging across databases via chat identifiers, winsorizing call duration at 99th percentile, and aggregating to agent-month level. Variable construction is detailed in Appendix A.2.",
    261           "source": "opus"
    262         }
    263       },
    264       "data_integrity": {
    265         "raw_data_available": {
    266           "applies": true,
    267           "answer": false,
    268           "justification": "Raw data is proprietary (Fortune 500 firm's customer service records) and not available for independent verification. Understandable given confidentiality constraints, but still NO.",
    269           "source": "opus"
    270         },
    271         "data_collection_described": {
    272           "applies": true,
    273           "answer": true,
    274           "justification": "Data collection is described in Appendix A.1: chat conversations from the firm's software systems (September 2019–June 2021), merged with internal company datasets and AI firm records. Agent information includes employer, location, manager/team, tenure, and AI onboarding date.",
    275           "source": "opus"
    276         },
    277         "recruitment_methods_described": {
    278           "applies": true,
    279           "answer": true,
    280           "justification": "Agent selection for AI treatment is described: managers scheduled onboarding to minimize customer disruption, training sessions were limited by AI firm capacity, contractual license limits applied, and replacement occurred when AI-enabled agents left (Section 3.1). Agent employment details are in Section 2.2 and Appendix A.2.2.",
    281           "source": "opus"
    282         },
    283         "data_pipeline_documented": {
    284           "applies": true,
    285           "answer": true,
    286           "justification": "Appendix A.1 documents the pipeline: starting sample (3M chats, 5,172 agents), merging steps using chat identifiers, dropping criteria (single-message chats, missing times/identifiers), winsorization (99th percentile), and aggregation to agent-month level. Topic classification (Appendix A.2.6) and sentiment scoring (A.2.4) pipelines are also documented.",
    287           "source": "opus"
    288         }
    289       },
    290       "contamination": {
    291         "training_cutoff_stated": {
    292           "applies": false,
    293           "answer": false,
    294           "justification": "This study examines the impact of a deployed AI tool on worker productivity. It does not evaluate a pre-trained model's capability on any benchmark.",
    295           "source": "opus"
    296         },
    297         "train_test_overlap_discussed": {
    298           "applies": false,
    299           "answer": false,
    300           "justification": "Not a benchmark evaluation study. The paper studies workplace outcomes, not model capability on test sets.",
    301           "source": "opus"
    302         },
    303         "benchmark_contamination_addressed": {
    304           "applies": false,
    305           "answer": false,
    306           "justification": "Not a benchmark evaluation study.",
    307           "source": "opus"
    308         }
    309       },
    310       "human_studies": {
    311         "pre_registered": {
    312           "applies": true,
    313           "answer": false,
    314           "justification": "No pre-registration is mentioned for either the main observational study or the small RCT pilot. No link to OSF, AsPredicted, AEA registry, or similar.",
    315           "source": "opus"
    316         },
    317         "irb_or_ethics_approval": {
    318           "applies": true,
    319           "answer": false,
    320           "justification": "No IRB or ethics board approval is mentioned anywhere in the paper, despite studying 5,172 workers whose chat data and performance records were analyzed.",
    321           "source": "opus"
    322         },
    323         "demographics_reported": {
    324           "applies": true,
    325           "answer": true,
    326           "justification": "Demographics are reported: 89% of agents are outside the US, mainly in the Philippines (Section 2.2, Table 1). Agent tenure distribution, employer type (direct vs. subcontractor), and geographic distribution across 25 locations are provided. Detailed breakdown in Table 1.",
    327           "source": "opus"
    328         },
    329         "inclusion_exclusion_criteria": {
    330           "applies": true,
    331           "answer": true,
    332           "justification": "Inclusion criteria are described: agents providing chat-based technical support for US-based small businesses at the data firm and its subcontractors. Exclusions: chats with only one message, missing start/end times, missing identifiers (Appendix A.1). Agent sample construction is documented.",
    333           "source": "opus"
    334         },
    335         "randomization_described": {
    336           "applies": true,
    337           "answer": true,
    338           "justification": "The staggered rollout mechanism is described in detail (Section 3.1): limited training session capacity, manager scheduling to minimize disruption, and contractual license limits. The small RCT pilot is mentioned (Section 4.1.1, ~50 workers, half randomized to treatment), though the randomization mechanism itself is described in limited detail. The IV approach instruments individual adoption with team-level timing (Section 4.1.2).",
    339           "source": "opus"
    340         },
    341         "blinding_described": {
    342           "applies": true,
    343           "answer": false,
    344           "justification": "No blinding is described. Agents knew whether they had AI access (they received a 3-hour onboarding training). Whether managers or customers were blinded is not discussed.",
    345           "source": "opus"
    346         },
    347         "attrition_reported": {
    348           "applies": true,
    349           "answer": true,
    350           "justification": "Worker attrition is explicitly analyzed as an outcome (Section 6.3, Figure 11, Table A.11). The paper reports baseline attrition rate of 28.8% (Table A.11 DV Mean) and notes that attrition analysis drops pre-treatment observations for treated agents because they 'must survive to be treated' (Section 6.3, Appendix A.3.5).",
    351           "source": "opus"
    352         }
    353       },
    354       "cost_and_practicality": {
    355         "inference_cost_reported": {
    356           "applies": true,
    357           "answer": false,
    358           "justification": "No costs of the AI system are reported. The paper mentions 'generative AI was costly and relatively untested' (Section 3.1) and that the firm had a 'limited budget for its deployment' but provides no specific cost figures.",
    359           "source": "opus"
    360         },
    361         "compute_budget_stated": {
    362           "applies": true,
    363           "answer": false,
    364           "justification": "No computational budget is stated for either the AI system or the econometric analysis.",
    365           "source": "opus"
    366         }
    367       }
    368     }
    369   },
    370   "claims": [
    371     {
    372       "claim": "Access to AI assistance increases worker productivity by 15% on average, as measured by issues resolved per hour.",
    373       "evidence": "DiD regression with agent and tenure fixed effects yields a coefficient of 0.301 RPH (Table 2, Column 3), 15.2% above the pre-treatment mean of 1.97 RPH; replicated across four robust DiD estimators (Table A.9).",
    374       "supported": "strong"
    375     },
    376     {
    377       "claim": "Less experienced and lower-skilled workers benefit disproportionately from AI assistance, with the lowest skill quintile seeing ~36% productivity gains versus near-zero for the highest quintile.",
    378       "evidence": "Figure 3A shows monotonically decreasing treatment effects from Q1 (0.527 RPH, ~36%) to Q5 (0.015 RPH, insignificant); Figure 3B shows analogous monotonic pattern by tenure, with <1-month agents gaining 0.707 RPH versus essentially zero for >12-month agents (Table A.6).",
    379       "supported": "strong"
    380     },
    381     {
    382       "claim": "AI assistance facilitates durable worker learning rather than mere reliance; workers retain productivity gains even when the AI system is unavailable during outages.",
    383       "evidence": "Figure 6 (Panel B vs. A) shows productivity improvements during outage periods grow with cumulative AI exposure time, unlike non-outage periods which show stable immediate gains; Panels C/D show this learning effect only occurs for high-adherence workers.",
    384       "supported": "moderate"
    385     },
    386     {
    387       "claim": "AI gains are largest for moderately rare customer problems where agents have less baseline experience but the AI has sufficient training data.",
    388       "evidence": "Figure 7A shows a non-monotonic pattern: largest gains (5–6 minute reduction) for 75th–90th percentile rarity topics vs. 4–5 minutes for most common and 4 minutes for rarest; Figure 7B shows a monotonic relationship when controlling for overall frequency, with least agent-familiar problems showing 15% vs. 10% reductions.",
    389       "supported": "strong"
    390     },
    391     {
    392       "claim": "AI assistance improves customer sentiment by half a standard deviation and reduces manager escalation requests by ~25%.",
    393       "evidence": "Table 4 shows customer sentiment coefficient of 0.177 (p<0.01, baseline mean=0.141, representing ~0.5 SD); Table 4 Column 3 shows -0.00875 escalation rate effect vs. baseline of 3.77%, approximately 23% reduction.",
    394       "supported": "strong"
    395     },
    396     {
    397       "claim": "AI assistance improves English language fluency particularly for Philippines-based agents.",
    398       "evidence": "Figure 8 shows significant event-study increases in comprehensibility and native fluency; Panels C/D show larger effects for Filipino agents. The LLM scores were validated against human raters, with no statistically significant difference between the two (p=0.22 and p=0.12).",
    399       "supported": "strong"
    400     },
    401     {
    402       "claim": "AI assistance reduces worker attrition by approximately 40% among newer agents.",
    403       "evidence": "Table A.11 shows -0.095 to -0.121 reduction in monthly attrition probability for agents with 0–2 months tenure, off a baseline of ~25%; the authors caveat that agent fixed effects cannot be included, so selection bias cannot be ruled out.",
    404       "supported": "moderate"
    405     }
    406   ],
    407   "methodology_tags": [
    408     "observational",
    409     "rct"
    410   ],
    411   "key_findings": "A staggered quasi-natural experiment across 5,172 customer support agents finds that a GPT-based conversational AI assistant increases productivity (issues resolved per hour) by 15% on average, with substantially larger gains for novice and lower-skilled workers (~36% for lowest skill quintile vs. near-zero for highest) — inverting the traditional skill-biased technology complementarity pattern. Evidence from AI system outages demonstrates that gains partly reflect durable learning: workers who closely followed AI recommendations continued to outperform their pre-AI baseline even when the system was unavailable. Beyond productivity, AI assistance improved customer sentiment by half a standard deviation, reduced manager escalation requests by ~25%, decreased attrition among new workers by ~40%, and improved English fluency particularly for international agents.",
    412   "red_flags": [
    413     {
    414       "flag": "GPT version unspecified",
    415       "detail": "The AI tool is described only as 'built on a recent version of the GPT family'; no specific model version or training snapshot date is identified, making findings impossible to attribute to a specific system or replicate as AI capabilities evolve."
    416     },
    417     {
    418       "flag": "No IRB or ethics disclosure",
    419       "detail": "The study continuously monitored the communications and performance of 5,172 workers over 21 months without mentioning any IRB approval or informed consent process, raising ethical concerns given the workplace surveillance involved."
    420     },
    421     {
    422       "flag": "Proprietary data, no replication possible",
    423       "detail": "All data is proprietary to an unnamed private firm; neither raw data nor analysis code is released, making independent replication or verification of the results impossible."
    424     },
    425     {
    426       "flag": "No financial interests declaration",
    427       "detail": "No competing interests statement is provided; it is unclear whether authors have equity, consulting, or advisory relationships with OpenAI or the unnamed AI firm whose tool is the primary subject of study."
    428     },
    429     {
    430       "flag": "Attrition regression selection bias",
    431       "detail": "The attrition analysis cannot include agent fixed effects because attrition only occurs once per worker; the authors acknowledge this may overstate AI effects on retention if the firm preferentially assigned AI access to agents deemed more likely to stay."
    432     },
    433     {
    434       "flag": "Small RCT incomplete and underpowered",
    435       "detail": "The randomized pilot involved ~50 workers with control group members unidentifiable in the data (only treatment group of 22 can be recovered); main causal identification relies entirely on the quasi-experimental staggered rollout."
    436     }
    437   ],
    438   "cited_papers": [
    439     {
    440       "title": "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot",
    441       "relevance": "Peng et al. (2023) — key comparator lab study of AI coding assistant finding 2x speed improvement in a controlled task; this paper extends to real-world multi-month deployment with longitudinal learning effects"
    442     },
    443     {
    444       "title": "Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence",
    445       "relevance": "Noy and Zhang (2023) — RCT showing ChatGPT improves professional writing task performance; companion study finding skill-compression consistent with this paper's heterogeneity results"
    446     },
    447     {
    448       "title": "Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality",
    449       "relevance": "Dell'Acqua et al. (2023) — BCG consulting study finding GPT-4 helps within capability frontier but hurts outside it; directly analogous to this paper's finding that top performers can be harmed by AI"
    450     },
    451     {
    452       "title": "Estimating dynamic treatment effects in event studies with heterogeneous treatment effects",
    453       "relevance": "Sun and Abraham (2021) — primary causal identification strategy used throughout the paper to handle heterogeneous treatment effects in the staggered DiD design"
    454     },
    455     {
    456       "title": "When Are Combinations of Humans and AI Useful?",
    457       "relevance": "Vaccaro et al. (2024) — meta-analysis finding human-AI teams often underperform; this paper provides a more positive counterexample where augmentation works at scale"
    458     },
    459     {
    460       "title": "AI, Skill, and Productivity: The Case of Taxi Drivers",
    461       "relevance": "Kanazawa et al. (2022) — study of non-generative AI for taxi drivers finding low-skill drivers benefited most; consistent with this paper's central skill-heterogeneity finding"
    462     },
    463     {
    464       "title": "Robots and Jobs: Evidence from US Labor Markets",
    465       "relevance": "Acemoglu and Restrepo (2020) — canonical skill-biased automation paper; this paper explicitly positions against this prior work by showing generative AI inverts the typical complementarity pattern"
    466     },
    467     {
    468       "title": "More than a Feeling: Accuracy and Application of Sentiment Analysis",
    469       "relevance": "Hartmann et al. (2023) — provides the SiEBERT sentiment model used for customer and agent sentiment measurement, a core outcome variable"
    470     },
    471     {
    472       "title": "AI Assistance in Legal Analysis: An Empirical Study",
    473       "relevance": "Choi and Schwarcz (2023) — contemporaneous study of AI assistance for law students; part of the cluster of generative AI productivity studies this paper is situated within and contrasted against"
    474     }
    475   ],
    476   "engagement_factors": {
    477     "practical_relevance": {
    478       "score": 3,
    479       "justification": "Direct evidence from a real enterprise deployment of an AI assistant across thousands of workers; immediately actionable for firms considering AI adoption decisions."
    480     },
    481     "surprise_contrarian": {
    482       "score": 2,
    483       "justification": "Counterintuitive finding that generative AI benefits lower-skilled workers most, inverting the traditional skill-biased technical change pattern; the durable learning result (gains persist during outages) also challenges the 'crutch' narrative."
    484     },
    485     "fear_safety": {
    486       "score": 1,
    487       "justification": "Raises questions about labor displacement (12% fewer worker-hours to serve same volume) and AI training quality degradation as top workers stop innovating, but these are framed as open questions rather than immediate risks."
    488     },
    489     "drama_conflict": {
    490       "score": 1,
    491       "justification": "Modest conflict angle: top performers may be harmed by AI reducing their quality scores, and the incentive problem of top workers contributing to AI training data is raised as a concern."
    492     },
    493     "demo_ability": {
    494       "score": 2,
    495       "justification": "The general category of tools (ChatGPT, Copilot) is publicly accessible and relatable; the specific enterprise customer service system cannot be demoed but findings easily extrapolate to tools readers already use."
    496     },
    497     "brand_recognition": {
    498       "score": 3,
    499       "justification": "Erik Brynjolfsson (Stanford/NBER) is one of the most prominent economists studying technology and labor; the MIT/Stanford/NBER pedigree and Fortune 500 firm context add a strong credibility signal."
    500     }
    501   },
    502   "hn_data": {
    503     "threads": [
    504       {
    505         "hn_id": "43577957",
    506         "title": "A Study of Undefined Behavior Across Foreign Function Boundaries in Rust Libs",
    507         "points": 4,
    508         "comments": 1,
    509         "url": "https://news.ycombinator.com/item?id=43577957",
    510         "created_at": "2025-04-04T03:10:05Z"
    511       },
    512       {
    513         "hn_id": "36513194",
    514         "title": "On the Planning Abilities of Large Language Models – A Critical Investigation",
    515         "points": 3,
    516         "comments": 0,
    517         "url": "https://news.ycombinator.com/item?id=36513194",
    518         "created_at": "2023-06-28T21:57:00Z"
    519       },
    520       {
    521         "hn_id": "31457199",
    522         "title": "Masked image modeling advances 3D medical image analysis",
    523         "points": 2,
    524         "comments": 0,
    525         "url": "https://news.ycombinator.com/item?id=31457199",
    526         "created_at": "2022-05-21T12:15:54Z"
    527       },
    528       {
    529         "hn_id": "36053359",
    530         "title": "Why Is the Winner the Best?",
    531         "points": 1,
    532         "comments": 1,
    533         "url": "https://news.ycombinator.com/item?id=36053359",
    534         "created_at": "2023-05-24T02:27:19Z"
    535       },
    536       {
    537         "hn_id": "44041341",
    538         "title": "Grounded in Context: Retrieval-Based Method for Hallucination Detection",
    539         "points": 1,
    540         "comments": 0,
    541         "url": "https://news.ycombinator.com/item?id=44041341",
    542         "created_at": "2025-05-20T13:23:42Z"
    543       },
    544       {
    545         "hn_id": "39736862",
    546         "title": "The Planning Abilities of LLMs: A Critical Investigation (2023)",
    547         "points": 1,
    548         "comments": 0,
    549         "url": "https://news.ycombinator.com/item?id=39736862",
    550         "created_at": "2024-03-17T18:45:17Z"
    551       },
    552       {
    553         "hn_id": "38586493",
    554         "title": "On the Planning Abilities of Large Language Models: A Critical Investigation",
    555         "points": 1,
    556         "comments": 0,
    557         "url": "https://news.ycombinator.com/item?id=38586493",
    558         "created_at": "2023-12-09T21:57:24Z"
    559       },
    560       {
    561         "hn_id": "37627101",
    562         "title": "How Robust Is Google's Bard to Adversarial Image Attacks?",
    563         "points": 1,
    564         "comments": 0,
    565         "url": "https://news.ycombinator.com/item?id=37627101",
    566         "created_at": "2023-09-23T20:20:29Z"
    567       },
    568       {
    569         "hn_id": "37158226",
    570         "title": "What Types of Questions Require Conversation to Answer? AskReddit Study",
    571         "points": 1,
    572         "comments": 0,
    573         "url": "https://news.ycombinator.com/item?id=37158226",
    574         "created_at": "2023-08-17T07:07:27Z"
    575       },
    576       {
    577         "hn_id": "36105713",
    578         "title": "On the Planning Abilities of Large Language Models – A Critical Investigation",
    579         "points": 1,
    580         "comments": 0,
    581         "url": "https://news.ycombinator.com/item?id=36105713",
    582         "created_at": "2023-05-28T16:51:29Z"
    583       }
    584     ],
    585     "top_points": 4,
    586     "total_points": 16,
    587     "total_comments": 2
    588   }
    589 }