scan-v4.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

      1 {
      2   "scan_version": 4,
      3   "paper_type": "empirical",
      4   "paper": {
      5     "title": "Evaluating Large Language Models Trained on Code",
      6     "authors": [
      7       "Chen, M.",
      8       "Tworek, J.",
      9       "Jun, H.",
     10       "Yuan, Q.",
     11       "Pinto, H. P. d. O.",
     12       "et al."
     13     ],
     14     "year": 2021,
     15     "venue": "arXiv",
     16     "arxiv_id": "2107.03374",
     17     "doi": null
     18   },
     19   "checklist": {
     20     "claims_and_evidence": {
     21       "abstract_claims_supported": {
     22         "applies": true,
     23         "answer": true,
     24         "justification": "All abstract claims are supported: 28.8% pass rate (Table 1), GPT-3 at 0% (Table 1), GPT-J at 11.4% (the abstract's figure; Table 1 reports 11.62%), 70.2% with 100 samples (Section 1, consistent with Figure 1 showing Codex-S at 77.5%). The abstract's qualitative claims about limitations are supported by Section 6.",
     25         "source": "opus"
     26       },
     27       "causal_claims_justified": {
     28         "applies": true,
     29         "answer": true,
     30         "justification": "Causal claims about fine-tuning improving performance are supported by controlled comparisons: same base model architecture with and without code fine-tuning (GPT vs Codex), and with and without supervised fine-tuning (Codex vs Codex-S). Each comparison varies a single factor. The alignment analysis (Appendix E) carefully distinguishes capability from alignment.",
     31         "source": "opus"
     32       },
     33       "generalization_bounded": {
     34         "applies": true,
     35         "answer": true,
     36         "justification": "The paper explicitly bounds its scope: 'In this work, we focus on the task of generating standalone Python functions from docstrings' (Section 1). The abstract states 'study its Python code-writing capabilities.' While the broader impacts section discusses general code generation, the empirical claims are bounded to Python.",
     37         "source": "opus"
     38       },
     39       "alternative_explanations_discussed": {
     40         "applies": true,
     41         "answer": true,
     42         "justification": "Section 7.2 and Appendix E explicitly distinguish misalignment from incompetence as alternative explanations for model failures. Appendix E.3 considers whether poor performance on buggy prompts could be a robustness failure rather than misalignment. Section 4 discusses data distribution mismatch as a factor.",
     43         "source": "opus"
     44       },
     45       "proxy_outcome_distinction": {
     46         "applies": true,
     47         "answer": true,
     48         "justification": "Section 2.1 explicitly argues for functional correctness over BLEU score as the evaluation metric, and Figure 8 demonstrates empirically that BLEU is a poor proxy for correctness. The paper measures pass@k and claims code generation capability — the measurement and claim are well-aligned with minimal proxy gap.",
     49         "source": "opus"
     50       }
     51     },
     52     "limitations_and_scope": {
     53       "limitations_section_present": {
     54         "applies": true,
     55         "answer": true,
     56         "justification": "Section 6 is a dedicated 'Limitations' section discussing specific shortcomings. Additionally, Section 7 (Broader Impacts) provides extensive discussion of risks and limitations across over-reliance, misalignment, bias, security, and economic impacts.",
     57         "source": "opus"
     58       },
     59       "threats_to_validity_specific": {
     60         "applies": true,
     61         "answer": true,
     62         "justification": "Section 6 discusses specific threats: Codex is 'not sample efficient to train,' struggles with 'docstrings describing long chains of operations' (quantified in Figure 11), and has 'difficulty with binding operations to variables' (concrete code example provided). Appendix E discusses specific alignment threats with empirical evidence.",
     63         "source": "opus"
     64       },
     65       "scope_boundaries_stated": {
     66         "applies": true,
     67         "answer": true,
     68         "justification": "The paper explicitly states it focuses on 'generating standalone Python functions from docstrings' (Section 1). Section 6 notes the model 'struggles to parse through increasingly long and higher-level or system-level specifications.' The broader impacts section (7.5) states 'at their current level of capability, Codex models do not materially lower the barrier to entry for malware development.'",
     69         "source": "opus"
     70       }
     71     },
     72     "conflicts_of_interest": {
     73       "funding_disclosed": {
     74         "applies": true,
     75         "answer": true,
     76         "justification": "The acknowledgments section states 'we thank GitHub for partnering to build GitHub Copilot and Microsoft Azure for supporting model training with infrastructure management,' disclosing corporate support for the research.",
     77         "source": "opus"
     78       },
     79       "affiliations_disclosed": {
     80         "applies": true,
     81         "answer": true,
     82         "justification": "Author affiliations are clearly listed: '1 OpenAI, San Francisco, California, USA. 2 Anthropic AI, San Francisco, California, USA. Work performed while at OpenAI. 3 Zipline, South San Francisco, California, USA. Work performed while at OpenAI.'",
     83         "source": "opus"
     84       },
     85       "funder_independent_of_outcome": {
     86         "applies": true,
     87         "answer": false,
     88         "justification": "OpenAI has a direct financial interest in Codex's performance — the paper states 'A distinct production version of Codex powers GitHub Copilot.' OpenAI is evaluating its own commercial product. Microsoft Azure (infrastructure provider) and GitHub (partner for Copilot) also have financial stakes.",
     89         "source": "opus"
     90       },
     91       "financial_interests_declared": {
     92         "applies": true,
     93         "answer": false,
     94         "justification": "No competing interests or financial disclosure statement is present. The paper does not include a standard conflicts-of-interest declaration, despite OpenAI's commercial interest in Codex through GitHub Copilot.",
     95         "source": "opus"
     96       }
     97     },
     98     "scope_and_framing": {
     99       "key_terms_defined": {
    100         "applies": true,
    101         "answer": true,
    102         "justification": "Pass@k is formally defined with an unbiased estimator (Equation 1) and a numpy implementation; functional correctness is defined; 'alignment' is operationalized in Appendix E; HumanEval dataset is precisely characterized.",
    103         "source": "haiku"
    104       },
    105       "intended_contribution_clear": {
    106         "applies": true,
    107         "answer": true,
    108         "justification": "The paper clearly states it contributes the Codex model family, the HumanEval benchmark, the unbiased pass@k evaluation methodology, and a broader impacts/hazard analysis for code generation systems.",
    109         "source": "haiku"
    110       },
    111       "engagement_with_prior_work": {
    112         "applies": true,
    113         "answer": true,
    114         "justification": "Section 8 provides comprehensive engagement with program induction, program synthesis, code LMs (CodeBERT, PyMT5), functional correctness metrics (SPoC, TransCoder), and code benchmarks (APPS, CodeXGLUE), situating contributions clearly.",
    115         "source": "haiku"
    116       }
    117     }
    118   },
    119   "type_checklist": {
    120     "empirical": {
    121       "artifacts": {
    122         "code_released": {
    123           "applies": true,
    124           "answer": true,
    125           "justification": "The HumanEval evaluation framework is released at https://www.github.com/openai/human-eval. Alignment evaluation data is released at https://github.com/openai/code-align-evals-data. However, the Codex model itself is not released.",
    126           "source": "opus"
    127         },
    128         "data_released": {
    129           "applies": true,
    130           "answer": true,
    131           "justification": "The HumanEval dataset of 164 hand-written programming problems is released at https://www.github.com/openai/human-eval, as stated in Section 2.2.",
    132           "source": "opus"
    133         },
    134         "environment_specified": {
    135           "applies": true,
    136           "answer": false,
    137           "justification": "No requirements.txt, Dockerfile, or detailed environment setup is provided. The paper mentions Python and numpy (Figure 3) but does not provide enough detail to recreate the environment.",
    138           "source": "opus"
    139         },
    140         "reproduction_instructions": {
    141           "applies": true,
    142           "answer": false,
    143           "justification": "No step-by-step reproduction instructions are provided. While the evaluation dataset is released, the model weights are not, and there are no scripts or README with commands to replicate the main experiments.",
    144           "source": "opus"
    145         }
    146       },
    147       "statistical_methodology": {
    148         "confidence_intervals_or_error_bars": {
    149           "applies": true,
    150           "answer": false,
    151           "justification": "All pass@k results in Tables 1 and 2 are reported as point estimates without confidence intervals or error bars. While the paper develops an unbiased estimator for pass@k (Equation 1, Appendix A), no uncertainty bounds on the estimates are provided.",
    152           "source": "opus"
    153         },
    154         "significance_tests": {
    155           "applies": true,
    156           "answer": false,
    157           "justification": "Comparative claims (e.g., Codex outperforms GPT-J, Codex-S outperforms Codex) are made by comparing raw pass@k numbers without any statistical significance tests.",
    158           "source": "opus"
    159         },
    160         "effect_sizes_reported": {
    161           "applies": true,
    162           "answer": true,
    163           "justification": "Results are reported with full baseline context: Codex-12B achieves 28.8% pass@1 vs GPT-3 at 0% and GPT-J at 11.4% (Section 1). Codex-S's improvement over Codex is quantified as 'an average margin of 6.5 percentage points on pass@1 and 15.1 percentage points on pass@100' (Section 4.5). The magnitude of differences is clear throughout.",
    164           "source": "opus"
    165         },
    166         "sample_size_justified": {
    167           "applies": true,
    168           "answer": false,
    169           "justification": "The HumanEval dataset contains 164 problems but there is no justification for why 164 was chosen, no power analysis, and no discussion of whether this sample size is sufficient for the claims made.",
    170           "source": "opus"
    171         },
    172         "variance_reported": {
    173           "applies": true,
    174           "answer": false,
    175           "justification": "Results are reported as single point estimates. While the unbiased pass@k estimator (Equation 1) accounts for sampling variance mathematically, no standard deviations, error bars, or spread measures across experimental runs are reported.",
    176           "source": "opus"
    177         }
    178       },
    179       "evaluation_design": {
    180         "baselines_included": {
    181           "applies": true,
    182           "answer": true,
    183           "justification": "Multiple baselines are included: GPT-3 (various sizes), GPT-Neo (125M, 1.3B, 2.7B), GPT-J-6B, and TabNine, all evaluated on HumanEval (Table 1). For APPS, GPT-Neo 2.7B fine-tuned results from Hendrycks et al. serve as baseline (Table 2).",
    184           "source": "opus"
    185         },
    186         "baselines_contemporary": {
    187           "applies": true,
    188           "answer": true,
    189           "justification": "GPT-Neo, GPT-J, and TabNine were all contemporary at time of publication (2021). GPT-J was released in May 2021, the same year as this paper.",
    190           "source": "opus"
    191         },
    192         "ablation_study": {
    193           "applies": true,
    194           "answer": true,
    195           "justification": "The paper systematically ablates key components: GPT (no code fine-tuning) vs Codex (code fine-tuned) vs Codex-S (supervised fine-tuned), showing the contribution of each stage. Model size is varied across 8 scales (12M to 12B). Different sampling strategies (random, mean log-prob, back-translation) are compared (Figure 7).",
    196           "source": "opus"
    197         },
    198         "multiple_metrics": {
    199           "applies": true,
    200           "answer": true,
    201           "justification": "Multiple evaluation metrics are used: pass@1, pass@10, pass@100 (Table 1), BLEU scores (Figure 8), test loss (Figure 4), and the APPS dataset metrics including raw and filtered pass@k (Table 2).",
    202           "source": "opus"
    203         },
    204         "human_evaluation": {
    205           "applies": true,
    206           "answer": true,
    207           "justification": "Codex-D docstring outputs are graded by hand: 'we grade sample docstrings by hand, considering a docstring correct if it uniquely and accurately specifies the code body. Due to the time consuming nature of this process, we only grade 10 samples per problem, for a total of 1640 problems' (Section 5, Table 3).",
    208           "source": "opus"
    209         },
    210         "held_out_test_set": {
    211           "applies": true,
    212           "answer": true,
    213           "justification": "HumanEval is specifically designed as a held-out test set: 'It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources' (Section 2.2).",
    214           "source": "opus"
    215         },
    216         "per_category_breakdown": {
    217           "applies": true,
    218           "answer": true,
    219           "justification": "Table 1 breaks down results by model size across all baselines. Table 2 breaks APPS results by difficulty level (Introductory, Interview, Competition). Figure 11 shows performance degradation by chain length. Figure 5 breaks down results by temperature.",
    220           "source": "opus"
    221         },
    222         "failure_cases_discussed": {
    223           "applies": true,
    224           "answer": true,
    225           "justification": "Section 6 (Limitations) provides detailed failure analysis: difficulty with long chains of operations (Figure 11), variable binding errors (concrete code example in Section 6), misalignment producing buggy code when prompted with buggy code (Figure 12), and insecure code generation (Figure 15, Appendix G).",
    226           "source": "opus"
    227         },
    228         "negative_results_reported": {
    229           "applies": true,
    230           "answer": true,
    231           "justification": "Multiple negative results: 'we did not observe improvements when starting from a pre-trained language model' (Section 3.2); back-translation ranking 'underperforms mean log-probability ranking' (Section 5); 'choosing the sample based on sum log probability can perform slightly worse than picking randomly' (Section 3.3); misalignment worsens with scale (Figure 12).",
    232           "source": "opus"
    233         }
    234       },
    235       "setup_transparency": {
    236         "model_versions_specified": {
    237           "applies": true,
    238           "answer": true,
    239           "justification": "Exact model sizes are specified for all models evaluated: Codex at 12M, 25M, 42M, 85M, 300M, 679M, 2.5B, and 12B parameters; GPT-Neo at 125M, 1.3B, 2.7B; GPT-J at 6B (Table 1). Since the Codex models are the authors' own, parameter counts uniquely identify each Codex variant.",
    240           "source": "opus"
    241         },
    242         "prompts_provided": {
    243           "applies": true,
    244           "answer": true,
    245           "justification": "Figure 2 shows three complete example prompts with the exact format used (header, signature, docstring). Stop sequences are specified: '\\nclass', '\\ndef', '\\n#', '\\nif', '\\nprint' (Section 3.2). Appendix B provides 8 additional full prompt examples. Appendix E.5 shows alignment evaluation prompts.",
    246           "source": "opus"
    247         },
    248         "hyperparameters_reported": {
    249           "applies": true,
    250           "answer": true,
    251           "justification": "Detailed hyperparameters are reported: nucleus sampling with top p=0.95, temperatures tested (0.2, 0.4, 0.8), n=200 samples per task. Training: 175-step linear warmup, cosine learning rate decay, 100 billion tokens, Adam with β1=0.9, β2=0.95, ε=10⁻⁸, weight decay 0.1 (Section 3.2).",
    252           "source": "opus"
    253         },
    254         "scaffolding_described": {
    255           "applies": false,
    256           "answer": false,
    257           "justification": "No agentic scaffolding is used. Codex performs direct model inference from prompts without any tool use, retry logic, or multi-step workflow.",
    258           "source": "opus"
    259         },
    260         "data_preprocessing_documented": {
    261           "applies": true,
    262           "answer": true,
    263           "justification": "Section 3.1 documents data preprocessing in detail: collected from 54 million GitHub repos (179 GB unique Python files under 1 MB), filtered out auto-generated files, average line length >100, max line length >1000, low alphanumeric percentage, resulting in 159 GB final dataset. Tokenizer adaptations for whitespace are described (Section 3.2).",
    264           "source": "opus"
    265         }
    266       },
    267       "data_integrity": {
    268         "raw_data_available": {
    269           "applies": true,
    270           "answer": false,
    271           "justification": "The HumanEval evaluation dataset is released, but the 159 GB training dataset is not. The training data cannot be independently verified. The model weights are also not released, preventing independent replication of results.",
    272           "source": "opus"
    273         },
    274         "data_collection_described": {
    275           "applies": true,
    276           "answer": true,
    277           "justification": "Section 3.1 describes data collection in detail: '54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB,' collected in May 2020. Filtering criteria are specified. Section 4.1-4.2 describe supervised fine-tuning data collection from competitive programming sites and CI-traced projects.",
    278           "source": "opus"
    279         },
    280         "recruitment_methods_described": {
    281           "applies": false,
    282           "answer": false,
    283           "justification": "No human participants are involved. Training data comes from public GitHub repositories. HumanEval problems were hand-written by the authors. The Codex-D docstring grading (Section 5) is done by the authors, not recruited participants.",
    284           "source": "opus"
    285         },
    286         "data_pipeline_documented": {
    287           "applies": true,
    288           "answer": true,
    289           "justification": "The data pipeline is documented: GitHub scraping (179 GB) → filtering by auto-generation, line length, alphanumeric content (159 GB final). For Codex-S training data: competitive programming problems (10,000 curated) + CI-traced functions (~40,000) → quality filtering using Codex-12B to remove ambiguous/stateful problems (Section 4.3).",
    290           "source": "opus"
    291         }
    292       },
    293       "contamination": {
    294         "training_cutoff_stated": {
    295           "applies": true,
    296           "answer": true,
    297           "justification": "Section 3.1 states 'Our training dataset was collected in May 2020 from 54 million public software repositories hosted on GitHub.'",
    298           "source": "opus"
    299         },
    300         "train_test_overlap_discussed": {
    301           "applies": true,
    302           "answer": true,
    303           "justification": "Section 2.2 explicitly discusses this concern: 'It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources. For example, there are more than ten public repositories containing solutions to Codeforces problems, which make up part of the recently proposed APPS dataset.'",
    304           "source": "opus"
    305         },
    306         "benchmark_contamination_addressed": {
    307           "applies": true,
    308           "answer": true,
    309           "justification": "HumanEval was specifically designed to avoid contamination: hand-written problems created after the training data collection (May 2020). The paper notes 'Though not a guarantee for problem novelty, all problems were hand-written and not programmatically copied from existing sources' (Figure 2 caption). For APPS, contamination risk is acknowledged.",
    310           "source": "opus"
    311         }
    312       },
    313       "human_studies": {
    314         "pre_registered": {
    315           "applies": false,
    316           "answer": false,
    317           "justification": "No human participants in this study. The paper evaluates models on benchmarks.",
    318           "source": "opus"
    319         },
    320         "irb_or_ethics_approval": {
    321           "applies": false,
    322           "answer": false,
    323           "justification": "No human participants in this study.",
    324           "source": "opus"
    325         },
    326         "demographics_reported": {
    327           "applies": false,
    328           "answer": false,
    329           "justification": "No human participants in this study.",
    330           "source": "opus"
    331         },
    332         "inclusion_exclusion_criteria": {
    333           "applies": false,
    334           "answer": false,
    335           "justification": "No human participants in this study.",
    336           "source": "opus"
    337         },
    338         "randomization_described": {
    339           "applies": false,
    340           "answer": false,
    341           "justification": "No human participants in this study.",
    342           "source": "opus"
    343         },
    344         "blinding_described": {
    345           "applies": false,
    346           "answer": false,
    347           "justification": "No human participants in this study.",
    348           "source": "opus"
    349         },
    350         "attrition_reported": {
    351           "applies": false,
    352           "answer": false,
    353           "justification": "No human participants in this study.",
    354           "source": "opus"
    355         }
    356       },
    357       "cost_and_practicality": {
    358         "inference_cost_reported": {
    359           "applies": true,
    360           "answer": false,
    361           "justification": "No inference costs are reported. The paper generates 200 samples per problem for 164 problems across multiple model sizes and temperatures, but does not report the total inference cost, tokens consumed, or wall-clock time.",
    362           "source": "opus"
    363         },
    364         "compute_budget_stated": {
    365           "applies": true,
    366           "answer": true,
    367           "justification": "Section 7.6 states 'The original training of GPT-3-12B consumed hundreds of petaflop/s-days of compute, while fine-tuning it to create Codex-12B consumed a similar amount of compute.' Section 3.2 states training was for 100 billion tokens. The platform (Azure) is identified.",
    368           "source": "opus"
    369         }
    370       },
    371       "experimental_rigor": {
    372         "seed_sensitivity_reported": {
    373           "applies": true,
    374           "answer": false,
    375           "justification": "No seed sensitivity analysis is reported. Results are generated from a single set of 200 samples per problem with no discussion of how results vary across random seeds.",
    376           "source": "opus"
    377         },
    378         "number_of_runs_stated": {
    379           "applies": true,
    380           "answer": true,
    381           "justification": "The number of samples is clearly stated: 'we generate n ≥ k samples per task (in this paper, we use n = 200 and k ≤ 100)' (Section 2.1). For APPS evaluation, 1000 samples are generated per task.",
    382           "source": "opus"
    383         },
    384         "hyperparameter_search_budget": {
    385           "applies": true,
    386           "answer": false,
    387           "justification": "While Figure 5 shows pass@k at different temperatures (0.2, 0.4, 0.8), no formal hyperparameter search budget is stated (number of configurations tried, search method, or total compute spent on search).",
    388           "source": "opus"
    389         },
    390         "best_config_selection_justified": {
    391           "applies": true,
    392           "answer": true,
    393           "justification": "The temperature selection process is transparent: Figure 5 plots pass@k against temperature for various k values, and the optimal temperature is selected from the upper hull. They report 'the optimal temperature for pass@1 is T*=0.2 and the optimal temperature for pass@100 is T*=0.8' for the 679M model (Section 3.3).",
    394           "source": "opus"
    395         },
    396         "multiple_comparison_correction": {
    397           "applies": false,
    398           "answer": false,
    399           "justification": "No formal statistical tests with p-values are performed, so multiple comparison correction is not applicable.",
    400           "source": "opus"
    401         },
    402         "self_comparison_bias_addressed": {
    403           "applies": true,
    404           "answer": false,
    405           "justification": "The authors compare their own Codex against GPT-Neo, GPT-J, and TabNine without acknowledging the potential bias of authors evaluating their own system. There is no discussion of how having full control over one system but not others might affect the comparison.",
    406           "source": "opus"
    407         },
    408         "compute_budget_vs_performance": {
    409           "applies": true,
    410           "answer": true,
    411           "justification": "Figures 1, 4, and 6 show performance as a function of model size (a proxy for compute). Figure 4 shows test loss follows a power law with model size. Figure 6 shows pass@1 and pass@100 scale as sigmoids in log-parameters.",
    412           "source": "opus"
    413         },
    414         "benchmark_construct_validity": {
    415           "applies": true,
    416           "answer": true,
    417           "justification": "Section 2.1 provides extensive discussion of construct validity: argues functional correctness is superior to BLEU for measuring code generation, shows empirically that BLEU fails to distinguish correct from incorrect code (Figure 8), and discusses how functional correctness mirrors real software development practice (test-driven development).",
    418           "source": "opus"
    419         },
    420         "scaffold_confound_addressed": {
    421           "applies": false,
    422           "answer": false,
    423           "justification": "No scaffolding is involved. All models are evaluated via direct inference with the same prompting approach, so there is no scaffold confound.",
    424           "source": "opus"
    425         }
    426       },
    427       "data_leakage": {
    428         "temporal_leakage_addressed": {
    429           "applies": true,
    430           "answer": true,
    431           "justification": "Training data was collected in May 2020 (Section 3.1). HumanEval was hand-written specifically for this evaluation and did not exist before the training data collection, inherently addressing temporal leakage. For APPS, the paper acknowledges that GitHub contains Codeforces solutions.",
    432           "source": "opus"
    433         },
    434         "feature_leakage_addressed": {
    435           "applies": true,
    436           "answer": false,
    437           "justification": "The paper does not explicitly discuss whether the evaluation setup leaks information through features. For example, function signatures and docstring style in HumanEval may provide implicit cues not available in real usage scenarios.",
    438           "source": "opus"
    439         },
    440         "non_independence_addressed": {
    441           "applies": true,
    442           "answer": true,
    443           "justification": "Section 2.2 addresses non-independence by hand-writing HumanEval: 'It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources.' This ensures test problems are independent of training data.",
    444           "source": "opus"
    445         },
    446         "leakage_detection_method": {
    447           "applies": true,
    448           "answer": false,
    449           "justification": "No concrete leakage detection method is applied (no canary strings, membership inference, or n-gram overlap analysis). While hand-writing HumanEval is a prevention strategy, the paper acknowledges 'Though not a guarantee for problem novelty' (Figure 2) and no verification method confirms the problems do not overlap with training data.",
    450           "source": "opus"
    451         }
    452       }
    453     }
    454   },
    455   "claims": [
    456     {
    457       "claim": "Codex-12B achieves 28.8% pass@1 on HumanEval, substantially outperforming GPT-3 (~0%) and GPT-J-6B (11.4%)",
    458       "evidence": "Table 1 reports Codex-12B pass@1 = 28.81%, GPT-J 6B = 11.62% (the abstract cites 11.4%), GPT models near 0%; consistent with the abstract and Figure 1",
    459       "supported": "strong"
    460     },
    461     {
    462       "claim": "Supervised fine-tuning (Codex-S) improves over base Codex by an average of 6.5pp on pass@1 and 15.1pp on pass@100 across model sizes",
    463       "evidence": "Section 4.5 explicitly states these figures; Figure 10 shows Codex-S is '1-2 orders of magnitude more parameter efficient' on both metrics",
    464       "supported": "strong"
    465     },
    466     {
    467       "claim": "BLEU score is not a reliable proxy for functional correctness in code generation",
    468       "evidence": "Figure 8 shows heavily overlapping BLEU score distributions for correct and incorrect Codex-12B completions across 4 random tasks, demonstrating BLEU cannot separate functionally correct from incorrect code",
    469       "supported": "strong"
    470     },
    471     {
    472       "claim": "Performance degrades exponentially with the number of chained operations in a docstring, dropping by roughly a factor of 2-3 per additional component",
    473       "evidence": "Figure 11 plots pass rates against chain length for synthetic 13-building-block tasks, showing near-exponential degradation for Codex-12B",
    474       "supported": "strong"
    475     },
    476     {
    477       "claim": "Mean log-probability ranking outperforms random sample selection as a practical alternative to oracle unit-test access",
    478       "evidence": "Figure 7 shows log-prob ranking consistently outperforms random selection for Codex-12B; the 'average benefit over random ranking is 11.6 percentage points' for Codex-S (Section 4.5)",
    479       "supported": "strong"
    480     },
    481     {
    482       "claim": "Code generation performance follows the same power-law scaling with model size as general language model training",
    483       "evidence": "Figure 4 shows test loss after code fine-tuning follows functional form (N/5.92×10^7)^{-0.13}; Figure 6 shows smooth scaling of pass@1 and pass@100 with model size",
    484       "supported": "strong"
    485     },
    486     {
    487       "claim": "Codex frequently generates insecure cryptographic code, with no clear improvement from model scaling",
    488       "evidence": "Figure 15 shows a 'significant fraction' of outputs use insecure RSA key sizes or AES-ECB mode; 'we do not see a robust model size trend' suggesting this is an alignment rather than capability limitation",
    489       "supported": "moderate"
    490     },
    491     {
    492       "claim": "GPT-J-6B (11.6% pass@1) performs roughly on par with Codex-300M (13.2% pass@1), implying about a 20× parameter-efficiency advantage for code fine-tuning",
    493       "evidence": "Table 1 and Section 3.4 directly compare GPT-J 6B (11.62% pass@1) against Codex model sizes; Codex-300M achieves 13.17%",
    494       "supported": "strong"
    495     }
    496   ],
    497   "methodology_tags": [
    498     "benchmark-eval",
    499     "observational"
    500   ],
    501   "key_findings": "Codex, a GPT model fine-tuned on 159GB of filtered public GitHub Python code, achieves 28.8% pass@1 on the new HumanEval benchmark (164 hand-written problems), substantially outperforming GPT-3 (0%) and GPT-J (11.4%). Repeated sampling is highly effective: selecting the unit-test-passing sample from 100 candidates achieves 77.5% coverage with Codex-S, while mean log-probability ranking provides a practical oracle-free approximation. The paper introduces HumanEval and an unbiased pass@k estimator, demonstrating that BLEU is an unreliable metric for code correctness. An extensive hazard analysis documents concrete risks: alignment failures (buggy context propagates buggy output, worsening with scale), frequent insecure cryptographic code generation, bias encoded in code completions, and potential labor market and legal implications.",
    502   "red_flags": [
    503     {
    504       "flag": "Self-evaluation without independence",
    505       "detail": "OpenAI employees evaluate Codex, OpenAI's own commercial product powering GitHub Copilot, with no independent third-party validation of reported results."
    506     },
    507     {
    508       "flag": "No confidence intervals on benchmark results",
    509       "detail": "All pass@k results are point estimates with no confidence intervals or error bars; on a 164-problem benchmark, small percentage-point differences may not be statistically distinguishable."
    510     },
    511     {
    512       "flag": "Abstract 70.2% vs Table 72.31% discrepancy",
    513       "detail": "The abstract claims '70.2% of our problems with 100 samples' but Table 1 shows Codex-12B pass@100 = 72.31%; no explanation is given for the ~2pp discrepancy."
    514     },
    515     {
    516       "flag": "Production model differs from evaluated models",
    517       "detail": "The paper explicitly states 'A distinct production version of Codex powers GitHub Copilot,' meaning evaluated research models may not reflect the deployed system's capabilities."
    518     },
    519     {
    520       "flag": "Small benchmark size with no power analysis",
    521       "detail": "HumanEval contains only 164 problems with no power analysis or justification; this limits the statistical power of comparisons between models differing by a few percentage points."
    522     }
    523   ],
    524   "cited_papers": [
    525     {
    526       "title": "Language Models are Few-Shot Learners (GPT-3)",
    527       "relevance": "Foundation model Codex is fine-tuned from; used as primary baseline demonstrating 0% pass@1 on HumanEval, establishing the value of code specialization"
    528     },
    529     {
    530       "title": "Measuring Coding Challenge Competence with APPS",
    531       "relevance": "Secondary benchmark used to evaluate Codex generalization; introduces introductory/interview/competition difficulty tiers for coding evaluation"
    532     },
    533     {
    534       "title": "SPoC: Search-based Pseudocode to Code",
    535       "relevance": "Introduced the pass@k functional correctness metric that this paper adopts and extends with an unbiased estimator; directly cited as the source of the metric"
    536     },
    537     {
    538       "title": "Unsupervised Translation of Programming Languages (TransCoder)",
    539       "relevance": "Prior work using functional correctness evaluation for code; supports the paper's argument that BLEU is inadequate and functional correctness better captures model capability"
    540     },
    541     {
    542       "title": "CodeBERT: A Pre-Trained Model for Programming and Natural Languages",
    543       "relevance": "Prior code-specialized transformer representing the state of the art before Codex; establishes the lineage of code LMs the paper builds on"
    544     },
    545     {
    546       "title": "GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model",
    547       "relevance": "Primary open-source baseline comparison; demonstrates that general LLMs trained on mixed data (8% GitHub code) are substantially less capable than code-specialized models"
    548     },
    549     {
    550       "title": "Scaling Laws for Neural Language Models",
    551       "relevance": "Establishes the power law scaling framework that the paper verifies holds after code fine-tuning, connecting Codex's scaling behavior to the broader literature"
    552     },
    553     {
    554       "title": "Extracting Training Data from Large Language Models",
    555       "relevance": "Cited in security analysis; relevant to risks of training data memorization and potential leakage of sensitive data from code generation models"
    556     },
    557     {
    558       "title": "You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion",
    559       "relevance": "Directly relevant to supply chain attack risks; found training data poisoning can trigger code completers to suggest insecure code at runtime"
    560     },
    561     {
    562       "title": "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation",
    563       "relevance": "Aggregated programming benchmark suite that contextualizes the HumanEval contribution within the broader code evaluation landscape"
    564     }
    565   ],
    566   "engagement_factors": {
    567     "practical_relevance": {
    568       "score": 3,
    569       "justification": "A production version of Codex powers GitHub Copilot, a tool used by millions of developers, and the paper introduces HumanEval, which became a standard benchmark."
    570     },
    571     "surprise_contrarian": {
    572       "score": 2,
    573       "justification": "The finding that sampling 100 candidates per problem raises the solve rate from 28.8% (Codex-12B, single sample) to 77.5% (Codex-S, at least one passing sample among 100) was surprising to many practitioners."
    574     },
    575     "fear_safety": {
    576       "score": 2,
    577       "justification": "Extensive analysis of insecure code generation, misalignment that worsens with scale, and potential for malware generation makes safety a major theme."
    578     },
    579     "drama_conflict": {
    580       "score": 1,
    581       "justification": "OpenAI evaluating its own commercial product raises mild conflict, but the paper is more celebratory than controversial."
    582     },
    583     "demo_ability": {
    584       "score": 2,
    585       "justification": "HumanEval benchmark is publicly released and Codex was available via API, though the model weights and training data were not released."
    586     },
    587     "brand_recognition": {
    588       "score": 3,
    589       "justification": "The paper comes from OpenAI, a production version of Codex powers GitHub Copilot, and the author list includes well-known figures such as Dario Amodei, Sam McCandlish, and Ilya Sutskever."
    590     }
    591   },
    592   "hn_data": {
    593     "threads": [
    594       {
    595         "hn_id": "27786283",
    596         "title": "Evaluating Large Language Models Trained on Code",
    597         "points": 12,
    598         "comments": 1,
    599         "url": "https://news.ycombinator.com/item?id=27786283",
    600         "created_at": "2021-07-09T17:39:30Z"
    601       },
    602       {
    603         "hn_id": "27767328",
    604         "title": "Evaluating Large Language Models Trained on Code",
    605         "points": 11,
    606         "comments": 1,
    607         "url": "https://news.ycombinator.com/item?id=27767328",
    608         "created_at": "2021-07-08T00:36:26Z"
    609       },
    610       {
    611         "hn_id": "27777657",
    612         "title": "Evaluating Large Language Models Trained on Code (paper about GH copilot model)",
    613         "points": 4,
    614         "comments": 1,
    615         "url": "https://news.ycombinator.com/item?id=27777657",
    616         "created_at": "2021-07-08T21:10:26Z"
    617       },
    618       {
    619         "hn_id": "27770978",
    620         "title": "Evaluating Large Language Models Trained on Code(GitHub Copilot)",
    621         "points": 3,
    622         "comments": 0,
    623         "url": "https://news.ycombinator.com/item?id=27770978",
    624         "created_at": "2021-07-08T12:20:59Z"
    625       },
    626       {
    627         "hn_id": "34552130",
    628         "title": "Evaluating Large Language Models Trained on Code",
    629         "points": 2,
    630         "comments": 0,
    631         "url": "https://news.ycombinator.com/item?id=34552130",
    632         "created_at": "2023-01-27T21:27:58Z"
    633       },
    634       {
    635         "hn_id": "29172572",
    636         "title": "Measuring mathematical problem solving with the MATH dataset",
    637         "points": 2,
    638         "comments": 0,
    639         "url": "https://news.ycombinator.com/item?id=29172572",
    640         "created_at": "2021-11-10T09:00:47Z"
    641       },
    642       {
    643         "hn_id": "26070039",
    644         "title": "On the Reproducibility of Neural Network Predictions",
    645         "points": 2,
    646         "comments": 0,
    647         "url": "https://news.ycombinator.com/item?id=26070039",
    648         "created_at": "2021-02-08T21:00:30Z"
    649       }
    650     ],
    651     "top_points": 12,
    652     "total_points": 36,
    653     "total_comments": 3
    654   }
    655 }
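
The key_findings entry above mentions the paper's unbiased pass@k estimator. The following is a minimal sketch of that formula, not part of scan-v4.json and not the paper's own code. It assumes n samples are drawn per problem, c of them pass the unit tests, and the per-problem estimate 1 - C(n-c, k) / C(n, k) is averaged over problems. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem as 1 - C(n-c, k) / C(n, k).

    n: samples drawn, c: samples that passed the tests, k: sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average the per-problem estimate over (n, c) pairs, one pair per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

For example, `mean_pass_at_k([(10, 5), (10, 0)], 1)` averages a problem where half the samples pass with one where none do. With only 164 problems in HumanEval, each problem shifts this mean by about 0.6 percentage points, which is relevant to the small-benchmark red flag listed above.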