calibration.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

calibration.json (20553B)
{
  "calibration": {
    "paper_slug": "agentless-2024",
    "calibration_date": "2026-02-28",
    "opus_model": "claude-opus-4-6",
    "sonnet_scan_date": null,
    "agreement_rate": 0.96,
    "total_questions": 50,
    "agreements": 48,
    "disagreements": 2,
    "disagreement_details": [
      {
        "category": "evaluation_design",
        "question": "human_evaluation",
        "sonnet": {"applies": true, "answer": true},
        "opus": {"applies": true, "answer": false},
        "direction": "sonnet_generous",
        "explanation": "Sonnet credits the manual classification of SWE-bench Lite problems (Section 6.1) as human evaluation. However, the schema is explicit: 'The humans must be evaluating what the system produced' and 'manual classification of the benchmark or dataset itself does not count.' The Section 6.1 classification sorts the benchmark problems by description quality, solution presence, and location information; it does not evaluate AGENTLESS's outputs. There is no human evaluation of the system's generated patches, localizations, or reproduction tests. The evaluation of system outputs is entirely automated via test suites. Sonnet even acknowledges this in its justification ('though not of system outputs directly') but still answers YES."
      },
      {
        "category": "conflicts_of_interest",
        "question": "funder_independent_of_outcome",
        "sonnet": {"applies": false, "answer": false},
        "opus": {"applies": true, "answer": false},
        "direction": "applies_boundary",
        "explanation": "Sonnet sets applies=false, reasoning that no funder is disclosed and so the criterion cannot be assessed. However, the schema says 'NA if unfunded', meaning applies=false only if the work is confirmed to be unfunded. The paper does not disclose any funding, but absence of disclosure is not confirmation of no funding. The authors are at a major research university (UIUC) and used substantial OpenAI API credits for experiments. It is reasonable to expect that funding exists but was not disclosed. Opus sets applies=true, answer=false because unfunded status cannot be confirmed and the absence of disclosure is itself a methodological concern."
      }
    ],
    "opus_checklist": {
      "artifacts": {
        "code_released": {
          "applies": true,
          "answer": true,
          "justification": "The abstract states 'We have open-sourced AGENTLESS at: https://github.com/OpenAutoCoder/Agentless'. A working GitHub URL is provided directly in the paper."
        },
        "data_released": {
          "applies": true,
          "answer": true,
          "justification": "The evaluation uses SWE-bench Lite, a publicly available standard benchmark. The paper did not collect proprietary data. SWE-bench is referenced as a public benchmark created by others."
        },
        "environment_specified": {
          "applies": true,
          "answer": false,
          "justification": "Section 4 mentions GPT-4o, LlamaIndex, OpenAI's text-embedding-3-small, and Python's ast library, but no requirements.txt, Dockerfile, or pinned dependency versions are provided in the paper."
        },
        "reproduction_instructions": {
          "applies": true,
          "answer": false,
          "justification": "The paper describes the approach's pipeline steps in detail (Section 3) and implementation parameters (Section 4), but does not include step-by-step reproduction instructions with commands to run. These may exist in the GitHub repo but are not in the paper."
        }
      },
      "statistical_methodology": {
        "confidence_intervals_or_error_bars": {
          "applies": true,
          "answer": false,
          "justification": "All results are point estimates only (e.g., '96 (32.00%)' resolved, '$0.70' cost). No confidence intervals, error bars, or uncertainty ranges are reported anywhere in the paper."
        },
        "significance_tests": {
          "applies": true,
          "answer": false,
          "justification": "The paper claims AGENTLESS outperforms all open-source agents (32.00% vs 30.67%) but uses no statistical significance tests. All comparisons are based on raw percentage differences."
        },
        "effect_sizes_reported": {
          "applies": true,
          "answer": false,
          "justification": "No formal effect sizes (Cohen's d, odds ratios, relative risk) are reported. Results are presented as raw solve counts and percentages. While the paper provides baseline context (e.g., 32.00% vs 30.67%), these are raw numbers, not standardized effect size measures."
        },
        "sample_size_justified": {
          "applies": true,
          "answer": false,
          "justification": "The paper uses SWE-bench Lite (300 problems) without any justification for why this sample size is adequate, nor any power analysis. The benchmark is taken as given."
        },
        "variance_reported": {
          "applies": true,
          "answer": false,
          "justification": "All results are single-run outcomes. No standard deviations, variance across seeds, or multi-run results with spread measures are reported. Figure 6 shows performance vs. number of samples but this is not variance across repeated runs."
        }
      },
      "evaluation_design": {
        "baselines_included": {
          "applies": true,
          "answer": true,
          "justification": "Table 1 compares against 26 agent-based approaches including both open-source (SWE-agent, AutoCodeRover, Moatless, Aider) and closed-source commercial tools. A RAG agentless baseline is also included."
        },
        "baselines_contemporary": {
          "applies": true,
          "answer": true,
          "justification": "Baselines include contemporary 2024 tools: CodeStory Aide, Bytedance MarsCode, SWE-agent with Claude 3.5 Sonnet, AutoCodeRover-v2, and many others from the SWE-bench leaderboard."
        },
        "ablation_study": {
          "applies": true,
          "answer": true,
          "justification": "Section 5.2 presents extensive ablation studies covering localization (Table 2), repair (Table 3), and patch validation (Table 4). Individual design choices are systematically varied."
        },
        "multiple_metrics": {
          "applies": true,
          "answer": true,
          "justification": "The paper reports % Resolved, Avg. $ Cost, Avg. # Tokens, and % Correct Location at three granularities (line, function, file). Section 5.1.3 also evaluates reproduction test quality."
        },
        "human_evaluation": {
          "applies": true,
          "answer": false,
          "justification": "Section 6.1 describes manual classification of SWE-bench Lite problems, but the schema explicitly states 'manual classification of the benchmark or dataset itself does not count.' There is no human evaluation of AGENTLESS's outputs (generated patches, localizations, or reproduction tests). The evaluation of system outputs is entirely automated via test suites."
        },
        "held_out_test_set": {
          "applies": true,
          "answer": true,
          "justification": "SWE-bench Lite is used as a fixed test set. No part of the benchmark is used for tuning the approach. The paper explicitly notes (footnote 2) that ground truth patches are only used for evaluation of reproduction test quality, not for the AGENTLESS process itself."
        },
        "per_category_breakdown": {
          "applies": true,
          "answer": true,
          "justification": "Figure 9 shows solve rates broken down by problem category (description quality, solution type, location information) for multiple tools. Tables 2-4 provide per-component breakdowns."
        },
        "failure_cases_discussed": {
          "applies": true,
          "answer": true,
          "justification": "Section 6.2 identifies that closed-source agent tools outperform AGENTLESS on problems with no location clues. Section 5.1.3 discusses the reproduction test drop-off (213 selected to 94 plausible). These are specific failure modes."
        },
        "negative_results_reported": {
          "applies": true,
          "answer": true,
          "justification": "The ablation studies report configurations that hurt performance: Table 2 shows 'direct from file-level' localization is worse (47.00% vs 56.33%), Table 3 shows merged multi-samples (28.33%) underperform multi-samples (32.00%)."
        }
      },
      "claims_and_evidence": {
        "abstract_claims_supported": {
          "applies": true,
          "answer": true,
          "justification": "The abstract claims 32.00% solve rate (96 correct fixes) at $0.70 cost among open-source approaches, supported by Table 1. The claim about OpenAI adoption is discussed in Section 5.1.4."
        },
        "causal_claims_justified": {
          "applies": true,
          "answer": true,
          "justification": "The ablation studies (Section 5.2) use controlled single-variable manipulation: one component is varied while others remain at default settings. This is adequate causal design for claims like 'reproduction tests improve performance from 81 to 96 issues.'"
        },
        "generalization_bounded": {
          "applies": true,
          "answer": false,
          "justification": "The paper tests only on Python repositories in SWE-bench Lite, but the title 'Demystifying LLM-based Software Engineering Agents' and the broad framing ('autonomous software development') imply generality beyond Python bug-fixing. Section 7 notes results may not generalize but the framing throughout overstates the scope."
        },
        "alternative_explanations_discussed": {
          "applies": true,
          "answer": false,
          "justification": "Section 7 mentions data leakage as an internal threat but does not discuss alternative explanations for why AGENTLESS outperforms agents. The paper does not consider whether the advantage stems from benchmark characteristics (Python-only, well-formed issues with location clues) rather than the agentless approach itself. Nor does it discuss whether GPT-4o specifically is better suited to the fixed-pipeline approach than other models."
        }
      },
      "setup_transparency": {
        "model_versions_specified": {
          "applies": true,
          "answer": true,
          "justification": "Section 4 explicitly states 'GPT-4o (gpt-4o-2024-05-13)' with the specific API version, and 'text-embedding-3-small' for embeddings."
        },
        "prompts_provided": {
          "applies": true,
          "answer": false,
          "justification": "The paper describes what prompts do in natural language (e.g., 'we prompt the LLM to localize and rank the top N most suspicious files') but does not provide actual prompt text. There is no appendix with full prompts."
        },
        "hyperparameters_reported": {
          "applies": true,
          "answer": true,
          "justification": "Section 4 specifies: greedy decoding by default, sampling temperature of 0.8, chunk size 512, chunk overlap 0, top 3 suspicious files, 4 samples of edit locations, 10 patches per location set, 40 total patches, 40 reproduction test samples, and context window of +/- 10 lines."
        },
        "scaffolding_described": {
          "applies": true,
          "answer": true,
          "justification": "The three-phase pipeline (localization, repair, patch validation) is described in detail in Section 3 with workflow diagram (Figure 1). Each step's inputs and outputs are explicitly described. While AGENTLESS is explicitly NOT agentic, it does use LLM-based scaffolding (hierarchical prompting, reproduction test generation, majority voting) which is well-described."
        },
        "data_preprocessing_documented": {
          "applies": true,
          "answer": true,
          "justification": "Section 6.1 describes the manual classification procedure with categories and criteria. Section 6.2 documents the filtering from SWE-bench Lite (300) to SWE-bench Lite-S (249) with explicit exclusion criteria."
        }
      },
      "limitations_and_scope": {
        "limitations_section_present": {
          "applies": true,
          "answer": true,
          "justification": "Section 7 'Threats to Validity' provides dedicated discussion of internal (data leakage) and external (benchmark generalization) threats."
        },
        "threats_to_validity_specific": {
          "applies": true,
          "answer": true,
          "justification": "The threats are specific: internal threat identifies GPT-4o training data leakage on SWE-bench patches specifically, citing the SWE-bench authors' analysis. External threat identifies SWE-bench Lite as the sole evaluation dataset. Section 6.2 also specifically identifies that AGENTLESS underperforms on problems with no location clues."
        },
        "scope_boundaries_stated": {
          "applies": true,
          "answer": false,
          "justification": "Section 7 acknowledges results 'might not generalize to other datasets' but does not explicitly state what the results do NOT show. It does not bound claims to Python, acknowledge the single-model (GPT-4o) limitation, or state that the results are for bug-fixing tasks specifically."
        }
      },
      "data_integrity": {
        "raw_data_available": {
          "applies": true,
          "answer": true,
          "justification": "SWE-bench Lite is publicly available with all issues and ground truth patches. AGENTLESS is open-sourced, so generated patches can be reproduced and verified."
        },
        "data_collection_described": {
          "applies": true,
          "answer": true,
          "justification": "The benchmark data originates from SWE-bench (cited), whose collection methodology is documented in the original paper. For the manual classification in Section 6.1, the dimensions and categories are described in detail."
        },
        "recruitment_methods_described": {
          "applies": false,
          "answer": false,
          "justification": "No human participants were recruited. The evaluation uses a standard public benchmark (SWE-bench Lite). The manual classification was performed by the authors themselves."
        },
        "data_pipeline_documented": {
          "applies": true,
          "answer": true,
          "justification": "The pipeline from SWE-bench Lite (300 problems) to SWE-bench Lite-S (249 problems) is documented with filtering criteria (exact patches, misleading solutions, insufficient information). The AGENTLESS pipeline from localization to patch selection is fully documented."
        }
      },
      "conflicts_of_interest": {
        "funding_disclosed": {
          "applies": true,
          "answer": false,
          "justification": "The acknowledgments section thanks two individuals and mentions a bike but discloses no funding source (grants, corporate sponsors, or funding agencies)."
        },
        "affiliations_disclosed": {
          "applies": true,
          "answer": true,
          "justification": "All four authors are listed with University of Illinois Urbana-Champaign affiliation. They evaluate third-party models (GPT-4o, Claude 3.5 Sonnet) and are not affiliated with OpenAI or Anthropic."
        },
        "funder_independent_of_outcome": {
          "applies": true,
          "answer": false,
          "justification": "No funding source is disclosed. The authors are at a major research university using substantial API credits, so funding likely exists but is not disclosed. Funder independence cannot be confirmed without knowing the funder."
        },
        "financial_interests_declared": {
          "applies": true,
          "answer": false,
          "justification": "No competing interests statement is present in the paper. The authors may have interests via the AGENTLESS GitHub project or its commercial adoption, but no disclosure is made."
        }
      },
      "contamination": {
        "training_cutoff_stated": {
          "applies": true,
          "answer": false,
          "justification": "Section 7 discusses contamination risk but states 'GPT-4o is a closed-source model, we do not have access to the training data.' The specific training cutoff date for GPT-4o is not provided."
        },
        "train_test_overlap_discussed": {
          "applies": true,
          "answer": true,
          "justification": "Section 7 explicitly addresses this: 'One threat to validity comes from the data leakage of ground truth developer patches in SWE-bench Lite being part of the training data for GPT-4o.' Cites SWE-bench authors' temporal analysis as mitigating evidence."
        },
        "benchmark_contamination_addressed": {
          "applies": true,
          "answer": true,
          "justification": "Section 7 discusses the contamination risk and cites the SWE-bench authors' finding of 'no significant difference' in resolve rates before/after GPT-4's knowledge cutoff. While incomplete for GPT-4o specifically, the concern is explicitly addressed."
        }
      },
      "human_studies": {
        "pre_registered": {
          "applies": false,
          "answer": false,
          "justification": "No human participants. This is a benchmark evaluation paper."
        },
        "irb_or_ethics_approval": {
          "applies": false,
          "answer": false,
          "justification": "No human participants involved."
        },
        "demographics_reported": {
          "applies": false,
          "answer": false,
          "justification": "No human participants involved."
        },
        "inclusion_exclusion_criteria": {
          "applies": false,
          "answer": false,
          "justification": "No human participants involved."
        },
        "randomization_described": {
          "applies": false,
          "answer": false,
          "justification": "No human participants involved."
        },
        "blinding_described": {
          "applies": false,
          "answer": false,
          "justification": "No human participants involved."
        },
        "attrition_reported": {
          "applies": false,
          "answer": false,
          "justification": "No human participants involved."
        }
      },
      "cost_and_practicality": {
        "inference_cost_reported": {
          "applies": true,
          "answer": true,
          "justification": "Table 1 reports average cost per issue as $0.70. Per-component costs are broken down in Tables 2-4 (e.g., $0.02 for prompting-based file localization, $0.25 for reproduction tests)."
        },
        "compute_budget_stated": {
          "applies": true,
          "answer": false,
          "justification": "Per-issue average cost is reported ($0.70) but the total computational budget across all 300 problems is not explicitly stated. No hardware specs, GPU hours, or wall-clock time are provided."
        }
      }
    },
    "comparison_summary": {
      "sonnet_generous_count": 1,
      "opus_generous_count": 0,
      "applies_boundary_count": 1,
      "interpretive_count": 0,
      "disagreement_categories": {
        "evaluation_design.human_evaluation": "sonnet_generous: Sonnet credits manual benchmark classification as human evaluation; schema explicitly excludes this.",
        "conflicts_of_interest.funder_independent_of_outcome": "applies_boundary: Sonnet sets applies=false (no funder disclosed, in effect assuming unfunded); Opus sets applies=true (absence of disclosure is not confirmation of absence)."
      },
      "notes": "High agreement (48/50 = 96%). Two disagreements: (1) human_evaluation: Sonnet credits Section 6.1 manual classification of SWE-bench Lite problems as human evaluation despite the schema explicitly stating 'manual classification of the benchmark or dataset itself does not count.' Sonnet even notes in its own justification 'though not of system outputs directly' but still answers YES, a clear generosity error. (2) funder_independent_of_outcome: Sonnet sets applies=false because no funder is disclosed, in effect treating the work as unfunded. However, the authors are at UIUC using substantial OpenAI API compute; unfunded status is not confirmed. The schema says 'NA if unfunded' but absence of disclosure is not evidence of absence. Both disagreements follow known Sonnet patterns: generosity bias (crediting partial evidence as YES) and NA boundary confusion (using applies=false to avoid a negative answer)."
    }
  }
}
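
The summary fields in this file are redundant with one another (`agreement_rate` should equal `agreements / total_questions`, the `*_count` tallies should match the `direction` labels in `disagreement_details`, and so on), so they can be cross-checked mechanically. The following is a minimal sketch of such a check, not part of the repository: `check_calibration` and the trimmed `sample` dict are hypothetical names introduced here, and only the field names and values are taken from the file above.

```python
import math


def check_calibration(cal: dict) -> list[str]:
    """Return human-readable inconsistencies found in one calibration record."""
    problems = []
    total = cal["total_questions"]
    agree = cal["agreements"]
    disagree = cal["disagreements"]
    details = cal["disagreement_details"]
    summary = cal["comparison_summary"]

    if agree + disagree != total:
        problems.append(f"agreements + disagreements = {agree + disagree}, expected {total}")
    if not math.isclose(cal["agreement_rate"], agree / total, abs_tol=0.005):
        problems.append(f"agreement_rate {cal['agreement_rate']} != {agree}/{total}")
    if len(details) != disagree:
        problems.append(f"{len(details)} disagreement_details entries, expected {disagree}")

    # Each direction label is assumed to be tallied as "<direction>_count".
    for direction in sorted({d["direction"] for d in details}):
        expected = sum(1 for d in details if d["direction"] == direction)
        if summary.get(f"{direction}_count") != expected:
            problems.append(f"{direction}_count should be {expected}")
    count_total = sum(v for k, v in summary.items() if k.endswith("_count"))
    if count_total != disagree:
        problems.append(f"*_count fields sum to {count_total}, expected {disagree}")

    # disagreement_categories is keyed by "<category>.<question>".
    keys = {f"{d['category']}.{d['question']}" for d in details}
    if keys != set(summary["disagreement_categories"]):
        problems.append("disagreement_categories keys do not match disagreement_details")

    # When the full checklist is present, its question count should match the total.
    checklist = cal.get("opus_checklist")
    if checklist is not None:
        n_questions = sum(len(questions) for questions in checklist.values())
        if n_questions != total:
            problems.append(f"opus_checklist has {n_questions} questions, expected {total}")
    return problems


# Trimmed excerpt of the record above (explanations and checklist omitted).
sample = {
    "agreement_rate": 0.96,
    "total_questions": 50,
    "agreements": 48,
    "disagreements": 2,
    "disagreement_details": [
        {"category": "evaluation_design", "question": "human_evaluation",
         "direction": "sonnet_generous"},
        {"category": "conflicts_of_interest", "question": "funder_independent_of_outcome",
         "direction": "applies_boundary"},
    ],
    "comparison_summary": {
        "sonnet_generous_count": 1,
        "opus_generous_count": 0,
        "applies_boundary_count": 1,
        "interpretive_count": 0,
        "disagreement_categories": {
            "evaluation_design.human_evaluation": "sonnet_generous",
            "conflicts_of_interest.funder_independent_of_outcome": "applies_boundary",
        },
    },
}

print(check_calibration(sample))
```

To run the same check against the full file, one would load it with `json.load` and pass the object stored under the top-level `"calibration"` key.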