anchors.yaml - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

anchors.yaml (8175B)
# Calibration anchor set for rubric weight learning.
#
# Each anchor = a paper ID + a target score band [low, high] + a rationale.
# Bands are RANGES, not exact scores. Label what you believe is true; let the
# optimizer fit weights that respect those beliefs.
#
# Guidelines for labeling:
#   - 0-20   "bad": methodologically broken (fraud, unsupported causal claims,
#                   trivial sample, industry overview with zero rigor)
#   - 20-40  "weak": real but underpowered or overclaimed
#   - 40-60  "typical": median-of-field. Rigor varies; nothing disqualifying
#   - 60-80  "good": clearly rigorous, transparent, reproducible
#   - 80-95  "excellent": landmark methodology papers, meta-analyses, tight
#                         design + full artifact release
#
# Aim for 15-25 anchors spread across the range. Too few and the optimizer
# overfits. Too many of one band and you push everything toward the middle.
#
# Run fit-weights.py after edits; it commits weights.json next to this file.

anchors:

  # =====================================================================
  # Known-bad (0-20): disqualifying flaws visible from the paper itself
  # =====================================================================
  - id: wakefield-ileal-lymphoid-1998
    band: [0, 15]
    rationale: Retracted MMR fraud. Causal claim from N=12 case series, no
      control, selection bias, undisclosed COI, no mechanism, no prior
      plausibility. Should be a score floor.

  - id: aidriven-software-engineering-2023
    band: [0, 20]
    rationale: Industry opinion/overview piece. No empirical content, no
      methodology, no data. Representative of the "thin survey" class that
      the rubric should heavily penalize. Current score 2.7.

  - id: automating-rest-api-2024
    band: [0, 20]
    rationale: Automating REST API Postman Test Cases Using LLM. Empirical
      in form but paper-thin. Tiny scope, minimal methodology. Forces the
      optimizer to care about stat_methodology / evaluation_design even on
      bad papers - previous fit zeroed those categories because Wakefield
      alone couldn't constrain them. Score 2.4.

  - id: aipowered-code-review-2024
    band: [5, 25]
    rationale: AI-powered Code Review with LLMs, Early Results. Early
      empirical work, underpowered, limited eval. Another empirical-bad
      anchor so the rubric can't down-weight stats-adjacent categories
      to zero and still hit the bad-paper targets. Score 7.7.

  - id: introduction-generative-ai-2025
    band: [0, 20]
    rationale: Introduction to Generative AI and DevOps. Explicit tutorial,
      narrative, no empirics. Score 7.1.

  - id: generative-ai-software-2024
    band: [0, 22]
    rationale: Generative AI in Software Development, An Overview and
      Evaluation. Narrative review, qualitative comparison of seven coding
      tools, no systematic methodology. Score 6.9.

  # =====================================================================
  # Known-good (70-90): rigorous, landmark, or methodology reference
  # =====================================================================
  - id: attention-is-all-you-need-2017
    band: [70, 85]
    rationale: Foundational transformer architecture. Clear methods, clear
      contribution, extensive ablations (given era). Currently scored 52.8,
      which conflates "methodology reporting" with "limited artifact
      practice of its era" - should clearly outrank Wakefield.

  - id: bert-pretraining-deep-2018
    band: [70, 85]
    rationale: Landmark pre-training paper. Careful ablations, public model,
      reproducible. Currently 55.0.

  - id: deep-rl-matters-2018
    band: [80, 92]
    rationale: Rigorous meta-analysis of RL reproducibility problems. Sets the
      standard for methodology critique. Currently 91.2 (keep as high-band
      anchor, the rubric already treats it well).

  - id: show-your-work-2019
    band: [80, 92]
    rationale: Improved Reporting of Experimental Results. The paper advocating
      for rigor is itself high-rigor. Currently 91.4.

  - id: alphacode-competition-level-2022
    band: [75, 90]
    rationale: Thorough evaluation, clear methodology, DeepMind scale of
      reporting. Currently 85.7.

  - id: arc-measure-intelligence-2019
    band: [70, 85]
    rationale: Chollet's conceptual landmark on what intelligence measurement
      requires. Currently 64.7.

  - id: react-synergizing-reasoning-2022
    band: [60, 80]
    rationale: Enabled the modern agentic era. Methodologically solid for
      its contribution type. Currently 48.2.

  - id: leakage-reproducibility-crisis-2023
    band: [80, 92]
    rationale: Kapoor and Narayanan. Systematic review of data leakage across
      ML science. Rigorous, reproducible, influential - essentially the
      modern heir to Ioannidis for the ML field. Currently 81.2.

  - id: gans-created-equal-2018
    band: [78, 92]
    rationale: Rigorous large-scale meta-analysis showing GAN progress claims
      were noise. Exactly the kind of high-rigor critique the field relies on.
      Currently 81.1.

  - id: troubling-trends-ml-2018
    band: [75, 90]
    rationale: Lipton and Steinhardt. Structured critique of ML research
      practices. Important, well-argued, influenced field norms. Currently 81.8.

  - id: questionable-practices-ml-2024
    band: [72, 88]
    rationale: Direct successor to Troubling Trends. Catalogs QRPs in ML with
      examples. Currently 66.7 - band slightly below ideal to match what the
      rubric currently captures.

  - id: reforms-consensus-ml-2024
    band: [75, 90]
    rationale: REFORMS checklist. Community consensus work on reporting
      standards for ML-based science. The artifact IS the methodology.
      Currently 73.1.

  # =====================================================================
  # Middling (40-60): typical papers at the corpus median
  # =====================================================================
  - id: codebert-pretrained-model-2020
    band: [45, 62]
    rationale: Foundational code pre-training. Clear method, adequate
      evaluation. Representative "solid middle of field" paper. Currently 52.5.

  - id: chain-of-thought-prompting-2022
    band: [50, 70]
    rationale: Landmark prompting-technique paper. Straightforward method,
      limited ablations, good impact. Currently 56.6.

# =====================================================================
# Pairwise ordering constraints (soft, in addition to bands)
# =====================================================================
# Each pair is [lower, higher]: after fitting, the first paper should score
# below the second by at least pair_margin. The optimizer applies a hinge
# penalty when a pair does not hold. Fill in as you add anchors.
pairs:
  - [wakefield-ileal-lymphoid-1998, attention-is-all-you-need-2017]
  - [wakefield-ileal-lymphoid-1998, bert-pretraining-deep-2018]
  - [wakefield-ileal-lymphoid-1998, deep-rl-matters-2018]
  - [wakefield-ileal-lymphoid-1998, show-your-work-2019]
  - [aidriven-software-engineering-2023, attention-is-all-you-need-2017]
  - [aidriven-software-engineering-2023, react-synergizing-reasoning-2022]
  - [codebert-pretrained-model-2020, deep-rl-matters-2018]
  - [codebert-pretrained-model-2020, show-your-work-2019]
  - [wakefield-ileal-lymphoid-1998, leakage-reproducibility-crisis-2023]
  - [wakefield-ileal-lymphoid-1998, gans-created-equal-2018]
  - [wakefield-ileal-lymphoid-1998, troubling-trends-ml-2018]
  - [automating-rest-api-2024, react-synergizing-reasoning-2022]
  - [automating-rest-api-2024, bert-pretraining-deep-2018]
  - [aipowered-code-review-2024, show-your-work-2019]
  - [introduction-generative-ai-2025, leakage-reproducibility-crisis-2023]
  - [generative-ai-software-2024, attention-is-all-you-need-2017]

# Optimization settings. Leave defaults unless you know why you're changing.
settings:
  level: category          # "category" (14 params) or "question" (~60 params)
  min_weight: 0.0
  max_weight: 5.0
  l2_reg: 0.3              # Penalty against deviating from uniform weights.
  pair_margin: 20.0        # Desired separation between pairs (in score pts).
  pair_penalty: 2.0        # Weight on pair ordering violations vs band fit.
  seed: 42
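
The comments in this file describe three ingredients of the fit: a band term, a hinge penalty on pair ordering, and an L2 penalty toward uniform weights. fit-weights.py itself is not shown here, so what follows is only a minimal sketch of how such an objective might be put together, not the script's actual implementation. The function names, the squared-distance form of the band term, and the choice of 1.0 as the "uniform" weight are all assumptions made for illustration. The default values mirror the settings block above, and the example scores are the "Currently" values quoted in the rationales.

```python
from typing import Dict, List, Sequence, Tuple

Band = Tuple[float, float]


def band_loss(score: float, band: Band) -> float:
    """Squared distance to the nearest band edge; zero inside the band (assumed form)."""
    low, high = band
    if score < low:
        return (low - score) ** 2
    if score > high:
        return (score - high) ** 2
    return 0.0


def pair_hinge(lower: float, higher: float, margin: float) -> float:
    """Hinge penalty: zero once `higher` exceeds `lower` by at least `margin`."""
    return max(0.0, margin - (higher - lower))


def objective(
    scores: Dict[str, float],
    anchors: Dict[str, Band],
    pairs: List[Tuple[str, str]],
    weights: Sequence[float],
    *,
    pair_margin: float = 20.0,
    pair_penalty: float = 2.0,
    l2_reg: float = 0.3,
    uniform: float = 1.0,  # assumption: "uniform weights" means every weight is 1.0
) -> float:
    """Band fit + weighted pair hinge + L2 pull toward uniform weights."""
    band_term = sum(band_loss(scores[pid], band) for pid, band in anchors.items())
    pair_term = sum(pair_hinge(scores[lo], scores[hi], pair_margin) for lo, hi in pairs)
    reg_term = sum((w - uniform) ** 2 for w in weights)
    return band_term + pair_penalty * pair_term + l2_reg * reg_term


# Two anchors and one pair from this file, scored with their current values.
example_anchors = {
    "aidriven-software-engineering-2023": (0.0, 20.0),
    "attention-is-all-you-need-2017": (70.0, 85.0),
}
example_scores = {
    "aidriven-software-engineering-2023": 2.7,
    "attention-is-all-you-need-2017": 52.8,
}
example_pairs = [
    ("aidriven-software-engineering-2023", "attention-is-all-you-need-2017"),
]
example_total = objective(
    example_scores, example_anchors, example_pairs, weights=[1.0] * 14
)
print(f"objective at uniform weights: {example_total:.2f}")
```

In this example the pair is already separated by more than pair_margin, so the hinge contributes nothing, and the whole loss comes from attention-is-all-you-need-2017 sitting below its band. That matches the rationale above: the bands, not the pairs, are what pull that paper's score upward.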
	git clone https://git.shiptheloop.com/ai-research-survey.git