anchors.yaml (8175B)
# Calibration anchor set for rubric weight learning.
#
# Each anchor = a paper ID + a target score band [low, high] + a rationale.
# Bands are RANGES, not exact scores. Label what you believe is true; let the
# optimizer fit weights that respect those beliefs.
#
# Guidelines for labeling:
#   - 0-20   "bad": methodologically broken (fraud, unsupported causal claims,
#            trivial sample, industry overview with zero rigor)
#   - 20-40  "weak": real but underpowered or overclaimed
#   - 40-60  "typical": median-of-field. Rigor varies; nothing disqualifying
#   - 60-80  "good": clearly rigorous, transparent, reproducible
#   - 80-95  "excellent": landmark methodology papers, meta-analyses, tight
#            design + full artifact release
#
# Aim for 15-25 anchors spread across the range. Too few and the optimizer
# overfits. Too many of one band and you push everything toward the middle.
#
# Run fit-weights.py after edits. Commits weights.json next to this file.

anchors:

  # =====================================================================
  # Known-bad (0-20): disqualifying flaws visible from the paper itself
  # =====================================================================
  - id: wakefield-ileal-lymphoid-1998
    band: [0, 15]
    rationale: Retracted MMR fraud. Causal claim from N=12 case series, no
      control, selection bias, undisclosed COI, no mechanism, no prior
      plausibility. Should be a score floor.

  - id: aidriven-software-engineering-2023
    band: [0, 20]
    rationale: Industry opinion/overview piece. No empirical content, no
      methodology, no data. Representative of the "thin survey" class that
      the rubric should heavily penalize. Current score 2.7.

  - id: automating-rest-api-2024
    band: [0, 20]
    rationale: Automating REST API Postman Test Cases Using LLM. Empirical
      in form but paper-thin. Tiny scope, minimal methodology. Forces the
      optimizer to care about stat_methodology / evaluation_design even on
      bad papers - previous fit zero'd those categories because Wakefield
      alone couldn't constrain them. Score 2.4.

  - id: aipowered-code-review-2024
    band: [5, 25]
    rationale: AI-powered Code Review with LLMs, Early Results. Early
      empirical work, underpowered, limited eval. Another empirical-bad
      anchor so the rubric can't down-weight stats-adjacent categories
      to zero and still hit the bad-paper targets. Score 7.7.

  - id: introduction-generative-ai-2025
    band: [0, 20]
    rationale: Introduction to Generative AI and DevOps. Explicit tutorial,
      narrative, no empirics. Score 7.1.

  - id: generative-ai-software-2024
    band: [0, 22]
    rationale: Generative AI in Software Development, An Overview and
      Evaluation. Narrative review, qualitative comparison of seven coding
      tools, no systematic methodology. Score 6.9.

  # =====================================================================
  # Known-good (70-90): rigorous, landmark, or methodology reference
  # =====================================================================
  - id: attention-is-all-you-need-2017
    band: [70, 85]
    rationale: Foundational transformer architecture. Clear methods, clear
      contribution, extensive ablations (given era). Currently scored 52.8
      which conflates "methodology reporting" with "limited artifact
      practice of its era" - should clearly outrank Wakefield.

  - id: bert-pretraining-deep-2018
    band: [70, 85]
    rationale: Landmark pre-training paper. Careful ablations, public model,
      reproducible. Currently 55.0.

  - id: deep-rl-matters-2018
    band: [80, 92]
    rationale: Rigorous meta-analysis of RL reproducibility problems. Sets the
      standard for methodology critique. Currently 91.2 (keep as high-band
      anchor, the rubric already treats it well).

  - id: show-your-work-2019
    band: [80, 92]
    rationale: Improved Reporting of Experimental Results. The paper advocating
      for rigor is itself high-rigor. Currently 91.4.

  - id: alphacode-competition-level-2022
    band: [75, 90]
    rationale: Thorough evaluation, clear methodology, DeepMind scale of
      reporting. Currently 85.7.

  - id: arc-measure-intelligence-2019
    band: [70, 85]
    rationale: Chollet's conceptual landmark on what intelligence measurement
      requires. Currently 64.7.

  - id: react-synergizing-reasoning-2022
    band: [60, 80]
    rationale: Enabled the modern agentic era. Methodologically solid for
      its contribution type. Currently 48.2.

  - id: leakage-reproducibility-crisis-2023
    band: [80, 92]
    rationale: Kapoor and Narayanan. Systematic review of data leakage across
      ML science. Rigorous, reproducible, influential - essentially the
      modern heir to Ioannidis for the ML field. Currently 81.2.

  - id: gans-created-equal-2018
    band: [78, 92]
    rationale: Rigorous large-scale meta-analysis showing GAN progress claims
      were noise. Exactly the kind of high-rigor critique the field relies on.
      Currently 81.1.

  - id: troubling-trends-ml-2018
    band: [75, 90]
    rationale: Lipton and Steinhardt. Structured critique of ML research
      practices. Important, well-argued, influenced field norms. Currently 81.8.

  - id: questionable-practices-ml-2024
    band: [72, 88]
    rationale: Direct successor to Troubling Trends. Catalogs QRPs in ML with
      examples. Currently 66.7 - band slightly below ideal to match what the
      rubric currently captures.

  - id: reforms-consensus-ml-2024
    band: [75, 90]
    rationale: REFORMS checklist. Community consensus work on reporting
      standards for ML-based science. The artifact IS the methodology.
      Currently 73.1.

  # =====================================================================
  # Middling (40-60): typical papers at the corpus median
  # =====================================================================
  - id: codebert-pretrained-model-2020
    band: [45, 62]
    rationale: Foundational code pre-training. Clear method, adequate
      evaluation. Representative "solid middle of field" paper. Currently 52.5.

  - id: chain-of-thought-prompting-2022
    band: [50, 70]
    rationale: Landmark prompting-technique paper. Straightforward method,
      limited ablations, good impact. Currently 56.6.

# =====================================================================
# Pairwise ordering constraints (soft, in addition to bands)
# =====================================================================
# After fitting, these pairs should hold. Optimizer applies a hinge
# penalty if they don't. Fill in as you add anchors.
pairs:
  - [wakefield-ileal-lymphoid-1998, attention-is-all-you-need-2017]
  - [wakefield-ileal-lymphoid-1998, bert-pretraining-deep-2018]
  - [wakefield-ileal-lymphoid-1998, deep-rl-matters-2018]
  - [wakefield-ileal-lymphoid-1998, show-your-work-2019]
  - [aidriven-software-engineering-2023, attention-is-all-you-need-2017]
  - [aidriven-software-engineering-2023, react-synergizing-reasoning-2022]
  - [codebert-pretrained-model-2020, deep-rl-matters-2018]
  - [codebert-pretrained-model-2020, show-your-work-2019]
  - [wakefield-ileal-lymphoid-1998, leakage-reproducibility-crisis-2023]
  - [wakefield-ileal-lymphoid-1998, gans-created-equal-2018]
  - [wakefield-ileal-lymphoid-1998, troubling-trends-ml-2018]
  - [automating-rest-api-2024, react-synergizing-reasoning-2022]
  - [automating-rest-api-2024, bert-pretraining-deep-2018]
  - [aipowered-code-review-2024, show-your-work-2019]
  - [introduction-generative-ai-2025, leakage-reproducibility-crisis-2023]
  - [generative-ai-software-2024, attention-is-all-you-need-2017]

# Optimization settings. Leave defaults unless you know why you're changing.
settings:
  level: category        # "category" (14 params) or "question" (~60 params)
  min_weight: 0.0
  max_weight: 5.0
  l2_reg: 0.3            # Penalty against deviating from uniform weights.
  pair_margin: 20.0      # Desired separation between pairs (in score pts).
  pair_penalty: 2.0      # Weight on pair ordering violations vs band fit.
  seed: 42
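
Note on how the settings could combine: fit-weights.py itself is not shown here, so the following is only a minimal sketch of one objective that matches the comments in this file (band fit, a hinge penalty on pair ordering, and an L2 pull toward uniform weights). The function names (band_loss, pair_hinge, l2_penalty, total_loss) are hypothetical. The squared out-of-band distance, the reading of each pair as [lower, higher], and a uniform weight of 1.0 are assumptions. The example scores are the "Current score" / "Currently" values quoted in the rationales.

```python
from typing import Dict, List, Tuple

# Values copied from the `settings:` block of this file.
PAIR_MARGIN = 20.0
PAIR_PENALTY = 2.0
L2_REG = 0.3


def band_loss(score: float, band: Tuple[float, float]) -> float:
    """Squared distance to the nearest band edge; 0 inside the band (assumed form)."""
    low, high = band
    if score < low:
        return (low - score) ** 2
    if score > high:
        return (score - high) ** 2
    return 0.0


def pair_hinge(lower: float, higher: float, margin: float = PAIR_MARGIN) -> float:
    """Hinge penalty: zero once `higher` beats `lower` by at least `margin`."""
    return max(0.0, margin - (higher - lower))


def l2_penalty(weights: List[float], uniform: float = 1.0) -> float:
    """Squared deviation from uniform weights (uniform value of 1.0 is assumed)."""
    return sum((w - uniform) ** 2 for w in weights)


def total_loss(
    scores: Dict[str, float],
    bands: Dict[str, Tuple[float, float]],
    pairs: List[Tuple[str, str]],
    weights: List[float],
) -> float:
    """Band fit + weighted pair-ordering hinge + L2 regularization."""
    band_term = sum(band_loss(scores[k], bands[k]) for k in bands)
    pair_term = sum(pair_hinge(scores[lo], scores[hi]) for lo, hi in pairs)
    return band_term + PAIR_PENALTY * pair_term + L2_REG * l2_penalty(weights)


# Three anchors from this file, with the current scores quoted in their rationales.
scores = {
    "aidriven-software-engineering-2023": 2.7,
    "attention-is-all-you-need-2017": 52.8,
    "react-synergizing-reasoning-2022": 48.2,
}
bands = {
    "aidriven-software-engineering-2023": (0, 20),
    "attention-is-all-you-need-2017": (70, 85),
    "react-synergizing-reasoning-2022": (60, 80),
}
pairs = [
    ("aidriven-software-engineering-2023", "attention-is-all-you-need-2017"),
    ("aidriven-software-engineering-2023", "react-synergizing-reasoning-2022"),
]
weights = [1.0] * 14  # level: category -> 14 params, all at the assumed uniform value

loss = total_loss(scores, bands, pairs, weights)
print(round(loss, 2))
```

Under these assumptions both pairs already clear the 20-point margin, so the whole loss comes from the two known-good anchors that currently score below their bands. That is the pressure the band labels are meant to put on the weights.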