calibration.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

calibration.json (21704B)
{
  "paper_slug": "adaptive-visionbased-coverage-2025",
  "calibrator": "opus",
  "calibration_date": "2026-02-28",
  "total_questions": 50,
  "agreement_count": 46,
  "disagreement_count": 4,
  "agreement_rate": 0.92,
  "disagreements": [
    {
      "category": "statistical_methodology",
      "question": "effect_sizes_reported",
      "sonnet": {"applies": true, "answer": false},
      "opus": {"applies": true, "answer": true},
      "direction": "opus_generous",
      "explanation": "The schema states that 'A paper that says 12% improvement over baseline (from 45% to 57%) provides enough context for YES.' This paper reports '26.5% improvement in coverage' with raw numbers visible in Table 2 (91.8% vs baseline values), '32% reduction in energy consumption' (68 vs 85/94/100 units), and similar relative differences for all metrics. These are percentage improvements with baseline context, which the schema explicitly says qualifies as effect sizes. While the baselines chosen for percentage calculations are inconsistent (a separate concern), effect size magnitudes ARE reported."
    },
    {
      "category": "statistical_methodology",
      "question": "variance_reported",
      "sonnet": {"applies": true, "answer": true},
      "opus": {"applies": true, "answer": false},
      "direction": "sonnet_generous",
      "explanation": "Sonnet credits the paper based on the statement that experiments are 'repeated five times with different random seeds' and that Table 2 includes 'Learning Stability (Variance).' However, the 'Learning Stability (Variance)' metric in Table 2 is reward variance within training (a performance metric comparing methods), NOT variance across the five experimental runs. The schema requires 'std dev in tables, variance across seeds, interquartile range, multiple-run results with spread measures.' The paper reports only averaged point estimates in Table 2 without any standard deviation or confidence intervals across the five runs. The mention that 'standard deviation of reward dropped below 5% of the mean after episode 100' describes within-run convergence behavior, not cross-run variability. No spread measure for the main results is provided."
    },
    {
      "category": "evaluation_design",
      "question": "ablation_study",
      "sonnet": {"applies": true, "answer": true},
      "opus": {"applies": true, "answer": false},
      "direction": "sonnet_generous",
      "explanation": "Sonnet credits the reward coefficient sensitivity analysis (varying alpha, beta, gamma across 27 configurations) as an ablation study. However, this is a hyperparameter sensitivity analysis, not an ablation study. The schema asks: 'Is there an ablation study showing which components matter? Look for: experiments that remove or modify individual components to measure their contribution.' The paper's system has multiple distinct components (Dueling DQN architecture, Prioritized Experience Replay, target network, vision-based localization vs. alternatives, Double Q-learning). None of these are removed or isolated to measure their individual contribution. Sonnet itself acknowledges 'There is no ablation of architectural components.' Varying reward coefficients tests the sensitivity of a single design choice, not the contribution of system components."
    },
    {
      "category": "claims_and_evidence",
      "question": "abstract_claims_supported",
      "sonnet": {"applies": true, "answer": true},
      "opus": {"applies": true, "answer": false},
      "direction": "sonnet_generous",
      "explanation": "The abstract claims '26.5% improvement in coverage' and '45% extension of network lifetime' 'compared to traditional distance-based localization.' The distance-based localization baseline is the non-vision MARL method (78.6% coverage). 91.8% vs 78.6% is a 16.8% improvement, not 26.5%. The 26.5% figure only matches if compared against static grid (72.5%), which is not distance-based localization. For network lifetime: 160 vs 120 (non-vision MARL) = 33%, not 45%. The 45% only matches against random placement (110 episodes), which is also not 'distance-based localization.' The abstract's claimed percentages are not supported by the results when the stated comparison baseline (distance-based localization) is used. Sonnet noted 'the baseline used for percentage calculation is not always made explicit' but still gave YES."
    }
  ],
  "opus_checklist": {
    "artifacts": {
      "code_released": {
        "applies": true,
        "answer": false,
        "justification": "No repository URL, GitHub link, or code archive is provided anywhere in the paper. The implementation is described (Python, TensorFlow, OpenAI Gym-style interfaces) but no code is released."
      },
      "data_released": {
        "applies": true,
        "answer": false,
        "justification": "No dataset is released. The CNN was trained on a synthetic dataset of labeled sensor images, but this dataset is not made available. The simulation environment is custom and not shared."
      },
      "environment_specified": {
        "applies": true,
        "answer": false,
        "justification": "The paper mentions Python and TensorFlow but provides no version numbers, no requirements.txt, no Dockerfile, and no detailed environment specification. This is insufficient to recreate the environment."
      },
      "reproduction_instructions": {
        "applies": true,
        "answer": false,
        "justification": "No step-by-step reproduction instructions, README, or reproduction scripts are provided. While hyperparameters and architecture details are described, there are no instructions for running the experiments."
      }
    },
    "statistical_methodology": {
      "confidence_intervals_or_error_bars": {
        "applies": true,
        "answer": false,
        "justification": "All results in Table 2 and the text are reported as point estimates (e.g., 91.8% coverage, 68 energy units). No confidence intervals, error bars, or uncertainty ranges are provided despite averaging over 5 runs."
      },
      "significance_tests": {
        "applies": true,
        "answer": false,
        "justification": "The paper claims the proposed method 'consistently outperforms all baseline strategies' but performs no statistical significance tests. All comparisons are based on directly contrasting point estimates without p-values or any formal test."
      },
      "effect_sizes_reported": {
        "applies": true,
        "answer": true,
        "justification": "The paper reports percentage improvements with baseline context: '26.5% improvement in coverage' (with raw numbers 91.8% vs baselines in Table 2), '32% reduction in energy consumption' (68 vs 85/94/100 units), '22% decrease in redundancy,' and '45% extension of network lifetime.' The schema explicitly states that percentage improvement with baseline context counts as YES."
      },
      "sample_size_justified": {
        "applies": true,
        "answer": false,
        "justification": "The choice of 100 sensor nodes, 200 training episodes, and 5 random seeds is not justified. No power analysis or argument is given for why these quantities are sufficient to draw reliable conclusions."
      },
      "variance_reported": {
        "applies": true,
        "answer": false,
        "justification": "Results in Table 2 are averaged over 5 runs but reported as point estimates only. The 'Learning Stability (Variance)' metric in Table 2 is reward variance within training (a performance metric), not variance across experimental runs. The statement about standard deviation dropping below 5% of the mean after episode 100 describes within-run convergence, not cross-run variability. No standard deviation, IQR, or spread measure for the main results across runs is provided."
      }
    },
    "evaluation_design": {
      "baselines_included": {
        "applies": true,
        "answer": true,
        "justification": "The paper compares against six methods: random placement, static grid deployment, centralized DRL (Liu & Wang 2021), RSSI-based MARL (Zhao & Chen 2022), graph-based MARL (Feng & Zhang 2023), and SLAM-DRL (Khan & Wu 2024). Multiple naive and state-of-the-art baselines are included."
      },
      "baselines_contemporary": {
        "applies": true,
        "answer": true,
        "justification": "The DRL/MARL baselines include work from 2021, 2022, 2023, and 2024, which are recent and represent current approaches. The paper also includes simpler baselines (random, static grid) for reference."
      },
      "ablation_study": {
        "applies": true,
        "answer": false,
        "justification": "The paper performs a reward coefficient sensitivity analysis (varying alpha, beta, gamma across 27 configurations), but this is parameter tuning, not an ablation study. No system components are removed or isolated: the Dueling DQN architecture, Prioritized Experience Replay, target network stabilization, and vision-based localization are never individually ablated to measure their contribution."
      },
      "multiple_metrics": {
        "applies": true,
        "answer": true,
        "justification": "Table 2 reports six metrics: final area coverage (%), energy consumption (units), redundancy rate (%), recovery time (seconds), network lifetime (episodes), and learning stability (variance). Multiple complementary metrics are used."
      },
      "human_evaluation": {
        "applies": false,
        "answer": false,
        "justification": "This is a sensor network coverage optimization paper evaluated entirely with automated simulation metrics. Human evaluation is not relevant to the claims."
      },
      "held_out_test_set": {
        "applies": true,
        "answer": false,
        "justification": "The system is trained and evaluated in the same simulation environment configuration (100 nodes, 500x500m, same parameters). There is no separate held-out test environment with different parameters used exclusively for final evaluation."
      },
      "per_category_breakdown": {
        "applies": true,
        "answer": false,
        "justification": "Results are reported as single aggregate numbers per method in Table 2. The scalability tests with 200 and 300 sensors are mentioned briefly in one sentence but not compared against baselines. No per-scenario, per-density, or per-condition breakdown is provided."
      },
      "failure_cases_discussed": {
        "applies": true,
        "answer": false,
        "justification": "The paper does not discuss failure cases or conditions under which the proposed method underperforms. All reported results show the proposed method as uniformly superior across all metrics and conditions."
      },
      "negative_results_reported": {
        "applies": true,
        "answer": false,
        "justification": "Every experiment shows the proposed method as best. The reward sensitivity analysis notes that reducing beta leads to 'slightly lower total coverage' but this is presented positively as showing robustness. No configurations or conditions where the approach fails are reported."
      }
    },
    "claims_and_evidence": {
      "abstract_claims_supported": {
        "applies": true,
        "answer": false,
        "justification": "The abstract claims '26.5% improvement in coverage' and '45% extension of network lifetime' 'compared to traditional distance-based localization.' The distance-based localization baseline (non-vision MARL using RSSI) achieves 78.6% coverage and 120 episodes lifetime. The actual improvements over this baseline are 16.8% (coverage) and 33% (lifetime), not 26.5% and 45%. The claimed percentages only match if cherry-picking different, weaker baselines for each metric (static grid for coverage, random placement for lifetime)."
      },
      "causal_claims_justified": {
        "applies": true,
        "answer": false,
        "justification": "The paper claims 'incorporating real-time vision feedback into decentralized MARL significantly improves sensing efficiency' and that 'vision-based feedback eliminates the energy cost associated with inter-node localization messaging.' These are causal claims. The baselines differ in multiple ways simultaneously (localization method, architecture, communication), so the vision component's causal contribution cannot be isolated."
      },
      "generalization_bounded": {
        "applies": true,
        "answer": false,
        "justification": "All experiments use a single 2D 500x500m simulation with 100 nodes, but the conclusion claims applicability to 'real-world MWSN deployments' in 'environmental monitoring and smart cities to disaster response and autonomous exploration.' The paper makes no attempt to bound these claims to the tested settings."
      },
      "alternative_explanations_discussed": {
        "applies": true,
        "answer": false,
        "justification": "No alternative explanations are discussed. The paper does not consider whether results could stem from weak baseline implementations (described as 'simplified versions'), favorable hyperparameter tuning, or simulation artifacts. There is no threats-to-validity section."
      }
    },
    "setup_transparency": {
      "model_versions_specified": {
        "applies": true,
        "answer": false,
        "justification": "The paper states 'TensorFlow' without a version number. No version is given for Python or any other library used. While the RL architecture is described (Dueling Double DQN with 128/64/32 neurons), the software framework versions needed for reproduction are absent."
      },
      "prompts_provided": {
        "applies": false,
        "answer": false,
        "justification": "This paper uses deep reinforcement learning agents, not language model prompting. Prompts are not applicable."
      },
      "hyperparameters_reported": {
        "applies": true,
        "answer": true,
        "justification": "Section IV reports comprehensive hyperparameters: learning rate 1e-4, discount factor gamma=0.99, mini-batch size 64, epsilon decay from 1.0 to 0.1 over 200 episodes, replay buffer size 100,000, target network update every 1000 steps, reward coefficients alpha=1, beta=2, gamma=3, network architecture (128/64/32 neurons)."
      },
      "scaffolding_described": {
        "applies": false,
        "answer": false,
        "justification": "This paper uses multi-agent reinforcement learning, not LLM-based agentic scaffolding. The MARL framework is described in detail but scaffolding in the LLM sense does not apply."
      },
      "data_preprocessing_documented": {
        "applies": true,
        "answer": true,
        "justification": "The vision processing pipeline is described in Section III: Gaussian blurring, adaptive thresholding, bounding box extraction, geometric calibration via fiducial markers, histogram equalization, and temporal filtering. The CNN architecture (3 conv layers with 3x3 kernels, ReLU, max pooling, 2 FC layers) is also detailed."
      }
    },
    "limitations_and_scope": {
      "limitations_section_present": {
        "applies": true,
        "answer": false,
        "justification": "There is no dedicated limitations or threats-to-validity section. The paper proceeds directly from performance evaluation to the conclusion section, which discusses only future work directions without acknowledging current limitations."
      },
      "threats_to_validity_specific": {
        "applies": true,
        "answer": false,
        "justification": "No specific threats to validity are discussed anywhere in the paper. The brief mention that 'exact hardware-level power profiling was not performed' is the closest, but this is buried in the energy modeling discussion, not presented as a limitation."
      },
      "scope_boundaries_stated": {
        "applies": true,
        "answer": false,
        "justification": "The paper makes no explicit statements about what its results do NOT show. The conclusion claims broad applicability to 'environmental monitoring and smart cities to disaster response and autonomous exploration' without bounding to the tested 2D simulation setting."
      }
    },
    "data_integrity": {
      "raw_data_available": {
        "applies": true,
        "answer": false,
        "justification": "No raw simulation logs, per-run data, episode-level trajectories, or experimental data are made available. Only aggregated results in Table 2 and figures are presented."
      },
      "data_collection_described": {
        "applies": true,
        "answer": true,
        "justification": "The simulation setup is described in detail: 100 mobile nodes in 500x500m area, sensing radius 20m, communication range 50m, battery energy 100 units, energy cost k=0.5 units/meter, 5 runs with different random seeds, 200 episodes each with 10-second windows."
      },
      "recruitment_methods_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved. This is a simulation study with synthetic sensor networks, so participant recruitment is not applicable."
      },
      "data_pipeline_documented": {
        "applies": true,
        "answer": true,
        "justification": "The processing pipeline is documented: camera captures images at 10-second intervals, CNN detects and localizes LED positions (95.2% IoU accuracy), sensor positions are used to compute coverage metrics, metrics are translated into reward signals for the RL framework. The flow from data collection to analysis is traceable."
      }
    },
    "conflicts_of_interest": {
      "funding_disclosed": {
        "applies": true,
        "answer": false,
        "justification": "No acknowledgments section, no funding sources, no grants, and no financial support are mentioned anywhere in the paper."
      },
      "affiliations_disclosed": {
        "applies": true,
        "answer": true,
        "justification": "All five authors are listed with their affiliation: School of Electrical Engineering, Iran University of Science and Technology, Tehran, Iran."
      },
      "funder_independent_of_outcome": {
        "applies": false,
        "answer": false,
        "justification": "No funding source is disclosed. The work appears to be unfunded academic research from a university. The schema says NA if unfunded."
      },
      "financial_interests_declared": {
        "applies": true,
        "answer": false,
        "justification": "There is no competing interests statement, no patent disclosure, and no declaration of financial interests anywhere in the paper. Absence of disclosure does not satisfy this criterion."
      }
    },
    "contamination": {
      "training_cutoff_stated": {
        "applies": false,
        "answer": false,
        "justification": "This paper trains a DRL agent from scratch in a custom simulation environment. No pre-trained language model with a training data cutoff is evaluated on any benchmark."
      },
      "train_test_overlap_discussed": {
        "applies": false,
        "answer": false,
        "justification": "No pre-trained model with potential training data contamination is used. The DRL agent is trained online in the simulation environment."
      },
      "benchmark_contamination_addressed": {
        "applies": false,
        "answer": false,
        "justification": "No pre-trained model is evaluated on an existing benchmark. The evaluation uses a custom simulation, so benchmark contamination does not apply."
      }
    },
    "human_studies": {
      "pre_registered": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved. This is a simulation study."
      },
      "irb_or_ethics_approval": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved. IRB approval is not applicable."
      },
      "demographics_reported": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved. Demographics are not applicable."
      },
      "inclusion_exclusion_criteria": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved. Inclusion/exclusion criteria are not applicable."
      },
      "randomization_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved. Randomization of participants is not applicable."
      },
      "blinding_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved. Blinding is not applicable."
      },
      "attrition_reported": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved. Attrition reporting is not applicable."
      }
    },
    "cost_and_practicality": {
      "inference_cost_reported": {
        "applies": true,
        "answer": false,
        "justification": "The paper discusses computational complexity in O-notation (Section III.E) but reports no actual wall-clock inference time, per-decision latency, GPU/CPU requirements, or cost figures for running the system."
      },
      "compute_budget_stated": {
        "applies": true,
        "answer": false,
        "justification": "No hardware specifications, GPU hours, training time, or total computational budget are stated. The paper mentions TensorFlow but provides no information about the hardware used or time required for training."
      }
    }
  }
}
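
The arithmetic behind `agreement_rate` and the `abstract_claims_supported` explanation can be rechecked with a short script. This is a minimal sketch, not part of the survey tooling: the helper `relative_change` and all variable names are hypothetical, and the inputs are only the figures quoted in the explanations above (91.8%, 78.6%, and 72.5% coverage; 160, 120, and 110 episodes of lifetime).

```python
def relative_change(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, in percent (hypothetical helper)."""
    return (new - baseline) / baseline * 100.0


# Agreement between the two raters: 46 of 50 questions.
agreement_rate = 46 / 50

# Coverage: proposed method vs. the RSSI-based (distance-based) MARL baseline,
# and vs. static grid deployment.
coverage_vs_rssi_marl = relative_change(91.8, 78.6)
coverage_vs_static_grid = relative_change(91.8, 72.5)

# Network lifetime in episodes: vs. the same MARL baseline and vs. random placement.
lifetime_vs_rssi_marl = relative_change(160, 120)
lifetime_vs_random = relative_change(160, 110)

print(f"agreement_rate          = {agreement_rate:.2f}")
print(f"coverage vs RSSI MARL   = {coverage_vs_rssi_marl:.1f}%")
print(f"coverage vs static grid = {coverage_vs_static_grid:.1f}%")
print(f"lifetime vs RSSI MARL   = {lifetime_vs_rssi_marl:.1f}%")
print(f"lifetime vs random      = {lifetime_vs_random:.1f}%")
```

Under these inputs the comparison against the distance-based baseline comes out near 16.8% (coverage) and 33% (lifetime), while the abstract's 26.5% and 45% figures are only approached when the weaker static-grid and random-placement baselines are used.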
	git clone https://git.shiptheloop.com/ai-research-survey.git