scan-v4.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

scan-v4.json (36420B)
      1 {
      2   "scan_version": 4,
      3   "paper_type": "empirical",
      4   "paper": {
      5     "title": "DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process",
      6     "authors": [
      7       "Minjun Zhu",
      8       "Yixuan Weng",
      9       "Linyi Yang",
     10       "Yue Zhang"
     11     ],
     12     "year": 2025,
     13     "venue": "Annual Meeting of the Association for Computational Linguistics",
     14     "arxiv_id": "2503.08569",
     15     "doi": "10.48550/arXiv.2503.08569"
     16   },
     17   "checklist": {
     18     "claims_and_evidence": {
     19       "abstract_claims_supported": {
     20         "applies": true,
     21         "answer": true,
     22         "justification": "Abstract claims of 88.21% and 80.20% win rates against GPT-o1 and DeepSeek-R1 match Table 4 (ICLR 2024, Overall Judgment). The claim that DeepReviewer-14B 'outperforms CycleReviewer-70B with fewer tokens' is supported by Tables 2-3 and Section 5.5's token comparison.",
     23         "source": "opus"
     24       },
     25       "causal_claims_justified": {
     26         "applies": true,
     27         "answer": true,
     28         "justification": "The main causal claims about reasoning stages improving performance are supported by controlled ablation through Fast/Standard/Best modes (Section 5.5), which progressively add reasoning stages. The attribution of adversarial robustness to 'multi-stage reasoning framework' (Section 5.4) is more speculative but hedged with 'We attribute.'",
     29         "source": "opus"
     30       },
     31       "generalization_bounded": {
     32         "applies": true,
     33         "answer": false,
     34         "justification": "The abstract claims to 'set a new benchmark for LLM-based paper review' broadly, but all experiments use only ICLR CS/ML papers. The title says 'Improving LLM-based Paper Review' without domain qualification. The limitations section mentions generalizability concerns but the main claims are not bounded to the tested setting.",
     35         "source": "opus"
     36       },
     37       "alternative_explanations_discussed": {
     38         "applies": true,
     39         "answer": false,
     40         "justification": "No alternative explanations for the results are discussed. Potential confounds are unaddressed: the Phi-4 base model's contribution vs. the training framework, whether the LLM judge (Gemini) has systematic biases, whether training data overlap with test data inflates scores, or whether CycleReviewer baselines were run under optimal conditions.",
     41         "source": "opus"
     42       },
     43       "proxy_outcome_distinction": {
     44         "applies": true,
     45         "answer": false,
     46         "justification": "The paper uses MSE/MAE against averaged ICLR review scores as a proxy for review quality and LLM-as-judge win rates as a proxy for review usefulness, without discussing the gap between these proxies and actual review quality. Whether matching aggregated human scores means a review is actually insightful or useful is not examined.",
     47         "source": "opus"
     48       }
     49     },
     50     "limitations_and_scope": {
     51       "limitations_section_present": {
     52         "applies": true,
     53         "answer": true,
     54         "justification": "A dedicated 'Limitations' section discusses three specific issues: synthetic data may not capture genuine human review nuances, computational intensity of 'Best' mode, and incomplete adversarial robustness.",
     55         "source": "opus"
     56       },
     57       "threats_to_validity_specific": {
     58         "applies": true,
     59         "answer": true,
     60         "justification": "The limitations are specific to this study: (1) 'synthetic data may not fully capture the complexities and nuances of genuine human paper review', (2) the 'Best' mode with 'complete reasoning chain and external knowledge retrieval, can be computationally intensive', (3) 'complete immunity is not yet achieved' for adversarial attacks.",
     61         "source": "opus"
     62       },
     63       "scope_boundaries_stated": {
     64         "applies": true,
     65         "answer": false,
     66         "justification": "The paper does not explicitly state what the results do NOT show. It does not bound results to ICLR/CS/ML papers despite testing only on that domain. The limitations discuss general issues but do not specify excluded settings, untested populations, or claims the authors are not making.",
     67         "source": "opus"
     68       }
     69     },
     70     "conflicts_of_interest": {
     71       "funding_disclosed": {
     72         "applies": true,
     73         "answer": true,
     74         "justification": "The footnote on page 1 states 'Supported by Research Center for Industries of the Future, Westlake University.'",
     75         "source": "opus"
     76       },
     77       "affiliations_disclosed": {
     78         "applies": true,
     79         "answer": true,
     80         "justification": "Author affiliations are listed: Zhejiang University, Westlake University, and University College London. The authors are not affiliated with any of the companies whose models they evaluate (OpenAI, Anthropic, Google, DeepSeek).",
     81         "source": "opus"
     82       },
     83       "funder_independent_of_outcome": {
     84         "applies": true,
     85         "answer": true,
     86         "justification": "The funder (Westlake University's Research Center for Industries of the Future) is an academic institution with no apparent commercial stake in the outcome of automated paper review systems.",
     87         "source": "opus"
     88       },
     89       "financial_interests_declared": {
     90         "applies": true,
     91         "answer": false,
     92         "justification": "No competing interests or financial interests statement is included in the paper. The absence of a disclosure statement means this criterion is not met.",
     93         "source": "opus"
     94       }
     95     },
     96     "scope_and_framing": {
     97       "key_terms_defined": {
     98         "applies": true,
     99         "answer": false,
    100         "justification": "'Review quality,' 'deep thinking,' and 'reliable evaluation' are used extensively without precise operational definitions; evaluation tasks (Score, Ranking, Selection) are operationally defined but what constitutes a 'good review' is never formally specified.",
    101         "source": "haiku"
    102       },
    103       "intended_contribution_clear": {
    104         "applies": true,
    105         "answer": true,
    106         "justification": "Three contributions are explicitly stated: DeepReview-13K dataset, DeepReviewer-14B model, and DeepReview-Bench benchmark.",
    107         "source": "haiku"
    108       },
    109       "engagement_with_prior_work": {
    110         "applies": true,
    111         "answer": true,
    112         "justification": "Related work section engages with prior LLM review systems (AgentReview, CycleReviewer, ReviewMT) and explains how DeepReview differs by adding structured multi-stage reasoning and evidence-based argumentation.",
    113         "source": "haiku"
    114       }
    115     }
    116   },
    117   "type_checklist": {
    118     "empirical": {
    119       "artifacts": {
    120         "code_released": {
    121           "applies": true,
    122           "answer": true,
    123           "justification": "The Resources section lists 'Code Repository: zhu-minjun/Researcher' and states 'The code, model, dataset and demo have be released in http://ai-researcher.net.' URLs are provided for models (DeepReviewer-7B, DeepReviewer-14B), dataset (DeepReview-13K), and a demo page.",
    124           "source": "opus"
    125         },
    126         "data_released": {
    127           "applies": true,
    128           "answer": true,
    129           "justification": "The Resources section lists 'Dataset: DeepReview-13K' as released, and the paper states it will be publicly available. The underlying data is also sourced from the publicly accessible OpenReview platform.",
    130           "source": "opus"
    131         },
    132         "environment_specified": {
    133           "applies": true,
    134           "answer": false,
    135           "justification": "The paper mentions '8x H100 80G GPUs with DeepSpeed + ZeRO3' and training hyperparameters (Section 4.3), but provides no requirements.txt, Dockerfile, or detailed library version specifications needed to recreate the environment.",
    136           "source": "opus"
    137         },
    138         "reproduction_instructions": {
    139           "applies": true,
    140           "answer": false,
    141           "justification": "No step-by-step reproduction instructions are provided in the paper. The methodology sections describe the pipeline conceptually but do not include specific commands, scripts, or a README-style guide to reproduce experiments.",
    142           "source": "opus"
    143         }
    144       },
    145       "statistical_methodology": {
    146         "confidence_intervals_or_error_bars": {
    147           "applies": true,
    148           "answer": false,
    149           "justification": "Tables 2, 3, and 4 report only point estimates (e.g., MSE, MAE, accuracy, win rates) with no confidence intervals, error bars, or ± notation anywhere in the paper.",
    150           "source": "opus"
    151         },
    152         "significance_tests": {
    153           "applies": true,
    154           "answer": false,
    155           "justification": "The paper makes many comparative claims (e.g., '44.80% reduction in Rating MSE', '6.04% improvement in Rating Spearman') but no statistical significance tests (p-values, t-tests, bootstrap tests) are reported for any comparison.",
    156           "source": "opus"
    157         },
    158         "effect_sizes_reported": {
    159           "applies": true,
    160           "answer": true,
    161           "justification": "The paper consistently reports improvements with baseline context, e.g., 'reduces Rating MSE by an average of 65.83%' (Section 5.2), 'improvements of 33.58% and 22.09%' in MSE and MAE vs CycleReviewer-70B (Section 5.2), and absolute numbers for both systems are visible in tables.",
    162           "source": "opus"
    163         },
    164         "sample_size_justified": {
    165           "applies": true,
    166           "answer": false,
    167           "justification": "The test set is 10% of the dataset (1,286 samples) split by random sampling. No power analysis or explicit justification for why this sample size is adequate for the claims made.",
    168           "source": "opus"
    169         },
    170         "variance_reported": {
    171           "applies": true,
    172           "answer": false,
    173           "justification": "No variance, standard deviation, or spread measures are reported anywhere. All results in Tables 2-4 and Figure 3 appear to be single-run point estimates.",
    174           "source": "opus"
    175         }
    176       },
    177       "evaluation_design": {
    178         "baselines_included": {
    179           "applies": true,
    180           "answer": true,
    181           "justification": "Extensive baselines are compared: prompt-based methods (AI Scientist, AgentReview) with multiple backbone LLMs (GPT-o1, Claude-3.5-sonnet, Gemini-2.0-Flash-Thinking, DeepSeek-V3, DeepSeek-R1) and fine-tuned baselines (CycleReviewer-8B, CycleReviewer-70B).",
    182           "source": "opus"
    183         },
    184         "baselines_contemporary": {
    185           "applies": true,
    186           "answer": true,
    187           "justification": "Baselines include GPT-o1-2024-12-17, Claude-3.5-sonnet-20241022, Gemini-2.0-Flash-Thinking-01-21, DeepSeek-R1, and CycleReviewer (ICLR 2025 submission). All are state-of-the-art models from 2024-2025.",
    188           "source": "opus"
    189         },
    190         "ablation_study": {
    191           "applies": true,
    192           "answer": true,
    193           "justification": "Section 5.5 presents test-time scaling analysis with Reasoning Path Scaling (Fast/Standard/Best modes, progressively adding reasoning stages z1/z2/z3) and Reviewer Scaling (R=1 to R=6), which function as ablations showing which components contribute to performance.",
    194           "source": "opus"
    195         },
    196         "multiple_metrics": {
    197           "applies": true,
    198           "answer": true,
    199           "justification": "Multiple metrics are used: Rating MSE, Rating MAE, Decision Accuracy, Decision F1, Rating Spearman correlation, and Pairwise Rating Accuracy (Table 2). Qualitative evaluation adds Constructive Value, Analytical Depth, Plausibility, Technical Accuracy, and Overall Judgment (Table 4).",
    200           "source": "opus"
    201         },
    202         "human_evaluation": {
    203           "applies": true,
    204           "answer": false,
    205           "justification": "The paper uses LLM-as-a-judge evaluation with Gemini-2.0-Flash-Thinking (Section 3.2, Table 4) rather than human evaluation. The appendix includes a qualitative case study comparing DeepReviewer's meta-review to real human reviews for one paper, but no systematic human evaluation of DeepReviewer's output quality is performed.",
    206           "source": "opus"
    207         },
    208         "held_out_test_set": {
    209           "applies": true,
    210           "answer": true,
    211           "justification": "Section 3.2 states 'we randomly sampled 10% (1.2K) of the dataset to create DeepReview-Bench.' Table 1 shows separate ICLR 2024 Test (652) and ICLR 2025 Test (634) sets distinct from training data.",
    212           "source": "opus"
    213         },
    214         "per_category_breakdown": {
    215           "applies": true,
    216           "answer": true,
    217           "justification": "Table 3 breaks down results by Soundness, Presentation, and Contribution dimensions. Table 4 breaks down by Constructive Value, Analytical Depth, Plausibility, Technical Accuracy, and Overall Judgment. Results are also separated by ICLR 2024 vs ICLR 2025.",
    218           "source": "opus"
    219         },
    220         "failure_cases_discussed": {
    221           "applies": true,
    222           "answer": false,
    223           "justification": "No qualitative error analysis or specific failure cases are shown. The paper mentions adversarial vulnerability (0.31-point rating increase under attack in Section 5.4) and Reviewer Scaling variability, but does not analyze specific examples where DeepReviewer produces poor or incorrect reviews.",
    224           "source": "opus"
    225         },
    226         "negative_results_reported": {
    227           "applies": true,
    228           "answer": false,
    229           "justification": "Every main comparison shows DeepReviewer winning. While Section 5.5 acknowledges performance variability in Reviewer Scaling when R≠4, this is presented as an expected artifact of training distribution, not a negative result. No failed approaches or configurations that underperformed are reported.",
    230           "source": "opus"
    231         }
    232       },
    233       "setup_transparency": {
    234         "model_versions_specified": {
    235           "applies": true,
    236           "answer": true,
    237           "justification": "Section 5.1 specifies 'GPT-o1-2024-12-17, Claude-3.5-sonnet-20241022, Gemini-2.0-Flash-Thinking-01-21, DeepSeek-V3, and DeepSeek-R1.' Section 4.2 specifies 'Qwen-2.5-72B-Instruct', 'Qwen-2.5-3B-Instruct', and 'Phi-4 14B' for the base model.",
    238           "source": "opus"
    239         },
    240         "prompts_provided": {
    241           "applies": true,
    242           "answer": true,
    243           "justification": "Appendix Figures 4, 5, 6, and 7 provide full system prompt text for the LLM-as-judge evaluation, review improvement, paper analysis, and reliability verification stages respectively. These are complete prompt texts, not just descriptions.",
    244           "source": "opus"
    245         },
    246         "hyperparameters_reported": {
    247           "applies": true,
    248           "answer": true,
    249           "justification": "Section 5.1 reports 'temperature of 0.4 with maximum input and output lengths set to 100K and 16,384 tokens.' Section 4.3 reports training: 'batch size of 16 and a learning rate of 5e-6', '23,500 steps', '256K context window using LongRoPE, with a 40K context window during training.'",
    250           "source": "opus"
    251         },
    252         "scaffolding_described": {
    253           "applies": true,
    254           "answer": true,
    255           "justification": "The multi-stage pipeline is described in detail in Section 4.2: Stage 1 uses Semantic Scholar API and OpenScholar for literature retrieval with ReRank for reordering; Stage 2 uses review reconstruction; Stage 3 uses Gemini for evidence analysis. The inference strategy (Section 4.3) describes how the three modes (Fast/Standard/Best) execute different subsets of the pipeline.",
    256           "source": "opus"
    257         },
    258         "data_preprocessing_documented": {
    259           "applies": true,
    260           "answer": false,
    261           "justification": "The paper describes collecting 18,976 submissions from OpenReview and converting with MinerU, with 'empty PDFs filtered during conversion' (footnote 2). The final dataset has 13,378 training + 1,286 test samples, leaving ~4,312 removed. The quality control mechanism is described conceptually but exact counts removed at each stage (PDF filtering, quality control failures) are not reported.",
    262           "source": "opus"
    263         }
    264       },
    265       "data_integrity": {
    266         "raw_data_available": {
    267           "applies": true,
    268           "answer": true,
    269           "justification": "The source data (ICLR reviews) is publicly available on OpenReview. The paper states DeepReview-13K will be released via ai-researcher.net (Resources section). The underlying ICLR review data is independently verifiable.",
    270           "source": "opus"
    271         },
    272         "data_collection_described": {
    273           "applies": true,
    274           "answer": true,
    275           "justification": "Section 3.1 describes: 'collected raw data from the OpenReview platform arXiv repository, gathering 18,976 paper submissions spanning two ICLR conference cycles (2024-2025).' Reviews include 'textual assessments (Strengths, Weaknesses, and Questions), interactive discussions from the rebuttal phase, and standardized scores.'",
    276           "source": "opus"
    277         },
    278         "recruitment_methods_described": {
    279           "applies": false,
    280           "answer": false,
    281           "justification": "No human participants are involved in the experiments. Data comes from publicly available ICLR submissions on OpenReview. The LLM-as-judge evaluation uses an automated model, not human evaluators.",
    282           "source": "opus"
    283         },
    284         "data_pipeline_documented": {
    285           "applies": true,
    286           "answer": false,
    287           "justification": "The pipeline stages are described (collect → convert → filter → construct reasoning chains → quality control) but exact counts at each stage are missing. Starting with 18,976 samples and ending with 14,664 (13,378 train + 1,286 test), approximately 4,312 were removed, but the paper does not break down how many were lost at PDF filtering vs. quality control.",
    288           "source": "opus"
    289         }
    290       },
    291       "contamination": {
    292         "training_cutoff_stated": {
    293           "applies": true,
    294           "answer": false,
    295           "justification": "The paper does not state the training data cutoff for Phi-4 (the base model for DeepReviewer-14B) or for any of the baseline models (GPT-o1, DeepSeek-R1, etc.). Since ICLR reviews are publicly available, these models may have seen the test data during pre-training.",
    296           "source": "opus"
    297         },
    298         "train_test_overlap_discussed": {
    299           "applies": true,
    300           "answer": false,
    301           "justification": "The paper does not discuss whether Phi-4's pre-training data includes ICLR 2024-2025 reviews from OpenReview. The test set (DeepReview-Bench) is split from the same ICLR data, and potential overlap with base model pre-training is not analyzed.",
    302           "source": "opus"
    303         },
    304         "benchmark_contamination_addressed": {
    305           "applies": true,
    306           "answer": false,
    307           "justification": "ICLR 2024 reviews were publicly available on OpenReview before the training cutoffs of Phi-4 and the baseline models. ICLR 2025 reviews may also have been available. The paper does not discuss this contamination risk.",
    308           "source": "opus"
    309         }
    310       },
    311       "human_studies": {
    312         "pre_registered": {
    313           "applies": false,
    314           "answer": false,
    315           "justification": "No human participants involved. All evaluations are automated (metrics computed on held-out data, LLM-as-judge).",
    316           "source": "opus"
    317         },
    318         "irb_or_ethics_approval": {
    319           "applies": false,
    320           "answer": false,
    321           "justification": "No human participants involved in the experiments. The paper has an ethical considerations section but does not involve human subjects research.",
    322           "source": "opus"
    323         },
    324         "demographics_reported": {
    325           "applies": false,
    326           "answer": false,
    327           "justification": "No human participants involved in the experiments.",
    328           "source": "opus"
    329         },
    330         "inclusion_exclusion_criteria": {
    331           "applies": false,
    332           "answer": false,
    333           "justification": "No human participants involved in the experiments.",
    334           "source": "opus"
    335         },
    336         "randomization_described": {
    337           "applies": false,
    338           "answer": false,
    339           "justification": "No human participants involved in the experiments.",
    340           "source": "opus"
    341         },
    342         "blinding_described": {
    343           "applies": false,
    344           "answer": false,
    345           "justification": "No human participants involved in the experiments.",
    346           "source": "opus"
    347         },
    348         "attrition_reported": {
    349           "applies": false,
    350           "answer": false,
    351           "justification": "No human participants involved in the experiments.",
    352           "source": "opus"
    353         }
    354       },
    355       "cost_and_practicality": {
    356         "inference_cost_reported": {
    357           "applies": true,
    358           "answer": false,
    359           "justification": "Section 5.5 reports approximate output token counts per mode (Fast ~3,000, Standard ~8,000, Best ~14,500) but no actual inference cost in dollars, wall-clock latency, or tokens-per-second throughput is reported. The Best mode's external API calls (Semantic Scholar, OpenScholar) add unquantified latency.",
    360           "source": "opus"
    361         },
    362         "compute_budget_stated": {
    363           "applies": true,
    364           "answer": false,
    365           "justification": "Section 4.3 mentions '8x H100 80G GPUs' and '23,500 steps with a batch size of 16' but does not report total training time in GPU-hours, wall-clock time, or total API costs for baseline evaluations.",
    366           "source": "opus"
    367         }
    368       },
    369       "experimental_rigor": {
    370         "seed_sensitivity_reported": {
    371           "applies": true,
    372           "answer": false,
    373           "justification": "No mention of multiple random seeds or seed sensitivity analysis. All results appear to be from single runs.",
    374           "source": "opus"
    375         },
    376         "number_of_runs_stated": {
    377           "applies": true,
    378           "answer": false,
    379           "justification": "The paper does not state how many runs produced the results in Tables 2-4 or Figure 3. It is unclear whether results are from single runs or averaged.",
    380           "source": "opus"
    381         },
    382         "hyperparameter_search_budget": {
    383           "applies": true,
    384           "answer": false,
    385           "justification": "Training hyperparameters (lr=5e-6, batch=16, 23,500 steps) are reported but there is no mention of how many configurations were tried, what search method was used, or whether any hyperparameter tuning was performed.",
    386           "source": "opus"
    387         },
    388         "best_config_selection_justified": {
    389           "applies": true,
    390           "answer": false,
    391           "justification": "No discussion of how the final training configuration was selected. The paper presents one configuration without explaining whether alternatives were tried or how this particular setup was chosen.",
    392           "source": "opus"
    393         },
    394         "multiple_comparison_correction": {
    395           "applies": true,
    396           "answer": false,
    397           "justification": "The paper makes dozens of comparisons across multiple metrics, models, and datasets (Tables 2-4) but performs no statistical tests at all, let alone corrections for multiple comparisons.",
    398           "source": "opus"
    399         },
    400         "self_comparison_bias_addressed": {
    401           "applies": true,
    402           "answer": false,
    403           "justification": "The authors implemented DeepReviewer and designed the evaluation framework (DeepReview-Bench) but do not acknowledge self-comparison bias. They also designed the data construction pipeline and the evaluation metrics, creating potential for systematic bias in their favor.",
    404           "source": "opus"
    405         },
    406         "compute_budget_vs_performance": {
    407           "applies": true,
    408           "answer": true,
    409           "justification": "Section 5.5 and Figure 3 explicitly show performance as a function of inference token count (compute). The paper notes that 'DeepReviewer's Fast mode, with only half the output tokens (3000), outperformed the CycleReviewer model (6000 output tokens) across various metrics.'",
    410           "source": "opus"
    411         },
    412         "benchmark_construct_validity": {
    413           "applies": true,
    414           "answer": false,
    415           "justification": "The paper uses ICLR review score prediction as a benchmark for review quality without discussing whether predicting aggregated reviewer scores actually measures the ability to produce useful reviews. Whether lower MSE against averaged human scores means better review quality (vs. just matching the central tendency) is not examined.",
    416           "source": "opus"
    417         },
    418         "scaffold_confound_addressed": {
    419           "applies": true,
    420           "answer": false,
    421           "justification": "Different baselines use fundamentally different scaffolding: AI Scientist uses agentic prompting, AgentReview uses multi-agent simulation, CycleReviewer is direct fine-tuning, and DeepReviewer has a multi-stage pipeline with external retrieval. Performance differences are attributed to the model/method without controlling for scaffolding differences.",
    422           "source": "opus"
    423         }
    424       },
    425       "data_leakage": {
    426         "temporal_leakage_addressed": {
    427           "applies": true,
    428           "answer": false,
    429           "justification": "The base model Phi-4 and baseline models (GPT-o1, DeepSeek-R1) may have been pre-trained on ICLR 2024 review data from OpenReview, which predates their release. This temporal leakage risk is not discussed.",
    430           "source": "opus"
    431         },
    432         "feature_leakage_addressed": {
    433           "applies": true,
    434           "answer": false,
    435           "justification": "No discussion of whether the evaluation setup leaks information. The training and test data are from the same conferences (ICLR 2024-2025) and the model is trained to predict scores from the same review platform.",
    436           "source": "opus"
    437         },
    438         "non_independence_addressed": {
    439           "applies": true,
    440           "answer": false,
    441           "justification": "Training and test data are randomly split from the same pool of ICLR 2024-2025 submissions. Papers by the same authors, on the same topics, or from the same research groups could appear in both splits. This non-independence is not discussed.",
    442           "source": "opus"
    443         },
    444         "leakage_detection_method": {
    445           "applies": true,
    446           "answer": false,
    447           "justification": "No leakage detection or prevention methods are employed. No canary strings, membership inference tests, decontamination pipelines, or overlap analysis between Phi-4's pre-training data and the ICLR test set.",
    448           "source": "opus"
    449         }
    450       }
    451     }
    452   },
    453   "claims": [
    454     {
    455       "claim": "DeepReviewer-14B reduces Rating MSE by 44.80% compared to CycleReviewer-70B on the DeepReview-Bench test set.",
    456       "evidence": "Table 2 shows DeepReviewer-14B achieving MSE of 1.3137/1.3410 vs CycleReviewer-70B at 2.4870/2.4294 on ICLR 2024/2025.",
    457       "supported": "strong"
    458     },
    459     {
    460       "claim": "DeepReviewer-14B achieves 88.21% and 80.20% win rates against AI Scientist (GPT-o1) and AI Scientist (DeepSeek-R1) respectively in LLM-as-judge evaluation.",
    461       "evidence": "Table 4 confirms these exact win rates for ICLR 2024, with Gemini-2.0-Flash-Thinking as the judge.",
    462       "supported": "moderate"
    463     },
    464     {
    465       "claim": "DeepReviewer demonstrates strong resilience to adversarial attacks despite no explicit adversarial training.",
    466       "evidence": "Figure 2 shows DeepReviewer's rating increases only 0.31 points under attack vs 4.26 for Gemini and 1.41 for DeepSeek-V3.",
    467       "supported": "moderate"
    468     },
    469     {
    470       "claim": "Test-time scaling (more reasoning tokens) consistently improves review quality across metrics.",
    471       "evidence": "Figure 3 shows positive regression trends for both Reasoning Path Scaling and Reviewer Scaling across metrics, though with variability.",
    472       "supported": "moderate"
    473     },
    474     {
    475       "claim": "DeepReviewer-14B with 14B parameters outperforms CycleReviewer-70B with 70B parameters, demonstrating parameter efficiency.",
    476       "evidence": "Table 2 shows DeepReviewer-14B dominating CycleReviewer-70B on all metrics; this holds across ICLR 2024 and 2025.",
    477       "supported": "strong"
    478     },
    479     {
    480       "claim": "The DeepReview-13K synthetic dataset captures the intermediate reasoning processes of academic paper reviews.",
    481       "evidence": "Section 4.2 describes the dataset construction pipeline, which uses Qwen-2.5-72B-Instruct and Gemini-2.0-Flash-Thinking, but the paper provides no human validation of whether the synthetic reasoning matches expert reasoning.",
    482       "supported": "weak"
    483     }
    484   ],
    485   "methodology_tags": [
    486     "benchmark-eval",
    487     "observational"
    488   ],
    489   "key_findings": "DeepReviewer-14B, trained on a synthetic dataset of 13,378 structured ICLR paper reviews, substantially outperforms larger models including CycleReviewer-70B and GPT-o1 on automated rating prediction (44.80% MSE reduction) and LLM-judged review quality (88.21% win rate vs GPT-o1). The multi-stage reasoning framework shows test-time scaling behavior: increasing reasoning depth (Fast→Standard→Best modes) and simulated reviewer count both improve performance. However, the review-quality evaluation relies entirely on an LLM judge without human validation, the model is trained and tested on the same ICLR conference distribution, which raises distribution-match concerns, and the base model's potential pretraining overlap with test papers is unaddressed.",
    490   "red_flags": [
    491     {
    492       "flag": "Train/test distribution overlap",
    493       "detail": "Both training data (DeepReview-13K) and test data (DeepReview-Bench) come from ICLR 2024/2025 papers; the base model Phi-4 may have seen test papers in pretraining. No contamination analysis is performed."
    494     },
    495     {
    496       "flag": "LLM-as-judge circularity",
    497       "detail": "Gemini-2.0-Flash-Thinking is used as the sole judge for qualitative evaluation while also being one of the systems compared against. No human validation of review quality is conducted."
    498     },
    499     {
    500       "flag": "No statistical significance testing",
    501       "detail": "All comparative results are point estimates with no confidence intervals, error bars, or significance tests, making it impossible to assess whether differences are meaningful."
    502     },
    503     {
    504       "flag": "Synthetic training data validity",
    505       "detail": "DeepReview-13K is generated by Qwen-2.5-72B-Instruct and Gemini-2.0-Flash-Thinking, not actual expert reviewers. The model learns to mimic LLM-generated review patterns, not genuine expert reasoning."
    506     },
    507     {
    508       "flag": "Missing component ablation",
    509       "detail": "No ablation isolates the individual contribution of z1 (novelty verification), z2 (multi-dimension review), and z3 (reliability verification); the Fast/Standard/Best comparison confounds multiple changes simultaneously."
    510     },
    511     {
    512       "flag": "Reward model overoptimization unaddressed",
    513       "detail": "The model is trained and evaluated on the same rating signal (ICLR reviewer average scores); reward hacking and overoptimization of the CycleReviewer metric are not investigated, although both are known issues in this literature."
    514     }
    515   ],
    516   "cited_papers": [
    517     {
    518       "title": "CycleResearcher: Improving Automated Research via Automated Review",
    519       "relevance": "Direct predecessor work training review models on ICLR data; DeepReviewer directly compares against CycleReviewer-70B as primary baseline."
    520     },
    521     {
    522       "title": "AgentReview: Exploring Peer Review Dynamics with LLM Agents",
    523       "relevance": "Agent-based peer review simulation; used as baseline in DeepReview experiments."
    524     },
    525     {
    526       "title": "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery",
    527       "relevance": "Automated scientific research system used as a baseline for paper review generation."
    528     },
    529     {
    530       "title": "OpenScholar: Synthesizing Scientific Literature with Retrieval-Augmented LMs",
    531       "relevance": "Literature retrieval system used in DeepReview's novelty verification stage."
    532     },
    533     {
    534       "title": "Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review",
    535       "relevance": "Examines adversarial vulnerabilities in LLM-based review systems; motivates DeepReview's robustness evaluation."
    536     },
    537     {
    538       "title": "Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews",
    539       "relevance": "Empirical study of LLM influence on peer review; provides context for automated review research."
    540     },
    541     {
    542       "title": "PeerRead: A Dataset of Peer Reviews",
    543       "relevance": "Early benchmark dataset for automated paper review tasks; foundational dataset in the area."
    544     },
    545     {
    546       "title": "Peer Review as a Multi-Turn and Long-Context Dialogue with Role-Based Interactions (ReviewMT)",
    547       "relevance": "Alternative approach to LLM-based review training; directly compared with DeepReview approach."
    548     },
    549     {
    550       "title": "Large Language Models for Automated Scholarly Paper Review: A Survey",
    551       "relevance": "Survey of LLM-based paper review methods providing landscape context for DeepReview's contributions."
    552     }
    553   ],
    554   "engagement_factors": {
    555     "practical_relevance": {
    556       "score": 2,
    557       "justification": "Released model and demo at ai-researcher.net/deepreviewer that researchers could use for self-assessment, though running the 14B model requires significant compute."
    558     },
    559     "surprise_contrarian": {
    560       "score": 1,
    561       "justification": "A 14B model outperforming 70B models and GPT-o1 is mildly surprising, but the trend of smaller fine-tuned models beating larger general models is well-established."
    562     },
    563     "fear_safety": {
    564       "score": 1,
    565       "justification": "Raises concerns about LLMs automating peer review and potential manipulation, but the paper explicitly positions itself as augmenting rather than replacing human reviewers."
    566     },
    567     "drama_conflict": {
    568       "score": 1,
    569       "justification": "Touches on the controversial topic of LLMs in peer review (ICLR 2025 introduced LLM assistance), but the paper avoids taking an adversarial position."
    570     },
    571     "demo_ability": {
    572       "score": 2,
    573       "justification": "Live demo available at ai-researcher.net/deepreviewer with released models (DeepReviewer-7B, 14B) and code repository."
    574     },
    575     "brand_recognition": {
    576       "score": 1,
    577       "justification": "Westlake University and UCL are recognized institutions but not in the top tier of AI lab brand recognition; published at ACL."
    578     }
    579   },
    580   "hn_data": {
    581     "threads": [
    582       {
    583         "hn_id": "45967079",
    584         "title": "Show HN: Browser-based interactive 3D Three-Body problem simulator",
    585         "points": 249,
    586         "comments": 113,
    587         "url": "https://news.ycombinator.com/item?id=45967079"
    588       },
    589       {
    590         "hn_id": "36349110",
    591         "title": "The Distributed Tensor Algebra Compiler (2022)",
    592         "points": 40,
    593         "comments": 6,
    594         "url": "https://news.ycombinator.com/item?id=36349110"
    595       },
    596       {
    597         "hn_id": "45202421",
    598         "title": "UGMM-NN: Univariate Gaussian Mixture Model Neural Network",
    599         "points": 31,
    600         "comments": 12,
    601         "url": "https://news.ycombinator.com/item?id=45202421"
    602       },
    603       {
    604         "hn_id": "44250248",
    605         "title": "Thermal Detection of People with Mobility Restrictions for Barrier Reduction",
    606         "points": 4,
    607         "comments": 0,
    608         "url": "https://news.ycombinator.com/item?id=44250248"
    609       },
    610       {
    611         "hn_id": "41139898",
    612         "title": "End-to-End Amp Modeling: From Data to Controllable Guitar Amplifier Models",
    613         "points": 2,
    614         "comments": 2,
    615         "url": "https://news.ycombinator.com/item?id=41139898"
    616       },
    617       {
    618         "hn_id": "43772236",
    619         "title": "Hands-On: Segmenting Individual Signs from Continuous Sequences",
    620         "points": 1,
    621         "comments": 0,
    622         "url": "https://news.ycombinator.com/item?id=43772236"
    623       }
    624     ],
    625     "top_points": 249,
    626     "total_points": 327,
    627     "total_comments": 133
    628   }
    629 }
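
The derived numbers in this record (the 44.80% Rating MSE reduction and the `hn_data` aggregates) can be re-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the scan pipeline: the values are transcribed by hand from the fields above, and the names `relative_reduction`, `mse`, and `hn_threads` are hypothetical, chosen only for illustration.

```python
# Sanity check of derived figures in the scan record.
# All values are transcribed by hand from the JSON above (assumption:
# no transcription errors); nothing is parsed from the file itself.

# Rating MSE from the first claim's "evidence" field (Table 2 of the paper).
mse = {
    "ICLR 2024": {"DeepReviewer-14B": 1.3137, "CycleReviewer-70B": 2.4870},
    "ICLR 2025": {"DeepReviewer-14B": 1.3410, "CycleReviewer-70B": 2.4294},
}


def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


reductions = {
    split: relative_reduction(
        scores["CycleReviewer-70B"], scores["DeepReviewer-14B"]
    )
    for split, scores in mse.items()
}

# (points, comments) for each entry in hn_data.threads, in file order.
hn_threads = [(249, 113), (40, 6), (31, 12), (4, 0), (2, 2), (1, 0)]

top_points = max(points for points, _ in hn_threads)
total_points = sum(points for points, _ in hn_threads)
total_comments = sum(comments for _, comments in hn_threads)

for split, pct in reductions.items():
    print(f"{split}: {pct:.2f}% lower Rating MSE than CycleReviewer-70B")
print(f"top_points={top_points} total_points={total_points} "
      f"total_comments={total_comments}")
```

Under these transcribed values, the 44.80% figure matches the ICLR 2025 split, while the ICLR 2024 split gives a somewhat larger reduction; the three `hn_data` aggregates agree with the per-thread entries.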