scan.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

scan.json (26900B)
{
  "paper": {
    "title": "Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning",
    "authors": ["Yun-Da Tsai", "Mingjie Liu", "Haoxing Ren"],
    "year": 2024,
    "venue": "arXiv preprint",
    "arxiv_id": "2407.05040"
  },
  "checklist": {
    "artifacts": {
      "code_released": {
        "applies": true,
        "answer": false,
        "justification": "No repository URL, GitHub link, or code archive is provided anywhere in the paper. The paper references public datasets but does not release its own implementation."
      },
      "data_released": {
        "applies": true,
        "answer": true,
        "justification": "The paper uses two publicly available datasets: Magicoder-OSS-Instruct-75K (MIT License) and Magicoder-Evol-Instruct-110K (Apache-2.0 License), both hosted on HuggingFace with URLs provided in Section 4.2. The evaluation benchmarks (HumanEval, MBPP, EvalPlus) are also public."
      },
      "environment_specified": {
        "applies": true,
        "answer": false,
        "justification": "The paper mentions 16 NVIDIA A100-80GB GPUs, PyTorch DDP, vLLM framework, and bf16 dtype (Section 4.2), but does not provide a requirements.txt, Dockerfile, or detailed dependency list with library versions sufficient to recreate the environment."
      },
      "reproduction_instructions": {
        "applies": true,
        "answer": false,
        "justification": "No step-by-step reproduction instructions or README with commands are provided. Algorithm 1 gives pseudocode for the pruning strategy, and training hyperparameters are stated (Section 4.1-4.2), but there are no specific scripts or commands a researcher could follow to reproduce the experiments."
      }
    },
    "statistical_methodology": {
      "confidence_intervals_or_error_bars": {
        "applies": true,
        "answer": false,
        "justification": "While the paper mentions standard deviation in the text ('we report the average and standard deviation' in Section 4.1), the main results in Table 1 report only averaged pass@1 values without confidence intervals or error bars. Figures 2 and 4 show shaded regions suggesting variance bands, but no CI notation or explicit ± values are given in tables."
      },
      "significance_tests": {
        "applies": true,
        "answer": false,
        "justification": "The paper claims improvements (e.g., 2.7% on HumanEval, 3.5% on MBPP) over full dataset training but does not report any statistical significance tests (no p-values, t-tests, or bootstrap tests). Claims of outperformance are based solely on comparing averaged point estimates."
      },
      "effect_sizes_reported": {
        "applies": true,
        "answer": true,
        "justification": "The paper reports percentage improvements with baseline context throughout. For example, Section 4.5: '2.7% on HumanEval and 3.5% on MBPP compared to training with the full dataset' and '3.9% on HumanEval and 1.5% on MBPP' degradation at 10% data. Table 1 provides raw scores for all methods, allowing readers to compute effect sizes."
      },
      "sample_size_justified": {
        "applies": true,
        "answer": false,
        "justification": "The paper uses 3 experimental runs per configuration (Section 4.1) but does not justify why 3 runs is sufficient. The Limitations section acknowledges 'Due to computational constraints, we only performed three runs' and that this 'may not fully capture the variability,' but no power analysis or formal justification is provided."
      },
      "variance_reported": {
        "applies": true,
        "answer": false,
        "justification": "Section 4.1 states 'we repeat each experiment 3 times and report the average and standard deviation,' and Figures 2, 4, 5 show shaded variance bands. However, Tables 1, 2, and 3 report only point estimates without any standard deviation values. The main numerical results lack explicit spread measures."
      }
    },
    "evaluation_design": {
      "baselines_included": {
        "applies": true,
        "answer": true,
        "justification": "The paper compares against multiple baselines: nocluster-random (no clustering, random selection), GPT-3.5 Turbo, GPT-4 Turbo, DeepSeek-Coder-Base/Instruct, Magicoder-DS, and MagicoderS-DS (Table 1). Ablation baselines include different clustering algorithms and pruning metrics."
      },
      "baselines_contemporary": {
        "applies": true,
        "answer": true,
        "justification": "Baselines include GPT-4 Turbo, DeepSeek-Coder (2024), and MagicoderS-DS (2023), all of which were contemporary models at the time of writing. The paper uses state-of-the-art code LLMs as reference points."
      },
      "ablation_study": {
        "applies": true,
        "answer": true,
        "justification": "Section 5 presents four ablation studies: (1) clustering algorithms (KMeans, Agglomerative, HDBSCAN, no clustering; Figure 4), (2) pruning metrics (density, diversity, random; Figure 5), (3) effect of PCA (Table 2), and (4) embedding inputs (instruction, code, or both; Table 3)."
      },
      "multiple_metrics": {
        "applies": true,
        "answer": true,
        "justification": "The paper reports pass@1 on four benchmarks: HumanEval, HumanEval+, MBPP, and MBPP+ (Table 1). EvalPlus provides augmented test suites (80x and 35x more tests) for more rigorous evaluation."
      },
      "human_evaluation": {
        "applies": true,
        "answer": false,
        "justification": "No human evaluation is included. All evaluation is automated via pass@1 on test cases from HumanEval/MBPP benchmarks. Given that the paper claims improvements in 'overall quality code generation,' human assessment of code quality beyond functional correctness would be relevant."
      },
      "held_out_test_set": {
        "applies": true,
        "answer": true,
        "justification": "The evaluation benchmarks (HumanEval, MBPP, and their EvalPlus variants) are separate from the training data (Magicoder-OSS-Instruct-75K and Magicoder-Evol-Instruct-110K). The training and evaluation datasets are distinct."
      },
      "per_category_breakdown": {
        "applies": true,
        "answer": true,
        "justification": "Results are broken down across four individual benchmarks (HumanEval, HumanEval+, MBPP, MBPP+) in Table 1 and across different compression ratios (10%-90%) in Figures 2-5. This shows how performance varies across settings."
      },
      "failure_cases_discussed": {
        "applies": true,
        "answer": true,
        "justification": "Figure 8 in the Appendix shows an example of a removed data sample (an outlier that is a product management prompt, not a coding task). Section 5.3 discusses the negative impact of PCA on HumanEval performance (4.3% degradation). The Limitations section acknowledges variability issues."
      },
      "negative_results_reported": {
        "applies": true,
        "answer": true,
        "justification": "Several negative results are reported: PCA degrades HumanEval performance by 4.3% (Table 2, Section 5.3); pruning metrics show limited benefit beyond clustering alone (Section 5.2, Figure 5); aggressive pruning below 10% degrades HumanEval performance significantly (Figure 3)."
      }
    },
    "claims_and_evidence": {
      "abstract_claims_supported": {
        "applies": true,
        "answer": true,
        "justification": "The abstract claims: (1) benchmark performance preserved at 10% data, supported by Table 1 showing 70.4% vs 74.3% on HumanEval; (2) consistent improvements through moderate pruning, supported by the Ours (90%) row in Table 1 showing 77.0% vs 74.3%; (3) reduced computational resources, supported by reduced training tokens (192M vs 234M). Claims are hedged appropriately ('largely preserved,' 'consistent improvements')."
      },
      "causal_claims_justified": {
        "applies": true,
        "answer": true,
        "justification": "The paper makes causal claims about pruning improving performance (e.g., 'moderate pruning of the training data... leading to improvements'). These are supported by controlled ablation studies in Section 5 where single variables are manipulated (clustering algorithm, pruning metric, PCA, embedding input) while holding others constant."
      },
      "generalization_bounded": {
        "applies": true,
        "answer": false,
        "justification": "The title and abstract make broad claims about 'LLM Fine-tuning for Code Generation' but all experiments use a single base model (DeepSeek-Coder-Base 6.7B), two synthetic training datasets, and Python-only benchmarks. The Potential Risks section notes 'This study focus exclusively on English prompts for Python code generation' but the title and abstract do not reflect these scope limitations."
      },
      "alternative_explanations_discussed": {
        "applies": true,
        "answer": false,
        "justification": "The paper does not discuss alternative explanations for why pruning improves performance. For instance, the improvement could be due to removing noisy/incorrect synthetic data rather than better sample diversity, or the batch size change at small data sizes could explain some results. No alternative explanations or confounds are considered."
      }
    },
    "setup_transparency": {
      "model_versions_specified": {
        "applies": true,
        "answer": false,
        "justification": "The paper specifies 'DeepSeek-Coder-Base 6.7B' and 'OpenAI text-embedding-ada-002' but does not provide specific version snapshots or API dates. For the compared models (GPT-3.5 Turbo, GPT-4 Turbo), no version identifiers are given; results are taken from prior work without specifying which model snapshot."
      },
      "prompts_provided": {
        "applies": false,
        "answer": false,
        "justification": "The paper fine-tunes a base model on instruction-code pairs and evaluates using EvalPlus scripts with existing benchmarks. There is no custom prompting of LLM APIs as part of the method; the approach is about data pruning for fine-tuning, not prompt engineering."
      },
      "hyperparameters_reported": {
        "applies": true,
        "answer": true,
        "justification": "Section 4.2 provides detailed hyperparameters: learning rate 5e-5, 15 warmup steps, linear scheduler, Adam optimizer, batch size 512 (or 32 for small datasets), 2 epochs, sequence length 4096. Section 4.3 specifies inference settings: bf16 dtype, tensor parallel size 2, max length 4096, greedy decoding. Section 4.4 reports PCA dimension of 10."
      },
      "scaffolding_described": {
        "applies": false,
        "answer": false,
        "justification": "No agentic scaffolding is used. The approach is a data pruning pipeline for fine-tuning, not an agent-based system."
      },
      "data_preprocessing_documented": {
        "applies": true,
        "answer": true,
        "justification": "Section 3 and Algorithm 1 document the full preprocessing pipeline: embedding via text-embedding-ada-002, PCA dimensionality reduction to 10 dimensions, clustering (HDBSCAN/KMeans/Agglomerative), and pruning metric application. Section 4.4 provides additional details including Scott's Rule for KDE bandwidth and the Elbow method for cluster selection."
      }
    },
    "limitations_and_scope": {
      "limitations_section_present": {
        "applies": true,
        "answer": true,
        "justification": "A dedicated 'Limitations' section (after Section 6) discusses two specific limitations: (1) only 3 runs due to computational constraints may not fully capture variability, and (2) batch size changes for small datasets and unexplored optimal hyperparameters."
      },
      "threats_to_validity_specific": {
        "applies": true,
        "answer": true,
        "justification": "The Limitations section discusses specific threats: 'Due to computational constraints, we only performed three runs and averaged the results... it may not fully capture the variability and could lead to less accurate conclusions.' It also notes the batch size change from 512 to 32 for small datasets could affect comparisons. The Potential Risks section discusses language and safety limitations."
      },
      "scope_boundaries_stated": {
        "applies": true,
        "answer": true,
        "justification": "The Potential Risks section explicitly states: 'This study focus exclusively on English prompts for Python code generation, thus prompts in other languages might not produce accurate or functional code.' The Limitations section notes that optimal hyperparameters for different data sizes are left as future work."
      }
    },
    "data_integrity": {
      "raw_data_available": {
        "applies": true,
        "answer": false,
        "justification": "While the training datasets are publicly available on HuggingFace, the pruned subsets, cluster assignments, and per-run experimental results are not released. There is no way to verify which specific data points were selected by the pruning algorithm."
      },
      "data_collection_described": {
        "applies": true,
        "answer": true,
        "justification": "Section 4.2 describes the training data: Magicoder-OSS-Instruct-75K and Magicoder-Evol-Instruct-110K, totaling 185K entries. These are synthetic code datasets with clearly stated licenses and HuggingFace URLs. Evaluation datasets (HumanEval with 164 problems, MBPP with 1401 problems) are described in Section 4.3."
      },
      "recruitment_methods_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved. The data consists of standard public benchmarks and synthetic code datasets."
      },
      "data_pipeline_documented": {
        "applies": true,
        "answer": true,
        "justification": "Algorithm 1 and Section 3 document the full pipeline: raw instruction-code pairs → embedding → PCA reduction → clustering → pruning metric scoring → probabilistic selection. Section 4.4 provides implementation details including specific parameter choices."
      }
    },
    "conflicts_of_interest": {
      "funding_disclosed": {
        "applies": true,
        "answer": false,
        "justification": "No funding or acknowledgments section is present in the paper. All three authors are affiliated with NVIDIA, but no explicit funding disclosure is provided."
      },
      "affiliations_disclosed": {
        "applies": true,
        "answer": true,
        "justification": "All three authors are clearly listed with their NVIDIA affiliation and email addresses on the first page."
      },
      "funder_independent_of_outcome": {
        "applies": true,
        "answer": false,
        "justification": "All authors are NVIDIA employees. NVIDIA has a direct commercial interest in efficient LLM training methods, as it sells the GPU hardware (A100s) used in the experiments and competes in the AI/LLM market. The funder (NVIDIA as employer) is not independent of the outcome."
      },
      "financial_interests_declared": {
        "applies": true,
        "answer": false,
        "justification": "No competing interests statement or financial interests declaration is present in the paper."
      }
    },
    "contamination": {
      "training_cutoff_stated": {
        "applies": true,
        "answer": false,
        "justification": "The paper fine-tunes DeepSeek-Coder-Base 6.7B but does not state its training data cutoff date. This is relevant because HumanEval (published 2021) and MBPP (published 2021) could have been in the base model's pretraining data."
      },
      "train_test_overlap_discussed": {
        "applies": true,
        "answer": false,
        "justification": "No discussion of potential overlap between DeepSeek-Coder-Base's pretraining data and the HumanEval/MBPP benchmarks. Since the base model was likely trained on public code including these benchmarks, this is a relevant concern not addressed."
      },
      "benchmark_contamination_addressed": {
        "applies": true,
        "answer": false,
        "justification": "HumanEval (2021) and MBPP (2021) were published well before DeepSeek-Coder's training. The paper does not address whether these benchmarks were in the base model's training data, which could inflate baseline performance and affect the measured impact of pruning."
      }
    },
    "human_studies": {
      "pre_registered": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved in this study."
      },
      "irb_or_ethics_approval": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved in this study."
      },
      "demographics_reported": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved in this study."
      },
      "inclusion_exclusion_criteria": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved in this study."
      },
      "randomization_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved in this study."
      },
      "blinding_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved in this study."
      },
      "attrition_reported": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved in this study."
      }
    },
    "cost_and_practicality": {
      "inference_cost_reported": {
        "applies": true,
        "answer": false,
        "justification": "The paper does not report inference cost or latency for the pruning pipeline itself (embedding + PCA + clustering + pruning). Table 2 reports runtime for KMeans with/without PCA (1307 sec vs 183 sec) but this is only for one configuration at 50% compression, and training/inference costs for the fine-tuned models are not reported."
      },
      "compute_budget_stated": {
        "applies": true,
        "answer": false,
        "justification": "The paper mentions 16 NVIDIA A100-80GB GPUs for training (Section 4.2) and reports training token counts (Table 1), but does not state total GPU hours, wall-clock training time, or API costs for the embedding step using text-embedding-ada-002."
      }
    }
  },
  "claims": [
    {
      "claim": "Benchmark performance can be largely preserved by training on only 10% of the data, with slight degradation of 3.9% on HumanEval and 1.5% on MBPP compared to using all data.",
      "evidence": "Table 1: Ours (10%) achieves 70.4% on HumanEval vs 74.3% full data, and 73.0% on MBPP vs 74.5% full data. Section 4.5 discusses these results.",
      "supported": "strong"
    },
    {
      "claim": "Moderate pruning (90% retention) consistently improves benchmark performance, with improvements of up to 2.7% on HumanEval and 3.5% on MBPP compared to full dataset training.",
      "evidence": "Table 1: Ours (90%) achieves 77.0% on HumanEval vs 74.3% full data (+2.7%), and 76.9% on MBPP vs 74.5% (+2.4%, though 3.5% is claimed for MBPP). Results averaged over 3 runs.",
      "supported": "moderate"
    },
    {
      "claim": "HDBSCAN-diversity method consistently outperforms the nocluster-random baseline across different compression ratios.",
      "evidence": "Figure 2 shows HDBSCAN-diversity above nocluster-random across all compression ratios on both MBPP and HumanEval benchmarks. Section 4.5 discusses this.",
      "supported": "moderate"
    },
    {
      "claim": "Clustering algorithm choice is critical, while pruning metrics have limited impact.",
      "evidence": "Section 5.1 (Figure 4) shows clustering algorithms improve over no-clustering baseline. Section 5.2 (Figure 5) shows diversity metric has only 'marginal advantage' and 'its benefit may be limited.'",
      "supported": "moderate"
    },
    {
      "claim": "Even with just 1% of the data (~700 samples), the method achieves a 4.1% improvement over the base model on MBPP.",
      "evidence": "Figure 3 and Table 1: Ours (1%) achieves 74.3% on MBPP vs base model 70.2%, a 4.1% improvement. Section 4.5 discusses this.",
      "supported": "strong"
    }
  ],
  "methodology_tags": ["benchmark-eval"],
  "key_findings": "Data pruning via HDBSCAN clustering with diversity-based selection can maintain or improve code generation performance while significantly reducing fine-tuning data. Training on 90% less data (10% retention) preserves most benchmark performance, with only 3.9% degradation on HumanEval and 1.5% on MBPP. Moderate pruning (removing 10-20% of the data) actually improves performance over full-data training by up to 2.7% on HumanEval and 3.5% on MBPP. Ablation studies show the clustering algorithm is the critical component, while the specific pruning metric has limited additional impact.",
  "red_flags": [
    {
      "flag": "No significance tests with small effect sizes",
      "detail": "Improvements of 2-3.5% are claimed over 3 runs without any statistical significance testing. Given the acknowledged randomness in clustering and training, these differences may not be statistically significant."
    },
    {
      "flag": "Single base model tested",
      "detail": "All experiments use only DeepSeek-Coder-Base 6.7B. Claims about 'LLM fine-tuning for code generation' are not tested on other model families or sizes, limiting generalizability."
    },
    {
      "flag": "Benchmark contamination not addressed",
      "detail": "HumanEval and MBPP were published in 2021, likely before DeepSeek-Coder's training cutoff. No discussion of whether the base model already memorized these benchmarks, which would affect the interpretation of pruning effects."
    },
    {
      "flag": "NVIDIA employees evaluating compute efficiency method",
      "detail": "All authors are NVIDIA employees presenting a method that reduces training compute while maintaining performance. NVIDIA has a commercial interest in demonstrating efficient GPU utilization. No conflict of interest statement is provided."
    },
    {
      "flag": "Variance not reported in tables",
      "detail": "Despite claiming 3-run averaging with standard deviation, Tables 1, 2, and 3 report only point estimates. The reader cannot assess result stability from the main results."
    },
    {
      "flag": "Batch size confound",
      "detail": "The batch size is changed from 512 to 32 for datasets below 10% of original size. This confounds the comparison of pruning levels, as the batch size change could independently affect training outcomes."
    }
  ],
  "cited_papers": [
    {
      "title": "DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence",
      "authors": ["Daya Guo", "Qihao Zhu", "Dejian Yang"],
      "year": 2024,
      "relevance": "Major open-source code LLM used as the base model; directly relevant to LLM code generation capabilities."
    },
    {
      "title": "Magicoder: Source Code is All You Need",
      "authors": ["Yuxiang Wei", "Zhe Wang", "Jiawei Liu", "Yifeng Ding", "Lingming Zhang"],
      "year": 2023,
      "arxiv_id": "2312.02120",
      "relevance": "Key baseline for code LLM fine-tuning with synthetic data; introduces OSS-Instruct technique used in this paper's training data."
    },
    {
      "title": "WizardCoder: Empowering Code Large Language Models with Evol-Instruct",
      "authors": ["Ziyang Luo", "Can Xu", "Pu Zhao"],
      "year": 2024,
      "relevance": "Baseline code LLM that uses Evol-Instruct for synthetic data generation; directly relevant to code generation quality."
    },
    {
      "title": "Evaluating Large Language Models Trained on Code",
      "authors": ["Mark Chen", "Jerry Tworek", "Heewoo Jun"],
      "year": 2021,
      "arxiv_id": "2107.03374",
      "relevance": "Introduces HumanEval benchmark, the primary evaluation metric used in code generation research."
    },
    {
      "title": "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation",
      "authors": ["Jiawei Liu", "Chunqiu Steven Xia", "Yuyao Wang", "Lingming Zhang"],
      "year": 2023,
      "relevance": "Introduces EvalPlus with augmented test suites (HumanEval+, MBPP+); directly relevant to code generation evaluation methodology."
    },
    {
      "title": "LIMA: Less is More for Alignment",
      "authors": ["Chunting Zhou", "Pengfei Liu", "Puxin Xu"],
      "year": 2023,
      "relevance": "Introduces the Superficial Alignment Hypothesis suggesting minimal fine-tuning data suffices; foundational to the data pruning motivation."
    },
    {
      "title": "LESS: Selecting Influential Data for Targeted Instruction Tuning",
      "authors": ["Mengzhou Xia", "Sadhika Malladi", "Suchin Gururangan", "Sanjeev Arora", "Danqi Chen"],
      "year": 2024,
      "arxiv_id": "2402.04333",
      "relevance": "Data selection method for instruction tuning using gradient-based influence metrics; directly relevant to efficient LLM training."
    },
    {
      "title": "Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning",
      "authors": ["Ben Sorscher", "Robert Geirhos", "Shashank Shekhar", "Surya Ganguli", "Ari Morcos"],
      "year": 2022,
      "relevance": "Foundational work on data pruning demonstrating that careful data selection can beat scaling laws; directly relevant methodology."
    },
    {
      "title": "Code Llama: Open Foundation Models for Code",
      "authors": ["Baptiste Roziere", "Jonas Gehring", "Fabian Gloeckle"],
      "year": 2023,
      "arxiv_id": "2308.12950",
      "relevance": "Major open-source code LLM built on LLaMA2; relevant to the landscape of code generation models."
    },
    {
      "title": "GPT-4 Technical Report",
      "authors": ["Josh Achiam", "Steven Adler", "Sandhini Agarwal"],
      "year": 2023,
      "arxiv_id": "2303.08774",
      "relevance": "Foundational LLM referenced for scaling laws and as a benchmark comparison point."
    },
    {
      "title": "AlpaGasus: Training a Better Alpaca with Fewer Data",
      "authors": ["Lichang Chen", "Shiyang Li", "Jun Yan"],
      "year": 2024,
      "relevance": "Data filtering method for instruction tuning using quality metrics; directly relevant to efficient fine-tuning approaches."
    },
    {
      "title": "LoRA: Low-Rank Adaptation of Large Language Models",
      "authors": ["Edward J Hu", "Yelong Shen", "Phillip Wallis"],
      "year": 2021,
      "arxiv_id": "2106.09685",
      "relevance": "Parameter-efficient fine-tuning method; relevant to efficient LLM training approaches."
    }
  ]
}
	git clone https://git.shiptheloop.com/ai-research-survey.git