scan-v5.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

scan-v5.json (26298B)
      1 {
      2   "scan_version": 5,
      3   "paper_type": "empirical",
      4   "paper": {
      5     "title": "Instruction Tuning of Large Language Models for Tabular Data Generation—in One Day",
      6     "authors": [
      7       "Milad Abdollahzadeh",
      8       "Abdul Raheem",
      9       "Zilong Zhao",
     10       "Uzair Javaid",
     11       "Kevin Yee",
     12       "Nalam Venkata Abhishek",
     13       "Tram Truong-Huu",
     14       "Biplab Sikdar"
     15     ],
     16     "year": 2025,
     17     "venue": "ICML 2025 (PMLR 267)",
     18     "arxiv_id": "2511.23220",
     19     "doi": "10.48550/arXiv.2511.23220"
     20   },
     21   "checklist": {
     22     "claims_and_evidence": {
     23       "abstract_claims_supported": {
     24         "applies": true,
     25         "answer": false,
     26         "justification": "The abstract claims performance 'on par with GPT-4o,' but Tables 1 and 2 show GPT-4o systematically outperforms ITT-GEN on most metrics (e.g., adult Shape: 92.34 vs 85.73; bank Shape: 93.42 vs 85.57; adult utility AUC: 0.873 vs 0.826). The framing is generous relative to the actual data.",
     27         "source": "haiku"
     28       },
     29       "causal_claims_justified": {
     30         "applies": true,
     31         "answer": true,
     32         "justification": "The paper claims instruction tuning causes improvement in tabular generation; this is supported by direct before/after comparisons of the same base model (Llama3.1-8B-Instruct with vs. without fine-tuning), which is adequate for the causal claim.",
     33         "source": "haiku"
     34       },
     35       "generalization_bounded": {
     36         "applies": true,
     37         "answer": false,
     38         "justification": "The paper draws broad conclusions about instruction tuning for tabular data generation based on 20 datasets, mostly small benchmarks. No explicit scope boundary is stated about dataset size, domain, or generalizability to other table types.",
     39         "source": "haiku"
     40       },
     41       "alternative_explanations_discussed": {
     42         "applies": true,
     43         "answer": false,
     44         "justification": "The paper does not consider alternative explanations for observed improvements — for example, that performance gains could be due to the specific instruction format, metadata richness, or dataset selection rather than instruction tuning per se.",
     45         "source": "haiku"
     46       },
     47       "proxy_outcome_distinction": {
     48         "applies": true,
     49         "answer": true,
     50         "justification": "The paper explicitly defines and distinguishes fidelity metrics (Shape for marginal distributions, Trend for inter-column correlations) from utility metrics (TSTR AUC/R2), making clear what is actually measured vs. the broader claim of 'data quality.'",
     51         "source": "haiku"
     52       }
     53     },
     54     "limitations_and_scope": {
     55       "limitations_section_present": {
     56         "applies": true,
     57         "answer": false,
     58         "justification": "Limitations are only mentioned briefly in the introduction to frame the resource constraints addressed; there is no dedicated limitations or threats-to-validity section in the paper.",
     59         "source": "haiku"
     60       },
     61       "threats_to_validity_specific": {
     62         "applies": true,
     63         "answer": false,
     64         "justification": "No specific threats to validity are discussed — issues like dataset selection bias, sensitivity to instruction format, single-run results without variance, or limited domain coverage are not addressed.",
     65         "source": "haiku"
     66       },
     67       "scope_boundaries_stated": {
     68         "applies": true,
     69         "answer": false,
     70         "justification": "The paper does not explicitly state what its results do NOT show — for example, that findings may not hold for larger tables, more complex schemas, or domains not represented in the 20 training datasets.",
     71         "source": "haiku"
     72       }
     73     },
     74     "conflicts_of_interest": {
     75       "funding_disclosed": {
     76         "applies": true,
     77         "answer": false,
     78         "justification": "No funding disclosure or acknowledgment section is present in the paper. Multiple authors are affiliated with Betterdata AI, a commercial data synthesis company, without disclosing whether this work was commercially funded.",
     79         "source": "haiku"
     80       },
     81       "affiliations_disclosed": {
     82         "applies": true,
     83         "answer": true,
     84         "justification": "Author affiliations are clearly listed on the first page: Singapore Institute of Technology (SIT), Betterdata AI, and National University of Singapore (NUS).",
     85         "source": "haiku"
     86       },
     87       "funder_independent_of_outcome": {
     88         "applies": true,
     89         "answer": false,
     90         "justification": "At least one author (corresponding author milad@betterdata.ai) is affiliated with Betterdata AI, a commercial tabular data synthesis company, directly benefiting from favorable results in this area. No conflict of interest statement is present.",
     91         "source": "haiku"
     92       },
     93       "financial_interests_declared": {
     94         "applies": true,
     95         "answer": false,
     96         "justification": "No competing interests or financial interests declaration appears in the paper. The commercial affiliation of multiple authors with Betterdata AI is not disclosed as a potential financial interest.",
     97         "source": "haiku"
     98       }
     99     },
    100     "scope_and_framing": {
    101       "key_terms_defined": {
    102         "applies": true,
    103         "answer": true,
    104         "justification": "Key terms are defined: 'tabular instruction tuning' is described with reference to prior work, 'fidelity' (Shape and Trend metrics) and 'utility' (TSTR framework) are explicitly defined in Section 5.1.",
    105         "source": "haiku"
    106       },
    107       "intended_contribution_clear": {
    108         "applies": true,
    109         "answer": true,
    110         "justification": "The paper explicitly claims three contributions: first exploration of instruction tuning for tabular data generation, creation of a 10K instruction dataset, and experimental validation showing competitive performance with GPT-4o using limited resources.",
    111         "source": "haiku"
    112       },
    113       "engagement_with_prior_work": {
    114         "applies": true,
    115         "answer": true,
    116         "justification": "Section 2 covers related work on tabular instruction tuning (TableLLM, TableLlama, TAMA) and tabular data generation (GANs, VAEs, diffusion models, LLM-based methods), showing how this work fills the gap between them.",
    117         "source": "haiku"
    118       }
    119     }
    120   },
    121   "type_checklist": {
    122     "empirical": {
    123       "artifacts": {
    124         "code_released": {
    125           "applies": true,
    126           "answer": false,
    127           "justification": "No code repository or release is mentioned anywhere in the paper.",
    128           "source": "haiku"
    129         },
    130         "data_released": {
    131           "applies": true,
    132           "answer": false,
    133           "justification": "The paper creates a novel 10K instruction dataset but provides no link or release for it. The underlying tabular datasets are publicly available, but the constructed instruction dataset (the actual contribution) is not released.",
    134           "source": "haiku"
    135         },
    136         "environment_specified": {
    137           "applies": true,
    138           "answer": false,
    139           "justification": "The paper mentions HuggingFace Transformers, DeepSpeed ZeRO-2, and an A100 80GB GPU, but no requirements file, Dockerfile, or package versions are provided.",
    140           "source": "haiku"
    141         },
    142         "reproduction_instructions": {
    143           "applies": true,
    144           "answer": false,
    145           "justification": "No step-by-step reproduction instructions are provided; the method is described at a high level but lacks the specificity needed to reproduce the experiments without guessing.",
    146           "source": "haiku"
    147         }
    148       },
    149       "statistical_methodology": {
    150         "confidence_intervals_or_error_bars": {
    151           "applies": true,
    152           "answer": false,
    153           "justification": "No confidence intervals or error bars are reported in Tables 1–5. All results are presented as single point estimates.",
    154           "source": "haiku"
    155         },
    156         "significance_tests": {
    157           "applies": true,
    158           "answer": false,
    159           "justification": "No statistical significance tests are conducted for any comparative claims despite multiple numerical comparisons across 20 datasets.",
    160           "source": "haiku"
    161         },
    162         "effect_sizes_reported": {
    163           "applies": true,
    164           "answer": true,
    165           "justification": "Effect sizes are implicit in the reported metric values (e.g., Shape scores improving from ~55 to ~84 for breast cancer), allowing readers to assess magnitude of improvements.",
    166           "source": "haiku"
    167         },
    168         "sample_size_justified": {
    169           "applies": true,
    170           "answer": false,
    171           "justification": "The choice of 7K training instructions and 20 datasets is not justified by power analysis or formal rationale; it is motivated only by resource efficiency rather than statistical adequacy.",
    172           "source": "haiku"
    173         },
    174         "variance_reported": {
    175           "applies": true,
    176           "answer": false,
    177           "justification": "No variance, standard deviation, or results across multiple training runs are reported. It is unclear whether experiments were run once or multiple times.",
    178           "source": "haiku"
    179         }
    180       },
    181       "evaluation_design": {
    182         "baselines_included": {
    183           "applies": true,
    184           "answer": true,
    185           "justification": "Two baselines are included: the untuned base LLM (Llama3.1-8B-Instruct) and GPT-4o as a strong commercial upper bound.",
    186           "source": "haiku"
    187         },
    188         "baselines_contemporary": {
    189           "applies": true,
    190           "answer": false,
    191           "justification": "The paper compares only against the base LLM and GPT-4o, omitting contemporary dedicated tabular synthesis methods (CTAB-GAN+, GReaT/Tabula, TabDiff, TTVAE) that directly compete in this task — making the 'on par' conclusion unjustifiable without those comparisons.",
    192           "source": "haiku"
    193         },
    194         "ablation_study": {
    195           "applies": true,
    196           "answer": false,
    197           "justification": "No ablation study is presented. The paper claims metadata quality is important ('preliminary experimental results suggest the importance of high-quality metadata') but does not show an ablation table comparing with/without metadata or with/without different components.",
    198           "source": "haiku"
    199         },
    200         "multiple_metrics": {
    201           "applies": true,
    202           "answer": true,
    203           "justification": "Multiple metrics are used: Shape and Trend for fidelity, and TSTR (Train-on-Synthetic-Test-on-Real) with three ML models (linear, random forest, XGBoost) for utility.",
    204           "source": "haiku"
    205         },
    206         "human_evaluation": {
    207           "applies": false,
    208           "answer": false,
    209           "justification": "Human evaluation is not applicable for tabular data generation evaluated via distributional and utility metrics.",
    210           "source": "haiku"
    211         },
    212         "held_out_test_set": {
    213           "applies": true,
    214           "answer": true,
    215           "justification": "Six datasets are held out as out-of-domain (OoD) evaluation sets not used during training; 14 datasets have separate train/evaluation splits (Section 4.1).",
    216           "source": "haiku"
    217         },
    218         "per_category_breakdown": {
    219           "applies": true,
    220           "answer": true,
     221           "justification": "Results are broken down per dataset for all 20 datasets in Tables 1–5, allowing dataset-level comparison.",
    222           "source": "haiku"
    223         },
    224         "failure_cases_discussed": {
    225           "applies": true,
    226           "answer": false,
    227           "justification": "The base LLM failure (generating irrelevant text ~80% of the time) is discussed, but there is no analysis of when or why ITT-GEN fails relative to GPT-4o, or for which data characteristics performance degrades.",
    228           "source": "haiku"
    229         },
    230         "negative_results_reported": {
    231           "applies": true,
    232           "answer": true,
    233           "justification": "The utility table shows ITT-GEN (0.543) underperforming the base LLM (0.685) on Tour & Travels Customer Churn, and the TableLlama appendix results show ITT-GEN still considerably below GPT-4o on several datasets.",
    234           "source": "haiku"
    235         }
    236       },
    237       "setup_transparency": {
    238         "model_versions_specified": {
    239           "applies": true,
    240           "answer": true,
    241           "justification": "Llama3.1-8B-Instruct is cited with Grattafiori et al. (2024) and GPT-4o with Hurst et al. (2024), providing sufficient specificity for the models used.",
    242           "source": "haiku"
    243         },
    244         "prompts_provided": {
    245           "applies": true,
    246           "answer": true,
    247           "justification": "The actual instruction template used for tabular data generation (Figure 2) and the GPT-4o metadata generation template (Figure 3) are both fully shown in the supplementary.",
    248           "source": "haiku"
    249         },
    250         "hyperparameters_reported": {
    251           "applies": true,
    252           "answer": true,
    253           "justification": "Learning rate (2e-5), batch size (3), epochs (2), hardware (A100 80GB), and optimizer configuration (DeepSpeed ZeRO-2) are all specified in Section 5.1.",
    254           "source": "haiku"
    255         },
    256         "scaffolding_described": {
    257           "applies": false,
    258           "answer": false,
    259           "justification": "No agentic scaffolding is used; this is a standard fine-tuning and inference setup.",
    260           "source": "haiku"
    261         },
    262         "data_preprocessing_documented": {
    263           "applies": true,
    264           "answer": true,
    265           "justification": "The instruction dataset construction process is described: 500 training + 100 evaluation instances per dataset, random row sampling (N=20), metadata generation via GPT-4o with manual review, and train/test split rationale.",
    266           "source": "haiku"
    267         }
    268       },
    269       "data_integrity": {
    270         "raw_data_available": {
    271           "applies": true,
    272           "answer": false,
    273           "justification": "The novel instruction dataset created by the authors is not released. While the underlying tabular datasets are publicly available, the constructed instructions (the actual contribution) cannot be independently verified.",
    274           "source": "haiku"
    275         },
    276         "data_collection_described": {
    277           "applies": true,
    278           "answer": true,
    279           "justification": "Table 3 lists all 20 datasets with their topics, row/column counts, and train/test assignment. The instruction construction process is described in Section 4.1.",
    280           "source": "haiku"
    281         },
    282         "recruitment_methods_described": {
    283           "applies": false,
    284           "answer": false,
    285           "justification": "No human participants; standard public benchmark datasets are used.",
    286           "source": "haiku"
    287         },
    288         "data_pipeline_documented": {
    289           "applies": true,
    290           "answer": true,
    291           "justification": "The pipeline from dataset selection → metadata generation (GPT-4o with manual review) → instruction construction → fine-tuning → evaluation is described sequentially in Sections 4.1 and 5.1.",
    292           "source": "haiku"
    293         }
    294       },
    295       "contamination": {
    296         "training_cutoff_stated": {
    297           "applies": true,
    298           "answer": false,
    299           "justification": "No training data cutoff is stated for GPT-4o or Llama3.1-8B-Instruct, which is relevant since these models may have seen the benchmark tabular datasets during pretraining.",
    300           "source": "haiku"
    301         },
    302         "train_test_overlap_discussed": {
    303           "applies": true,
    304           "answer": false,
    305           "justification": "The possibility that GPT-4o or Llama3.1 were pretrained on these public tabular datasets is not discussed, which could inflate GPT-4o's apparent performance on familiar datasets.",
    306           "source": "haiku"
    307         },
    308         "benchmark_contamination_addressed": {
    309           "applies": true,
    310           "answer": false,
    311           "justification": "Standard benchmarks (adult income, iris, Boston housing, diabetes, etc.) are widely used and almost certainly in GPT-4o's training data; this is not acknowledged or controlled for.",
    312           "source": "haiku"
    313         }
    314       },
    315       "human_studies": {
    316         "pre_registered": {
    317           "applies": false,
    318           "answer": false,
    319           "justification": "No human participants in this study.",
    320           "source": "haiku"
    321         },
    322         "irb_or_ethics_approval": {
    323           "applies": false,
    324           "answer": false,
    325           "justification": "No human participants.",
    326           "source": "haiku"
    327         },
    328         "demographics_reported": {
    329           "applies": false,
    330           "answer": false,
    331           "justification": "No human participants.",
    332           "source": "haiku"
    333         },
    334         "inclusion_exclusion_criteria": {
    335           "applies": false,
    336           "answer": false,
    337           "justification": "No human participants.",
    338           "source": "haiku"
    339         },
    340         "randomization_described": {
    341           "applies": false,
    342           "answer": false,
    343           "justification": "No human participants.",
    344           "source": "haiku"
    345         },
    346         "blinding_described": {
    347           "applies": false,
    348           "answer": false,
    349           "justification": "No human participants.",
    350           "source": "haiku"
    351         },
    352         "attrition_reported": {
    353           "applies": false,
    354           "answer": false,
    355           "justification": "No human participants.",
    356           "source": "haiku"
    357         }
    358       },
    359       "cost_and_practicality": {
    360         "inference_cost_reported": {
    361           "applies": true,
    362           "answer": false,
     363           "justification": "No inference latency or cost is reported for either ITT-GEN or GPT-4o; only training cost (less than 6 hours on one A100) is mentioned.",
    364           "source": "haiku"
    365         },
    366         "compute_budget_stated": {
    367           "applies": true,
    368           "answer": true,
    369           "justification": "Training compute is explicitly stated: single A100 80GB GPU for less than 6 hours, which is the paper's primary practical selling point.",
    370           "source": "haiku"
    371         }
    372       }
    373     }
    374   },
    375   "claims": [
    376     {
    377       "claim": "Instruction tuning on 7K high-quality instructions with one A100 GPU achieves tabular data generation performance on par with GPT-4o.",
    378       "evidence": "Tables 1 and 2 show ITT-GEN competitive with GPT-4o on several datasets, but GPT-4o systematically outperforms on most Shape metrics (e.g., adult 92.34 vs 85.73, bank 93.42 vs 85.57) and most utility metrics.",
    379       "supported": "weak"
    380     },
    381     {
    382       "claim": "Instruction tuning substantially improves the base LLM's capability for tabular data generation over the untuned model.",
     383       "evidence": "The base LLM fails to generate tabular data for ~80% of instructions; after fine-tuning, ITT-GEN consistently produces valid tabular output with much higher fidelity and utility scores on nearly all datasets (one exception: utility on Tour & Travels Customer Churn, 0.543 vs. 0.685 for the base LLM).",
    384       "supported": "strong"
    385     },
    386     {
    387       "claim": "High-quality metadata (general and column-wise descriptions) is important for steering LLMs in tabular data generation.",
    388       "evidence": "The paper states 'preliminary experimental results suggest the importance of high-quality metadata' but no ablation table is provided to support this claim quantitatively.",
    389       "supported": "unsupported"
    390     },
    391     {
    392       "claim": "The proposed approach is model-agnostic and also improves TableLlama as a base model.",
    393       "evidence": "Appendix Tables 4 and 5 show ITT-GEN improves TableLlama from complete failure (all '--') to producing valid outputs, though with a large gap remaining relative to GPT-4o.",
    394       "supported": "moderate"
    395     },
    396     {
    397       "claim": "Using rows as expected output during instruction tuning is better than next-token prediction used in prior work.",
    398       "evidence": "This is stated as a finding ('Our empirical results show that...') but no ablation table comparing the two approaches is provided.",
    399       "supported": "unsupported"
    400     }
    401   ],
    402   "methodology_tags": [
    403     "benchmark-eval"
    404   ],
    405   "key_findings": "The paper proposes ITT-GEN, an instruction-tuned Llama3.1-8B-Instruct model for tabular data generation, trained on a 7K instruction dataset created from 20 public tabular datasets. The fine-tuned model substantially outperforms the untuned base LLM, which fails to produce valid tabular output in ~80% of cases. However, the claim of performance 'on par with GPT-4o' is overstated — GPT-4o outperforms ITT-GEN on most fidelity and utility metrics. The primary contribution is demonstrating that instruction tuning for tabular generation is feasible with modest compute, not that it matches frontier commercial models.",
    406   "red_flags": [
    407     {
    408       "flag": "No comparison to domain-appropriate baselines",
    409       "detail": "The paper compares only against the untuned base LLM and GPT-4o, omitting all dedicated tabular synthesis methods (CTAB-GAN+, GReaT/Tabula, TabDiff, TTVAE) that are the actual state of the art for this task. This makes the 'competitive' framing misleading."
    410     },
    411     {
    412       "flag": "Overclaiming 'on par with GPT-4o'",
    413       "detail": "GPT-4o outperforms ITT-GEN on the majority of Shape and utility metrics across Tables 1–2. The abstract's 'on par' framing is not supported by the data."
    414     },
    415     {
     416       "flag": "Base LLM metrics computed on valid outputs only",
    417       "detail": "Base LLM fidelity metrics are calculated on only ~20% of its outputs (valid tabular portions), with ~80% discarded as irrelevant. The paper acknowledges this but still reports these inflated partial metrics in the comparison table without clear visual distinction."
    418     },
    419     {
     420       "flag": "Commercial conflict of interest undeclared",
    421       "detail": "The corresponding author's email is @betterdata.ai, a commercial synthetic data company that directly benefits from these results. No conflict of interest statement is present."
    422     },
    423     {
    424       "flag": "No variance or statistical testing",
    425       "detail": "All results are single-run point estimates with no standard deviations, confidence intervals, or significance tests, making it impossible to assess whether observed differences are meaningful."
    426     },
    427     {
    428       "flag": "No ablation study despite ablation claims",
    429       "detail": "The paper claims metadata quality is important and that row-output format beats next-token prediction, but neither claim is backed by an ablation table."
    430     },
    431     {
    432       "flag": "Potential benchmark contamination unaddressed",
    433       "detail": "Well-known datasets like adult income, iris, diabetes, and Boston housing are almost certainly in GPT-4o's pretraining corpus. This could inflate GPT-4o's benchmark performance in ways not controlled for."
    434     }
    435   ],
    436   "cited_papers": [
    437     {
    438       "title": "TableLlama: Towards Open Large Generalist Models for Tables",
    439       "relevance": "State-of-the-art open-source LLM for table understanding tasks; used as additional base model for ITT-GEN and as a key baseline reference."
    440     },
    441     {
    442       "title": "TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios",
    443       "relevance": "Related tabular instruction tuning work for QA and code generation; used to situate the research gap."
    444     },
    445     {
    446       "title": "Rethinking Table Instruction Tuning (TAMA)",
    447       "relevance": "Most recent prior work on tabular instruction tuning; cited to establish the gap in tabular data generation."
    448     },
    449     {
     450       "title": "Language Models are Realistic Tabular Data Generators (GReaT)",
    451       "relevance": "Prior LLM-based tabular generation approach that uses fine-tuning rather than instruction tuning; directly related to the paper's task."
    452     },
    453     {
    454       "title": "TabDiff: A Mixed-Type Diffusion Model for Tabular Data Generation",
    455       "relevance": "Contemporary state-of-the-art tabular generation method that should have been used as a baseline but was not."
    456     },
    457     {
    458       "title": "CTAB-GAN+: Enhancing Tabular Data Synthesis",
    459       "relevance": "GAN-based tabular synthesis baseline; cited as prior work but not compared against experimentally."
    460     },
    461     {
    462       "title": "Training Language Models to Follow Instructions with Human Feedback (InstructGPT)",
    463       "relevance": "Foundational work on instruction tuning that this paper extends to the tabular domain."
    464     },
    465     {
    466       "title": "Modeling Tabular Data using Conditional GAN (CTGAN)",
    467       "relevance": "Standard baseline for tabular data synthesis and source of evaluation metrics (TSTR framework)."
    468     }
    469   ],
    470   "engagement_factors": {
    471     "practical_relevance": {
    472       "score": 2,
    473       "justification": "Practitioners who need synthetic tabular data can fine-tune an open-source model cheaply (one A100, <6 hours) instead of paying for GPT-4o API calls."
    474     },
    475     "surprise_contrarian": {
    476       "score": 1,
    477       "justification": "The finding that 7K instruction examples can yield competitive tabular generation is somewhat surprising given the scale typically required, but the results are more modest than claimed."
    478     },
    479     "fear_safety": {
    480       "score": 0,
    481       "justification": "Synthetic tabular data generation has privacy implications but none are discussed; no AI safety concerns raised."
    482     },
    483     "drama_conflict": {
    484       "score": 0,
    485       "justification": "No controversy; standard ML paper in a niche area."
    486     },
    487     "demo_ability": {
    488       "score": 1,
    489       "justification": "In principle the model could be recreated and demoed, but no code or model weights are released, making immediate hands-on use impossible."
    490     },
    491     "brand_recognition": {
    492       "score": 1,
    493       "justification": "NUS affiliation provides minor recognition; Betterdata AI is a small commercial entity with limited name recognition in ML research."
    494     }
    495   },
    496   "hn_data": {
    497     "threads": [
    498       {
    499         "hn_id": "46523638",
    500         "title": "An Electronic Ising Machine",
    501         "points": 5,
    502         "comments": 1,
    503         "url": "https://news.ycombinator.com/item?id=46523638",
    504         "created_at": "2026-01-07T07:38:39Z"
    505       }
    506     ],
    507     "top_points": 5,
    508     "total_points": 5,
    509     "total_comments": 1
    510   }
    511 }
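
The checklist above nests each item as an object with `applies`, `answer`, `justification`, and `source` fields. A minimal sketch of how such a scan could be tallied, assuming the file has been parsed into a Python dict: the helper names `iter_items` and `summarize` are hypothetical and not part of this repository, and the inline `sample` is a cut-down illustration of the structure, not the real data.

```python
import json
from collections import Counter


def iter_items(node, path=()):
    """Yield (path, item) for each checklist leaf (a dict with 'applies' and 'answer')."""
    if isinstance(node, dict):
        if "applies" in node and "answer" in node:
            yield path, node
        else:
            for key, child in node.items():
                yield from iter_items(child, path + (key,))


def summarize(scan):
    """Count passed, failed, and not-applicable items across both checklists."""
    counts = Counter()
    for section in ("checklist", "type_checklist"):
        for _, item in iter_items(scan.get(section, {})):
            if not item["applies"]:
                counts["not_applicable"] += 1
            elif item["answer"]:
                counts["passed"] += 1
            else:
                counts["failed"] += 1
    return counts


# Illustrative sample only: mirrors the nesting of the scan file, not its contents.
sample = json.loads("""
{
  "checklist": {
    "claims_and_evidence": {
      "abstract_claims_supported": {"applies": true, "answer": false},
      "causal_claims_justified": {"applies": true, "answer": true}
    }
  },
  "type_checklist": {
    "empirical": {
      "human_studies": {
        "pre_registered": {"applies": false, "answer": false}
      }
    }
  }
}
""")

print(dict(summarize(sample)))
```

To run this against the real file, replace `sample` with `json.load(open("scan-v5.json"))`, assuming the line-number prefixes shown in this view are not present in the file on disk.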
	git clone https://git.shiptheloop.com/ai-research-survey.git