calibration.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

calibration.json (20112B)
{
  "paper_slug": "3dshape2vecset-3d-shape-2023",
  "total_questions": 50,
  "agreement_count": 46,
  "disagreement_count": 4,
  "agreement_rate": 0.92,
  "disagreements": [
    {
      "category": "statistical_methodology",
      "question": "effect_sizes_reported",
      "sonnet_applies": true,
      "sonnet_answer": false,
      "opus_applies": true,
      "opus_answer": true,
      "opus_justification": "The paper reports raw metric values with baseline context throughout. For example, Table 3 shows IoU improvements from 0.953 (3DILG) to 0.965 (Ours), Table 6 shows Surface-FPD from 1.89 (3DILG) to 0.76 (Ours). Per the schema, 'A paper that says 12% improvement over baseline (from 45% to 57%) provides enough context for YES.' The paper consistently provides baseline values alongside its own, enabling the reader to assess the magnitude of improvement.",
      "sonnet_justification": "Raw metric differences are shown in tables but no standardized effect sizes (Cohen's d, etc.) are reported. The paper does not provide baseline context for interpreting magnitude of improvements.",
      "direction": "opus_generous"
    },
    {
      "category": "evaluation_design",
      "question": "human_evaluation",
      "sonnet_applies": false,
      "sonnet_answer": false,
      "opus_applies": true,
      "opus_answer": false,
      "opus_justification": "Human evaluation is relevant to 3D shape generation quality claims. Automated metrics like FID and IoU are proxies; human perceptual judgment of generated 3D shapes is a meaningful evaluation that the paper could have but did not include. The schema says 'NA if human evaluation is clearly irrelevant to the claims' -- but for generative modeling, human evaluation of output quality is standard and relevant. The paper did not include any human evaluation.",
      "sonnet_justification": "This is a 3D shape reconstruction and generation paper evaluated entirely with automated metrics (IoU, FID, etc.); human evaluation is not relevant to the geometric quality claims being made.",
      "direction": "applies_boundary"
    },
    {
      "category": "evaluation_design",
      "question": "failure_cases_discussed",
      "sonnet_applies": true,
      "sonnet_answer": false,
      "opus_applies": true,
      "opus_answer": true,
      "opus_justification": "Section 4 explicitly states: 'We initially explored many variations for 3D shape representation based on irregular and regular grids as well as tri-planes, frequency compositions, and other factored representations. Ultimately, we could not improve on existing irregular grids.' This constitutes a discussion of where certain approaches failed. Additionally, Section 8.8 (Limitations) discusses the drawbacks of the two-stage training strategy, including the cost of retraining. While not a traditional failure-case visualization, the paper does discuss where the approach has weaknesses.",
      "sonnet_justification": "The limitations section (8.8) discusses training cost issues but does not show failure cases or discuss specific shapes/scenarios where the method breaks down. Only successes are visualized.",
      "direction": "opus_generous"
    },
    {
      "category": "cost_and_practicality",
      "question": "compute_budget_stated",
      "sonnet_applies": true,
      "sonnet_answer": false,
      "opus_applies": true,
      "opus_answer": true,
      "opus_justification": "Section 7.3 states: 'The shape auto-encoder is trained on 8 A100, with batch size of 512 for T=1,600 epochs' and 'The diffusion models are trained on 4 A100 with batch size of 256 for T=8,000 epochs.' While the paper does not give a single total GPU-hours figure, it specifies the hardware (8 A100 GPUs, 4 A100 GPUs), the number of epochs (1,600 and 8,000), and batch sizes. This is sufficient for a reader to understand the approximate compute budget. The schema asks 'Is the total computational budget stated? Look for: GPU hours, total API spend, hardware used, training time.' Hardware used and training time (in epochs) are provided.",
      "sonnet_justification": "While hardware (8 A100 for autoencoder, 4 A100 for diffusion) and epoch counts are stated, the total GPU-hours or compute budget is not explicitly quantified.",
      "direction": "opus_generous"
    }
  ],
  "opus_checklist": {
    "artifacts": {
      "code_released": {
        "applies": true,
        "answer": true,
        "justification": "The abstract states 'Code: https://1zb.github.io/3DShape2VecSet/' -- a working URL is provided."
      },
      "data_released": {
        "applies": true,
        "answer": true,
        "justification": "The paper uses ShapeNet-v2 (Chang et al. 2015), a publicly available standard benchmark dataset. Per schema guidance, 'If the data is a standard public benchmark they didn't modify, YES.'"
      },
      "environment_specified": {
        "applies": true,
        "answer": false,
        "justification": "The paper describes hardware (8 A100 GPUs for autoencoder, 4 A100 for diffusion) but provides no requirements.txt, Dockerfile, conda file, or library version specifications. Only 'Python' is implied without detailed dependency information."
      },
      "reproduction_instructions": {
        "applies": true,
        "answer": false,
        "justification": "A code URL is provided but the paper text itself does not include step-by-step reproduction instructions or commands to replicate the main experiments."
      }
    },
    "statistical_methodology": {
      "confidence_intervals_or_error_bars": {
        "applies": true,
        "answer": false,
        "justification": "All results in Tables 3-9 report only point estimates (IoU, Chamfer, F-Score, FID, KID values) with no confidence intervals, error bars, or uncertainty measures."
      },
      "significance_tests": {
        "applies": true,
        "answer": false,
        "justification": "The paper makes comparative claims (e.g., 'we can beat PVD by a large margin') without any statistical significance tests. No p-values, t-tests, or bootstrap tests are reported."
      },
      "effect_sizes_reported": {
        "applies": true,
        "answer": true,
        "justification": "Per the schema definition, reporting baseline values alongside the method's own results provides the context needed. Tables 3-9 consistently show baseline metrics next to the proposed method's metrics (e.g., IoU 0.953 for 3DILG vs 0.965 for Ours in Table 3; FPD 1.89 for 3DILG vs 0.76 for Ours in Table 6). The reader can directly assess the magnitude of improvement from these comparisons."
      },
      "sample_size_justified": {
        "applies": true,
        "answer": false,
        "justification": "No power analysis or justification for why the number of test shapes or categories is sufficient for the comparative claims. The evaluation uses ShapeNet-v2 with 55 categories but no justification for adequacy."
      },
      "variance_reported": {
        "applies": true,
        "answer": false,
        "justification": "All results appear to be from single runs. No standard deviations, variance across seeds, or multiple-run results are reported in any table."
      }
    },
    "evaluation_design": {
      "baselines_included": {
        "applies": true,
        "answer": true,
        "justification": "The paper compares against OccNet, ConvOccNet, IF-Net, 3DILG for shape autoencoding (Table 3); PVD, 3DILG, NeuralWavelet, Grid-83 for generation (Tables 6-9); AutoSDF for text-conditioned; ShapeFormer for completion."
      },
      "baselines_contemporary": {
        "applies": true,
        "answer": true,
        "justification": "Baselines include 3DILG (NeurIPS 2022), NeuralWavelet (SIGGRAPH Asia 2022), PVD (ICCV 2021) -- all contemporary to this 2023 paper."
      },
      "ablation_study": {
        "applies": true,
        "answer": true,
        "justification": "Table 4 ablates number of latents M (512, 256, 128, 64). Table 5 ablates KL channel size C0 (1, 2, 4, 8, 16, 32, 64). Section 8.1 compares Learned Queries vs. Point Queries. Table 6 shows generation results across different C0 values."
      },
      "multiple_metrics": {
        "applies": true,
        "answer": true,
        "justification": "Shape autoencoding uses IoU, Chamfer distance, and F-Score. Generation uses Surface-FPD, Surface-KPD, Rendering-FID, Rendering-KID. Category-conditioned adds Precision, Recall, MMD-CD, MMD-EMD, COV-CD, COV-EMD."
      },
      "human_evaluation": {
        "applies": true,
        "answer": false,
        "justification": "This is a generative modeling paper. Human perceptual evaluation of generated 3D shapes is a relevant and commonly used evaluation method in generative modeling research. The paper relies entirely on automated metrics (FID, KID, IoU, etc.) with no human evaluation of output quality."
      },
      "held_out_test_set": {
        "applies": true,
        "answer": true,
        "justification": "Section 7 states 'We use the training/val splits in [Zhang et al. 2022]'. Section 8.1 refers to 'test split' shapes for visualization. Explicit training/test separation is used."
      },
      "per_category_breakdown": {
        "applies": true,
        "answer": true,
        "justification": "Table 3 provides per-category results for the 7 largest ShapeNet categories plus averages. Table 8 shows per-category generation results for airplane, chair, table, car, sofa. Table 9 adds chair and table breakdowns."
      },
      "failure_cases_discussed": {
        "applies": true,
        "answer": true,
        "justification": "Section 4 states: 'We initially explored many variations for 3D shape representation based on irregular and regular grids as well as tri-planes, frequency compositions, and other factored representations. Ultimately, we could not improve on existing irregular grids.' This is an explicit discussion of approaches that failed. Section 8.8 discusses limitations of the two-stage training approach."
      },
      "negative_results_reported": {
        "applies": true,
        "answer": true,
        "justification": "Section 8.2 reports that C0=64 gives worse generation results than C0=32. Section 4 discusses that irregular/regular grids and tri-planes were explored but 'could not improve on existing irregular grids.' These are genuine negative results."
      }
    },
    "claims_and_evidence": {
      "abstract_claims_supported": {
        "applies": true,
        "answer": true,
        "justification": "The abstract claims 'improved performance in 3D shape encoding and 3D shape generative modeling tasks' -- supported by Tables 3-9 showing improvements over baselines. It claims a 'wide variety of generative applications' -- supported by Sections 8.2-8.6 covering unconditional, category-conditioned, text-conditioned, completion, and image-conditioned generation."
      },
      "causal_claims_justified": {
        "applies": true,
        "answer": true,
        "justification": "The paper makes causal claims through ablation studies (e.g., 'Point Queries are generally better than Learned Queries', effect of M and C0). These are supported by controlled single-variable ablation experiments in Tables 3-5 where one component is varied while others are held constant."
      },
      "generalization_bounded": {
        "applies": true,
        "answer": false,
        "justification": "The abstract claims 'improved performance in 3D shape encoding and 3D shape generative modeling tasks' without bounding to ShapeNet. The title is broadly framed. All experiments use only ShapeNet-v2 (synthetic man-made objects). No explicit scope boundaries are stated regarding the data domain."
      },
      "alternative_explanations_discussed": {
        "applies": true,
        "answer": false,
        "justification": "The paper does not discuss alternative explanations for observed improvements. It does not consider whether gains come from the representation itself, increased model capacity, the attention architecture, or the specific training recipe. The limitations section only discusses training cost, not confounds."
      }
    },
    "setup_transparency": {
      "model_versions_specified": {
        "applies": false,
        "answer": false,
        "justification": "This paper trains its own neural networks from scratch. It does not use pre-trained LLMs or API-based models. BERT is used for text conditioning but referenced by citation, and the schema question targets LLM API versions which is not applicable here."
      },
      "prompts_provided": {
        "applies": false,
        "answer": false,
        "justification": "The paper does not use prompting or LLM APIs. All models are trained end-to-end from scratch. Text conditioning uses BERT embeddings, not prompt-based generation."
      },
      "hyperparameters_reported": {
        "applies": true,
        "answer": true,
        "justification": "Section 7.3 provides detailed hyperparameters: batch sizes (512 for autoencoder, 256 for diffusion), learning rates (5e-5 max for autoencoder, 1e-4 for diffusion), epochs (1,600 and 8,000), learning rate schedule (linear warmup + cosine decay), KL weight (0.001), M=512, C=512, C0=32. EDM default settings are also referenced."
      },
      "scaffolding_described": {
        "applies": false,
        "answer": false,
        "justification": "No agentic scaffolding is used. This is a neural network training and evaluation paper."
      },
      "data_preprocessing_documented": {
        "applies": true,
        "answer": true,
        "justification": "Section 7 documents preprocessing: shapes converted to watertight meshes, normalized to bounding box, 500K surface points sampled, 500K occupancy points from bounding volume, 500K near-surface points. Shape completion preprocessing and image rendering setup are also described."
      }
    },
    "limitations_and_scope": {
      "limitations_section_present": {
        "applies": true,
        "answer": true,
        "justification": "Section 8.8 is explicitly titled 'Limitations' and provides substantive discussion of the two-stage training drawbacks."
      },
      "threats_to_validity_specific": {
        "applies": true,
        "answer": false,
        "justification": "The limitations section discusses only training cost and retraining needs -- these are practical limitations, not threats to the validity of the performance claims. No discussion of whether ShapeNet results generalize, metric limitations, single-run variance, or other validity threats."
      },
      "scope_boundaries_stated": {
        "applies": true,
        "answer": false,
        "justification": "The paper does not explicitly state what its results do NOT show. No statements bound claims to ShapeNet-v2 or note that results may not transfer to real-world scanned data, non-rigid objects, or other domains."
      }
    },
    "data_integrity": {
      "raw_data_available": {
        "applies": true,
        "answer": true,
        "justification": "ShapeNet-v2 is a publicly available dataset that enables independent verification. The specific train/val splits from Zhang et al. 2022 are referenced."
      },
      "data_collection_described": {
        "applies": true,
        "answer": true,
        "justification": "Section 7 describes data source (ShapeNet-v2), splits (from Zhang et al. 2022), and preprocessing pipeline. Additional data sources for specific tasks are described (3D-R2N2 renderings for image conditioning, ShapeGlot text prompts)."
      },
      "recruitment_methods_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants. The data is a standard public 3D shape benchmark (ShapeNet-v2)."
      },
      "data_pipeline_documented": {
        "applies": true,
        "answer": true,
        "justification": "Section 7 documents the full pipeline: raw ShapeNet meshes -> watertight conversion -> bounding box normalization -> 500K surface point sampling -> 500K bounding volume occupancy sampling + 500K near-surface sampling. Each step is described."
      }
    },
    "conflicts_of_interest": {
      "funding_disclosed": {
        "applies": true,
        "answer": true,
        "justification": "Acknowledgments section states: 'This work was supported by the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI) as well as the ERC Starting Grant Scan2CAD (804724).'"
      },
      "affiliations_disclosed": {
        "applies": true,
        "answer": true,
        "justification": "Author affiliations are clearly listed: Biao Zhang and Peter Wonka at KAUST; Jiapeng Tang and Matthias Niessner at TU Munich."
      },
      "funder_independent_of_outcome": {
        "applies": true,
        "answer": true,
        "justification": "SDAIA-KAUST AI (government-academic center) and the ERC (European Research Council) are academic/government funders with no financial stake in whether this 3D shape representation outperforms competitors."
      },
      "financial_interests_declared": {
        "applies": true,
        "answer": false,
        "justification": "There is no competing interests or financial interests statement in the paper. Per the schema, 'If there is no competing interests statement at all, NO -- absence of disclosure is not the same as absence of conflict.'"
      }
    },
    "contamination": {
      "training_cutoff_stated": {
        "applies": false,
        "answer": false,
        "justification": "This paper trains its own neural networks from scratch on ShapeNet-v2. It does not evaluate a pre-trained foundation model's capability on a benchmark. Contamination questions are not applicable."
      },
      "train_test_overlap_discussed": {
        "applies": false,
        "answer": false,
        "justification": "Same as above -- the paper trains its own models with explicit train/val splits. Pre-training data contamination does not apply."
      },
      "benchmark_contamination_addressed": {
        "applies": false,
        "answer": false,
        "justification": "Same as above -- no pre-trained model capabilities are being evaluated."
      }
    },
    "human_studies": {
      "pre_registered": {
        "applies": false,
        "answer": false,
        "justification": "No human participants in this study."
      },
      "irb_or_ethics_approval": {
        "applies": false,
        "answer": false,
        "justification": "No human participants."
      },
      "demographics_reported": {
        "applies": false,
        "answer": false,
        "justification": "No human participants."
      },
      "inclusion_exclusion_criteria": {
        "applies": false,
        "answer": false,
        "justification": "No human participants."
      },
      "randomization_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants."
      },
      "blinding_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants."
      },
      "attrition_reported": {
        "applies": false,
        "answer": false,
        "justification": "No human participants."
      }
    },
    "cost_and_practicality": {
      "inference_cost_reported": {
        "applies": true,
        "answer": false,
        "justification": "The paper mentions 18 denoising steps for sampling but does not report inference latency, wall-clock time per sample, or cost per generated shape."
      },
      "compute_budget_stated": {
        "applies": true,
        "answer": true,
        "justification": "Section 7.3 specifies hardware and training duration: 'The shape auto-encoder is trained on 8 A100, with batch size of 512 for T=1,600 epochs. The diffusion models are trained on 4 A100 with batch size of 256 for T=8,000 epochs.' The schema asks to look for 'GPU hours, total API spend, hardware used, training time.' Hardware used and training time are provided."
      }
    }
  }
}
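
The summary fields at the top of the record (`total_questions`, `agreement_count`, `disagreement_count`, `agreement_rate`) can be cross-checked against the `disagreements` list and the checklist itself. The following is a minimal Python sketch of such a check, not the survey's own tooling. The function names `count_questions`, `check_summary`, and `infer_direction` are hypothetical, the rule for deriving `direction` is an assumption inferred from the four entries above, and the label `sonnet_generous` does not appear in this file.

```python
def count_questions(checklist):
    """Count question entries across all categories of a checklist."""
    return sum(len(questions) for questions in checklist.values())


def check_summary(record):
    """Return a list of inconsistencies found in the summary fields."""
    total = record["total_questions"]
    agree = record["agreement_count"]
    disagree = record["disagreement_count"]
    problems = []
    if agree + disagree != total:
        problems.append("agreement_count + disagreement_count != total_questions")
    if disagree != len(record["disagreements"]):
        problems.append("disagreement_count != number of listed disagreements")
    if abs(record["agreement_rate"] - agree / total) > 1e-9:
        problems.append("agreement_rate != agreement_count / total_questions")
    # Only checked when the full checklist is present in the record.
    checklist = record.get("opus_checklist")
    if checklist is not None and count_questions(checklist) != total:
        problems.append("checklist question count != total_questions")
    return problems


def infer_direction(entry):
    """Guess a direction label from the two models' verdicts.

    Assumed rule, inferred from the entries in this file:
    differing `applies` flags give "applies_boundary"; otherwise the
    model that answered true is labeled the generous one.
    """
    if entry["sonnet_applies"] != entry["opus_applies"]:
        return "applies_boundary"
    if entry["opus_answer"] and not entry["sonnet_answer"]:
        return "opus_generous"
    if entry["sonnet_answer"] and not entry["opus_answer"]:
        return "sonnet_generous"  # hypothetical label, not seen in this file
    return "agreement"


# Trimmed copy of the record above: justifications and checklist omitted.
sample = {
    "total_questions": 50,
    "agreement_count": 46,
    "disagreement_count": 4,
    "agreement_rate": 0.92,
    "disagreements": [
        {"question": "effect_sizes_reported", "sonnet_applies": True,
         "sonnet_answer": False, "opus_applies": True, "opus_answer": True},
        {"question": "human_evaluation", "sonnet_applies": False,
         "sonnet_answer": False, "opus_applies": True, "opus_answer": False},
        {"question": "failure_cases_discussed", "sonnet_applies": True,
         "sonnet_answer": False, "opus_applies": True, "opus_answer": True},
        {"question": "compute_budget_stated", "sonnet_applies": True,
         "sonnet_answer": False, "opus_applies": True, "opus_answer": True},
    ],
}

if __name__ == "__main__":
    print(check_summary(sample))
    for entry in sample["disagreements"]:
        print(entry["question"], infer_direction(entry))
```

To run the same check on the full file, load it with `json.load` and pass the resulting dict to `check_summary`; the checklist count is then compared against `total_questions` as well.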