scan-v5.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

scan-v5.json (25915B)
      1 {
      2   "scan_version": 5,
      3   "paper_type": "empirical",
      4   "paper": {
      5     "title": "ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces",
      6     "authors": [
      7       "Ramchand Kumaresan"
      8     ],
      9     "year": 2026,
     10     "venue": "arXiv",
     11     "arxiv_id": "2602.21231",
     12     "doi": null
     13   },
     14   "checklist": {
     15     "claims_and_evidence": {
     16       "abstract_claims_supported": {
     17         "applies": true,
     18         "answer": true,
     19         "justification": "All abstract claims are supported: σ routing achieves 55.6% (Table 1), exceeds Arena-2's 54.4% (Table 1), avoids full ensemble on 54.2% of tasks (32.9%+21.3% from Section 5.3), and retrieval decreases accuracy by 3.4pp (Table 2).",
     20         "source": "haiku"
     21       },
     22       "causal_claims_justified": {
     23         "applies": true,
     24         "answer": true,
     25         "justification": "Causal routing decisions are mechanistically specified in Algorithm 1: σ value deterministically selects execution mode. Retrieval harm is explained by median similarity of 0.167 (Figure 9), providing mechanistic justification for negative finding.",
     26         "source": "haiku"
     27       },
     28       "generalization_bounded": {
     29         "applies": true,
     30         "answer": true,
     31         "justification": "Results are bounded to 4 benchmarks and 3 proprietary models (Claude, GPT-4o, Gemini). The abstract claims a model-agnostic design, but the Section 8 limitations state that results may not generalize to open-source models and that SuperGPQA dominates the task mix (66%).",
     32         "source": "haiku"
     33       },
     34       "alternative_explanations_discussed": {
     35         "applies": true,
     36         "answer": true,
     37         "justification": "Paper discusses why σ-based approach chosen over learned routers (auditability trade-off, Section 3.2.3), why retrieval failed (weak similarity, Section 6.1), and mechanistic reasons for agreement-but-wrong failure (Section 6.2).",
     38         "source": "haiku"
     39       },
     40       "proxy_outcome_distinction": {
     41         "applies": true,
     42         "answer": true,
     43         "justification": "Measures accuracy on ground-truth answers (pass rates, code execution verification) and cost in USD. Claims match measurement granularity. Acknowledges code equivalence issue (Section 8) but does not claim code equivalence as accuracy.",
     44         "source": "haiku"
     45       }
     46     },
     47     "limitations_and_scope": {
     48       "limitations_section_present": {
     49         "applies": true,
     50         "answer": true,
     51         "justification": "Dedicated Section 8 'Limitations' with four specific constraints, plus Section 6 'Negative Results and Failure Modes' documenting three systematic failures beyond scope.",
     52         "source": "haiku"
     53       },
     54       "threats_to_validity_specific": {
     55         "applies": true,
     56         "answer": true,
     57         "justification": "Section 8 specifies: proprietary models only (no open-source), SuperGPQA dominance (66% of tasks), no learned router comparison, code equivalence inflation. Section 6.2 quantifies fundamental 8pp ceiling.",
     58         "source": "haiku"
     59       },
     60       "scope_boundaries_stated": {
     61         "applies": true,
     62         "answer": true,
     63         "justification": "Explicit boundaries: 1,510 tasks across 4 named benchmarks, 3 specific models, deterministic evaluation at temperature 0, σ routing with N=3 samples. What the paper does NOT show: generalization to open-source models, to other model counts, or to different problem domains.",
     64         "source": "haiku"
     65       }
     66     },
     67     "conflicts_of_interest": {
     68       "funding_disclosed": {
     69         "applies": true,
     70         "answer": false,
     71         "justification": "No funding section or acknowledgments disclosing financial support. Paper does not state whether funded by academic institution, company, or independent.",
     72         "source": "haiku"
     73       },
     74       "affiliations_disclosed": {
     75         "applies": true,
     76         "answer": false,
     77         "justification": "Author affiliation not stated in paper header or acknowledgments. Single author 'Ramchand Kumaresan' with no institution listed.",
     78         "source": "haiku"
     79       },
     80       "funder_independent_of_outcome": {
     81         "applies": false,
     82         "answer": false,
     83         "justification": "Cannot assess; funding not disclosed. Paper evaluates Claude (Anthropic), GPT-4o (OpenAI), Gemini (Google) — potential conflict if author affiliated with any provider, but affiliation unclear.",
     84         "source": "haiku"
     85       },
     86       "financial_interests_declared": {
     87         "applies": true,
     88         "answer": false,
     89         "justification": "No competing interests statement. Paper discloses AI assistance (Claude for code, ChatGPT for writing) but not financial interests, patents, or equity stakes.",
     90         "source": "haiku"
     91       }
     92     },
     93     "scope_and_framing": {
     94       "key_terms_defined": {
     95         "applies": true,
     96         "answer": true,
     97         "justification": "Self-consistency variance σ defined formally (Definition 1, Eq. 1), execution mode M(σ) defined (Definition 2), TEAMLLM substrate described (Section 3.1), ACAR routing procedure in Algorithm 1.",
     98         "source": "haiku"
     99       },
    100       "intended_contribution_clear": {
    101         "applies": true,
    102         "answer": true,
    103         "justification": "Three contributions explicitly stated in Section 1.2: (1) ACAR routing mechanism with empirical results, (2) negative results on retrieval and attribution, (3) TEAMLLM reproducible infrastructure. Abstract and introduction clearly frame as 'measurement framework.'",
    104         "source": "haiku"
    105       },
    106       "engagement_with_prior_work": {
    107         "applies": true,
    108         "answer": true,
    109         "justification": "Section 2 engages with three research areas: multi-model routing (RouterBench, FrugalGPT, RouteLLM), cost-aware inference, reproducible benchmarking. Section 2.3 explicitly states how ACAR differs from learned routers, observability platforms, and benchmark papers.",
    110         "source": "haiku"
    111       }
    112     }
    113   },
    114   "type_checklist": {
    115     "empirical": {
    116       "artifacts": {
    117         "code_released": {
    118           "applies": true,
    119           "answer": true,
    120           "justification": "Code and TEAMLLM substrate released at https://github.com/mechramc/ACAR-TeamLLM (stated in Section 3.1 and Appendix A). Figure regeneration scripts included.",
    121           "source": "haiku"
    122         },
    123         "data_released": {
    124           "applies": true,
    125           "answer": true,
    126           "justification": "All 7,550+ runs (1,510 ACAR-U, 1,510 ACAR-UJ, plus baselines) released as runs.jsonl with decision traces (Appendix B). Input benchmarks are public: LiveCodeBench, SuperGPQA, etc.",
    127           "source": "haiku"
    128         },
    129         "environment_specified": {
    130           "applies": true,
    131           "answer": false,
    132           "justification": "Paper logs environment fingerprints with each run (Section 3.1) but does not specify requirements.txt, Python version, or dependencies in the paper text. Environment specs must be inferred from GitHub repo.",
    133           "source": "haiku"
    134         },
    135         "reproduction_instructions": {
    136           "applies": true,
    137           "answer": false,
    138           "justification": "Paper states 'All figures regenerable from released artifacts' but provides no step-by-step instructions in the paper. Artifact manifest (Appendix B) lists directories but not how to execute them. GitHub repo likely has README, but paper itself lacks instructions.",
    139           "source": "haiku"
    140         }
    141       },
    142       "statistical_methodology": {
    143         "confidence_intervals_or_error_bars": {
    144           "applies": true,
    145           "answer": false,
    146           "justification": "Table 1 reports point estimates (55.6%, 54.4%, etc.) with no confidence intervals or error bars. Figures 2-7 show bar/line charts without error bands. No uncertainty quantification.",
    147           "source": "haiku"
    148         },
    149         "significance_tests": {
    150           "applies": true,
    151           "answer": false,
    152           "justification": "Comparative claims ('ACAR-U exceeds Arena-2 by 1.2pp') lack p-values or statistical significance tests. No tests for difference in accuracy between configurations.",
    153           "source": "haiku"
    154         },
    155         "effect_sizes_reported": {
    156           "applies": true,
    157           "answer": true,
    158           "justification": "Effect sizes reported: ACAR-U +1.2pp vs Arena-2 (Table 1), retrieval -3.4pp (Table 2), cost differences in USD. Escalation rates given as percentages (32.9% single, etc.).",
    159           "source": "haiku"
    160         },
    161         "sample_size_justified": {
    162           "applies": true,
    163           "answer": false,
    164           "justification": "1,510 tasks evaluated but no justification for this n. Section 4.1 explains benchmark selection as 'cover diverse task types' but no power analysis or sample size calculation provided.",
    165           "source": "haiku"
    166         },
    167         "variance_reported": {
    168           "applies": true,
    169           "answer": false,
    170           "justification": "Execution is deterministic (Section 3.1: 're-execution with identical inputs produces identical outputs'). Single runs per configuration reported; no standard deviations, confidence intervals, or repeated evaluations.",
    171           "source": "haiku"
    172         }
    173       },
    174       "evaluation_design": {
    175         "baselines_included": {
    176           "applies": true,
    177           "answer": true,
    178           "justification": "Four configurations compared: Single-Model, Arena-2, ACAR-U, Arena-3 (Table 1, Section 4.3). Ablation comparing ACAR-U vs ACAR-UJ (Table 2).",
    179           "source": "haiku"
    180         },
    181         "baselines_contemporary": {
    182           "applies": true,
    183           "answer": false,
    184           "justification": "Baselines (single-model, two-model, three-model ensembles) are simple ensembles, not competing routing methods. Related work cites RouterBench, FrugalGPT, and RouteLLM, but none of them is evaluated. The paper acknowledges this as an intentional design choice favoring auditability over optimization.",
    185           "source": "haiku"
    186         },
    187         "ablation_study": {
    188           "applies": true,
    189           "answer": true,
    190           "justification": "ACAR-U (without retrieval) vs ACAR-UJ (with retrieval augmentation) evaluated separately. Table 2 shows -3.4pp effect of adding Jungler retrieval component.",
    191           "source": "haiku"
    192         },
    193         "multiple_metrics": {
    194           "applies": true,
    195           "answer": true,
    196           "justification": "Accuracy (pass rate), cost (USD), escalation rate (% per mode), latency (ms), per-benchmark performance, per-mode breakdown all reported across Sections 5.1-5.4.",
    197           "source": "haiku"
    198         },
    199         "human_evaluation": {
    200           "applies": false,
    201           "answer": false,
    202           "justification": "N/A — evaluation on benchmarks (MathArena, Reasoning Gym, LiveCodeBench with code execution, SuperGPQA with multiple choice). No human subjects.",
    203           "source": "haiku"
    204         },
    205         "held_out_test_set": {
    206           "applies": false,
    207           "answer": false,
    208           "justification": "N/A — paper evaluates models on standard public benchmarks; it does not train models or hold out test data. Models are evaluated off-the-shelf.",
    209           "source": "haiku"
    210         },
    211         "per_category_breakdown": {
    212           "applies": true,
    213           "answer": true,
    214           "justification": "Results broken down by benchmark (Figure 3: MathArena, Reasoning Gym, LiveCodeBench, SuperGPQA), by execution mode (Figure 5: single/lite/full), by similarity/hit rate (Figures 8-9).",
    215           "source": "haiku"
    216         },
    217         "failure_cases_discussed": {
    218           "applies": true,
    219           "answer": true,
    220           "justification": "Section 6 'Negative Results and Failure Modes' documents three systematic failures: retrieval decreases accuracy (6.1, Table 2), agreement-but-wrong is unrecoverable (6.2), attribution proxies weak (6.3).",
    221           "source": "haiku"
    222         },
    223         "negative_results_reported": {
    224           "applies": true,
    225           "answer": true,
    226           "justification": "Negative results explicitly reported: ACAR-UJ -3.4pp vs ACAR-U (Table 2), 8pp ceiling from agreement-but-wrong (Section 6.2), attribution proxies 'showed weak correlation' (Section 6.3). Conclusion states 'What failed' alongside 'What worked.'",
    227           "source": "haiku"
    228         }
    229       },
    230       "setup_transparency": {
    231         "model_versions_specified": {
    232           "applies": true,
    233           "answer": false,
    234           "justification": "Models named: 'Claude Sonnet 4', 'GPT-4o', 'Gemini 2.0 Flash' (Section 4.2). No version hashes, snapshot dates, or training cutoff dates provided. Identifiable by name at publication time (2026-02) but not fully reproducible.",
    235           "source": "haiku"
    236         },
    237         "prompts_provided": {
    238           "applies": true,
    239           "answer": false,
    240           "justification": "Algorithm 1 mentions 'Mprobe(T)' sampling and EXTRACT function, but actual prompts/system instructions given to models are not shown. Section 3.1 mentions 'prompt template hash' is logged but template not disclosed.",
    241           "source": "haiku"
    242         },
    243         "hyperparameters_reported": {
    244           "applies": true,
    245           "answer": true,
    246           "justification": "σ thresholds (0.0, 0.5, 1.0) specified in Definition 1, N=3 samples justified in Section 3.2.3, temperature=0 for all models (Section 4.2), retrieval threshold=0.0 for ACAR-UJ (Section 3.2.4), with discussion of why thresholds >0.7 needed (Section 6.1).",
    247           "source": "haiku"
    248         },
    249         "scaffolding_described": {
    250           "applies": false,
    251           "answer": false,
    252           "justification": "N/A — paper evaluates models directly on benchmarks. No agentic scaffolding (tool use, chain-of-thought, step-by-step prompting) described. EXTRACT function handles answer canonicalization but not model scaffolding.",
    253           "source": "haiku"
    254         },
    255         "data_preprocessing_documented": {
    256           "applies": true,
    257           "answer": false,
    258           "justification": "Algorithm 1 mentions EXTRACT(ri) for canonicalization but does not detail how answers are extracted/compared. Section 8 acknowledges 'LiveCodeBench escalation is inflated by syntactically different but semantically equivalent outputs' but does not document preprocessing steps.",
    259           "source": "haiku"
    260         }
    261       },
    262       "data_integrity": {
    263         "raw_data_available": {
    264           "applies": true,
    265           "answer": true,
    266           "justification": "All runs (outputs) released as runs.jsonl with per-task decision traces. Appendix B lists: phase22_acar_u/runs.jsonl (1,510 ACAR-U runs), phase22_acar_uj/runs.jsonl, baseline runs (arena_3model, arena_2model, single_model).",
    267           "source": "haiku"
    268         },
    269         "data_collection_described": {
    270           "applies": false,
    271           "answer": false,
    272           "justification": "N/A — paper uses existing public benchmarks (MathArena, Reasoning Gym, LiveCodeBench, SuperGPQA). Does not describe collection of these benchmarks; evaluation section (4.1) describes benchmarks, not collection methodology.",
    273           "source": "haiku"
    274         },
    275         "recruitment_methods_described": {
    276           "applies": false,
    277           "answer": false,
    278           "justification": "N/A — no human participants. Benchmarks are task sets, not recruited subjects.",
    279           "source": "haiku"
    280         },
    281         "data_pipeline_documented": {
    282           "applies": true,
    283           "answer": true,
    284           "justification": "Section 3.1 describes TEAMLLM execution substrate: deterministic execution with seed/hash logging, immutable append-only artifacts (runs.jsonl), forward-only state machine. Algorithm 1 documents full routing procedure from task T to decision trace D.",
    285           "source": "haiku"
    286         }
    287       },
    288       "contamination": {
    289         "training_cutoff_stated": {
    290           "applies": true,
    291           "answer": false,
    292           "justification": "Models used (Claude Sonnet 4, GPT-4o, Gemini 2.0 Flash) are proprietary with unknown training data cutoffs. Paper does not state training dates or potential overlap with benchmark data.",
    293           "source": "haiku"
    294         },
    295         "train_test_overlap_discussed": {
    296           "applies": true,
    297           "answer": false,
    298           "justification": "Paper does not discuss whether benchmarks (MathArena, Reasoning Gym, LiveCodeBench, SuperGPQA) were in model training data. Potential contamination of proprietary models not addressed.",
    299           "source": "haiku"
    300         },
    301         "benchmark_contamination_addressed": {
    302           "applies": true,
    303           "answer": false,
    304           "justification": "LiveCodeBench mentioned as having 'temporal splits' (Section 2.3) but paper does not verify that tested models were trained before these benchmarks were created or evaluate contamination impact.",
    305           "source": "haiku"
    306         }
    307       },
    308       "cost_and_practicality": {
    309         "inference_cost_reported": {
    310           "applies": true,
    311           "answer": true,
    312           "justification": "Table 1 reports cost in USD: Single-Model $17.04, Arena-2 $20.64, ACAR-U $20.34, Arena-3 $20.64. Cost vs. accuracy Pareto frontier shown in Figure 4.",
    313           "source": "haiku"
    314         },
    315         "compute_budget_stated": {
    316           "applies": true,
    317           "answer": true,
    318           "justification": "Total compute budget stated: ACAR-U costs $20.34 total across 1,510 tasks (≈$0.013 per task). Section 5.2 compares cost-accuracy trade-off. Figure 6 shows cumulative cost progression.",
    319           "source": "haiku"
    320         }
    321       }
    322     }
    323   },
    324   "claims": [
    325     {
    326       "claim": "σ-based routing achieves 55.6% accuracy on 1,510 benchmark tasks",
    327       "evidence": "Table 1 reports ACAR-U accuracy of 55.6% (839/1510 correct)",
    328       "supported": "strong"
    329     },
    330     {
    331       "claim": "σ-based routing exceeds two-model baseline (Arena-2) by 1.2 percentage points while costing 1.5% less",
    332       "evidence": "Table 1: ACAR-U 55.6% ($20.34) vs Arena-2 54.4% ($20.64)",
    333       "supported": "strong"
    334     },
    335     {
    336       "claim": "ACAR avoids full ensembling on 54.2% of tasks by routing to single-agent or two-model modes",
    337       "evidence": "Section 5.3 reports escalation: 32.9% single-agent + 21.3% arena-lite = 54.2%, 45.8% full-arena",
    338       "supported": "strong"
    339     },
    340     {
    341       "claim": "Retrieval augmentation with low-quality stores decreases accuracy by 3.4 percentage points",
    342       "evidence": "Table 2 shows ACAR-UJ 52.4% vs ACAR-U 55.6%, difference -3.4pp across all benchmarks",
    343       "supported": "strong"
    344     },
    345     {
    346       "claim": "When models unanimously agree on incorrect answers (σ=0), no downstream ensemble can recover; this bounds achievable accuracy at 8pp below full ensembling",
    347       "evidence": "Section 6.2 explains 'agreement-but-wrong' as intrinsic to self-consistency; ACAR-U 55.6% vs Arena-3 63.6% = 8pp gap",
    348       "supported": "strong"
    349     },
    350     {
    351       "claim": "Attribution proxies (response similarity, entropy) showed weak correlation with ground-truth leave-one-out values",
    352       "evidence": "Section 6.3 states proxies 'showed weak correlation with ground-truth leave-one-out values; practical attribution requires explicit counterfactual computation'",
    353       "supported": "moderate"
    354     },
    355     {
    356       "claim": "σ-routing mechanism is model-agnostic and requires no learned components",
    357       "evidence": "Algorithm 1 shows deterministic routing based on σ; Section 3.2.3 justifies choice to avoid distribution shift. Only tested on 3 proprietary models.",
    358       "supported": "moderate"
    359     }
    360   ],
    361   "methodology_tags": [
    362     "benchmark-eval",
    363     "case-study"
    364   ],
    365   "key_findings": "ACAR proposes σ-based adaptive routing, using self-consistency variance to allocate compute across multi-model ensembles. On 1,510 benchmark tasks, σ-routing achieves 55.6% accuracy, exceeding two-model ensembling by 1.2pp while avoiding the full ensemble on 54.2% of tasks. The paper documents three failures: naive retrieval augmentation hurts accuracy (-3.4pp) without task-aligned semantic thresholds (>0.7); accuracy stays 8pp below full ensembling because unanimous-but-wrong agreement (σ=0) is never escalated; and post-hoc attribution from proxy signals correlates only weakly with ground-truth leave-one-out values.",
    366   "red_flags": [
    367     {
    368       "flag": "No contamination analysis",
    369       "detail": "Evaluates proprietary models (Claude, GPT-4o, Gemini) with unknown training cutoffs on benchmarks; does not discuss potential overlap between model training data and evaluation benchmarks."
    370     },
    371     {
    372       "flag": "No statistical significance testing",
    373       "detail": "Reports point estimates (55.6% vs 54.4%) without confidence intervals, p-values, or error bars. ACAR-U leads by only 1.2pp; statistical significance unclear given no multiple trials."
    374     },
    375     {
    376       "flag": "Deterministic runs without uncertainty quantification",
    377       "detail": "Single deterministic execution per configuration (by design). No repeated evaluations or confidence intervals. Uncertainty in model outputs not measured."
    378     },
    379     {
    380       "flag": "Prompts not disclosed",
    381       "detail": "Algorithm and hyperparameters described, but actual prompts/system instructions given to models are not provided. Prompt hashes logged but templates hidden, limiting reproducibility."
    382     },
    383     {
    384       "flag": "Limited baseline comparisons",
    385       "detail": "No comparison to other routing methods (RouterBench, FrugalGPT, RouteLLM) mentioned in related work. Only compared to naive ensemble baselines."
    386     },
    387     {
    388       "flag": "Weak evidence for attribution failure",
    389       "detail": "Section 6.3 claims attribution proxies 'showed weak correlation' but provides no figures, correlation coefficients, or detailed analysis. Minimal supporting evidence."
    390     },
    391     {
    392       "flag": "Benchmark dominance not controlled",
    393       "detail": "SuperGPQA comprises 66% of tasks (1,000/1,510); results heavily skewed toward knowledge-based multiple-choice. No stratified or weighted analysis."
    394     },
    395     {
    396       "flag": "Missing funding/affiliation disclosure",
    397       "detail": "No funding source stated. Single author with no institutional affiliation listed. Paper evaluates three major LLM providers; conflict-of-interest status unclear."
    398     }
    399   ],
    400   "cited_papers": [
    401     {
    402       "title": "RouterBench: A benchmark for multi-LLM routing system",
    403       "relevance": "Benchmark for evaluating LLM routing systems; ACAR differs by using heuristic σ-based routing instead of learned classifiers"
    404     },
    405     {
    406       "title": "FrugalGPT: How to use large language models while reducing cost and improving performance",
    407       "relevance": "Cascading cost-aware routing strategy; ACAR discusses this approach in related work on cost-quality trade-offs but does not evaluate against it"
    408     },
    409     {
    410       "title": "RouteLLM: Learning to route LLMs with preference data",
    411       "relevance": "Preference-learning based routing; ACAR explicitly avoids learned routers for interpretability"
    412     },
    413     {
    414       "title": "ReAct: Synergizing reasoning and acting in language models",
    415       "relevance": "Single-model agentic reasoning with tools; related to multi-model orchestration but focuses on tool use within one model"
    416     },
    417     {
    418       "title": "A survey on mixture of experts in large language models",
    419       "relevance": "Token-level routing within a single model; orthogonal to inter-model orchestration"
    420     },
    421     {
    422       "title": "LiveCodeBench: A challenging benchmark for code generation with execution-based verification",
    423       "relevance": "Execution-verified code evaluation benchmark used to evaluate ACAR routing on deterministic code tasks"
    424     },
    425     {
    426       "title": "The Shapley value in machine learning",
    427       "relevance": "Attribution method for assigning credit in multi-agent systems; ACAR discusses why Shapley-like proxies fail"
    428     }
    429   ],
    430   "engagement_factors": {
    431     "practical_relevance": {
    432       "score": 2,
    433       "justification": "Cost-quality trade-offs are immediately relevant to practitioners deploying multi-model LLM systems, but requires access to three different proprietary APIs simultaneously."
    434     },
    435     "surprise_contrarian": {
    436       "score": 2,
    437       "justification": "Challenges assumptions: retrieval augmentation with weak semantic alignment hurts (not helps), attribution from proxy signals doesn't work, and unanimous model agreement is an irreducible failure mode."
    438     },
    439     "fear_safety": {
    440       "score": 0,
    441       "justification": "Paper focuses on cost-efficiency and routing optimization. No AI safety, alignment, or risk concerns raised or addressed."
    442     },
    443     "drama_conflict": {
    444       "score": 1,
    445       "justification": "Technical contribution lacks narrative drama. 'Agreement-but-wrong' failure is intellectually interesting but not emotionally engaging."
    446     },
    447     "demo_ability": {
    448       "score": 2,
    449       "justification": "Code and artifacts released on GitHub; results are reproducible from provided runs.jsonl. However, requires API keys for three proprietary models, limiting who can reproduce live."
    450     },
    451     "brand_recognition": {
    452       "score": 1,
    453       "justification": "Single author with no stated affiliation. TEAMLLM is novel but not an established brand. The paper is likely to appeal to infra-focused practitioners rather than a general audience."
    454     }
    455   },
    456   "hn_data": {
    457     "threads": [
    458       {
    459         "hn_id": "47154950",
    460         "title": "Aletheia Tackles FirstProof Autonomously",
    461         "points": 5,
    462         "comments": 0,
    463         "url": "https://news.ycombinator.com/item?id=47154950",
    464         "created_at": "2026-02-25T17:46:36Z"
    465       },
    466       {
    467         "hn_id": "47314080",
    468         "title": "Latent Context Compilation: Distilling Long Context into Compact Portable Memory",
    469         "points": 2,
    470         "comments": 0,
    471         "url": "https://news.ycombinator.com/item?id=47314080",
    472         "created_at": "2026-03-09T19:21:30Z"
    473       }
    474     ],
    475     "top_points": 5,
    476     "total_points": 7,
    477     "total_comments": 0
    478   }
    479 }
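
The red flags above note that the 55.6% vs 54.4% comparison comes with no confidence intervals or significance tests. A rough check is possible from the figures recorded in this scan alone. The sketch below is not from the paper. The helper names `wilson_interval` and `two_proportion_z` are hypothetical. It assumes the two configurations can be treated as independent samples of n = 1,510 each, although both were run on the same tasks, and it uses the rounded 54.4% for Arena-2 because the scan records no raw count for it.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> tuple[float, float]:
    """Unpaired two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


N_TASKS = 1510            # task count recorded in the scan
ACAR_U_CORRECT = 839      # 839/1510 from the Table 1 claim
ARENA_2_ACCURACY = 0.544  # rounded accuracy; raw count not recorded in the scan

low, high = wilson_interval(ACAR_U_CORRECT, N_TASKS)
z_stat, p_val = two_proportion_z(ACAR_U_CORRECT / N_TASKS, ARENA_2_ACCURACY,
                                 N_TASKS, N_TASKS)

print(f"ACAR-U accuracy 95% CI: [{low:.3f}, {high:.3f}]")
print(f"ACAR-U vs Arena-2: z = {z_stat:.2f}, two-sided p = {p_val:.2f}")
```

Under these assumptions the 1.2pp gap is well inside the sampling noise of a single 1,510-task run. Because both configurations ran on the same tasks, a paired test such as McNemar's on the per-task outcomes in the released runs.jsonl would be the more appropriate check; this unpaired version is only a first approximation.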
	git clone https://git.shiptheloop.com/ai-research-survey.git