238_022 · prediction · AI · AI-timing

From here forward, training data will be synthetic (pre-training era of human internet data is over)

Predictor: Alex Wissner-Gross · ep#238 "Meta Buys Moltbook, GPT 5.4, and Fruitfly Brain Upload | Moonshots Live at The Abundance Summit 238" · source

Prior probability: 50.0%
Current probability: 39.5% (evolves via intake + LBP)
Conviction: 4/5
Signal quality: B
Resolution: pending
Window: 2026-04-30 – 2029-03-31
Edges in / out: 7 / 0
Tickers exposed: 33

Prediction text

From here forward, training data will be synthetic (pre-training era of human internet data is over) | There was no data ceiling. It was completely illusory... We we've reached orbit. We've reached escape velocity and now it's synthetic data from here on out.

Verbatim quote

From episode "Meta Buys Moltbook, GPT 5.4, and Fruitfly Brain Upload | Moonshots Live at The Abundance Summit 238"
There was no data ceiling. It was completely illusory... We we've reached orbit. We've reached escape velocity and now it's synthetic data from here on out.

Predictor: Alex Wissner-Gross

κ + Brier as of 2026-05-22
κ (discount): 0.844
Brier: 0.0341 (excellent)
Hits / Misses: 6 / 1 (of 11 resolved)
Hit rate: 54.5%
Calibration plot (stated vs observed)
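
For reference, the hit rate above is 6 hits out of 11 resolved predictions, and a Brier score is the mean squared gap between stated probabilities and binary outcomes. The sketch below shows both calculations. The `example` forecast list is hypothetical illustration data, not this predictor's actual record, and the function names are assumptions rather than anything from the tracker's codebase.

```python
def brier_score(forecasts):
    """Mean squared error between stated probability and outcome (1 = hit, 0 = miss)."""
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)


def hit_rate(hits, resolved):
    """Fraction of resolved predictions that were hits."""
    return hits / resolved


# Hypothetical (stated probability, outcome) pairs, for illustration only.
example = [(0.9, 1), (0.8, 1), (0.2, 0), (0.7, 1)]

print(round(brier_score(example), 4))
print(f"{hit_rate(6, 11):.1%}")  # 6 hits of 11 resolved, as shown on this page
```

Lower Brier scores are better: 0 is a perfect forecaster, and always stating 50% scores 0.25.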

Evidence about this node from Alex Wissner-Gross is multiplied by κ in /api/intake. Lower κ = less weight; floors at 0.10 (effectively silenced) and caps at 1.00 (full weight).
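
A minimal sketch of how a κ discount with this floor and cap could work. The page does not show the formula used by /api/intake, so the log-odds scaling here is an assumption, and `clamp_kappa` and `discounted_update` are hypothetical names.

```python
import math

KAPPA_FLOOR = 0.10  # from the text: effectively silenced
KAPPA_CAP = 1.00    # from the text: full weight


def clamp_kappa(kappa):
    """Keep kappa inside the [floor, cap] range described above."""
    return max(KAPPA_FLOOR, min(KAPPA_CAP, kappa))


def discounted_update(prior, likelihood_ratio, kappa):
    """Bayesian update with the evidence scaled by kappa.

    ASSUMPTION: kappa multiplies the log-likelihood ratio, so kappa = 1
    is a plain Bayes update and smaller kappa moves the prior less.
    """
    k = clamp_kappa(kappa)
    prior_logit = math.log(prior / (1.0 - prior))
    posterior_logit = prior_logit + k * math.log(likelihood_ratio)
    return 1.0 / (1.0 + math.exp(-posterior_logit))


# Hypothetical evidence with likelihood ratio 2, applied to a 50% prior.
print(discounted_update(0.5, 2.0, 1.0))    # full weight
print(discounted_update(0.5, 2.0, 0.844))  # this predictor's kappa
```

Under this assumption, the same evidence from a predictor with κ = 0.844 moves the probability slightly less than it would at full weight.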

Reference class

Not linked

This node isn't linked to a reference class. The Bayesian update applies without outside-view blending.

Probability over time

4 prob_history rows
[Chart: probability (0% to 100%) over time, 2026-04-30 to 2026-05-10; prior 50%, current = 39.5%. Legend: intake v2, milestone miss sweep, lbp propagation, reference class assigned, legacy v1, prior_prob (analyst seed).]

Milestone chain

Pre-event signals (upstream prereqs + window checkpoints) → resolution event → downstream cascades. Status/dates update from linked nodes; re-derive nightly via scripts/ops/derive_milestones.py.
Leading chain: 1 fired ✓ · 8 pending
  1. 2025-10-31 · hit · Microsoft Research SynthLLM and arXiv 2510.01631 establish synthetic-data scaling laws
    How: Peer-reviewed/preprint research establishes synthetic-data scaling laws for LLM pre-training with quantified mix ratios
    Source: https://arxiv.org/abs/2510.01631 · conf 95%
    Notes: HIT: Microsoft Research published SynthLLM and 'Demystifying Synthetic Data in LLM Pre-training' establishing 33% synthetic / 67% natural as optimal mix.
  2. 2026-11-01 · pending · Q1 window check-in (25%)
  3. 2026-06-01 → 2027-12-31 · pending · Frontier lab publicly confirms majority synthetic data in latest model pre-training
    How: OpenAI/Anthropic/Google DeepMind technical report or paper states >=50% of pre-training tokens for flagship model are synthetic
    Source: https://invisibletech.ai/blog/ai-training-in-2026-anchoring-synthetic-data-in-human-truth · conf 45%
    Notes: Wissner-Gross's strong claim that pre-training is 'completely synthetic from here on out'; current research suggests 33% mix is near-optimal, not 100%.
  4. 2026-06-01 → 2027-12-31 · pending · Epoch AI confirms exhaustion of high-quality public text corpus
    How: Epoch AI or equivalent research org publishes update confirming utilization of stock of human-generated public text
    Source: https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data · conf 70%
  5. 2027-05-05 · pending · Q2 window check-in (50%)
  6. 2026-06-01 → 2028-12-31 · pending · Frontier labs continue licensing/scraping fresh human data through 2027+
    How: Frontier labs continue announcing major data licensing deals (publishers, social, video) through 2027, implying human data is NOT obsolete
    Source: https://odsc.medium.com/the-top-10-llm-training-datasets-for-2026-40578afa9f89 · conf 85%
    Notes: Counter-evidence; would partially refute the 'pre-training era over' framing.
  7. 2027-11-06 · pending · Q3 window check-in (75%)
  8. 2027-01-01 → 2028-12-31 · pending · Pure synthetic-data pre-training run produces frontier-class model
    How: >=1 frontier-class model (top-5 leaderboard) trained with >90% synthetic pre-training tokens publicly released with technical report
    Source: https://www.microsoft.com/en-us/research/articles/synthllm-breaking-the-ai-data-wall-with-scalable-synthetic-data/ · conf 30%

No downstream cascades — this prediction is a leaf in the dependency graph.

What if this resolves?

Clamp this prediction TRUE or FALSE and run a counterfactual Gibbs sample, which surfaces the predictions whose marginals shift most under that assumption. Each run takes about 30 s and is suited to stress-testing "if X resolves, what else moves?"
(live posterior: 39%)
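
To illustrate the idea of clamping a node and resampling the rest, here is a toy Gibbs sampler on a pairwise binary model. It is a sketch under stated assumptions, not the tracker's implementation: the three-node network, its biases and couplings, and the function names are all hypothetical, and the real graph's factor structure is not shown on this page.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gibbs_marginals(bias, coupling, clamp=None, sweeps=4000, burn_in=500, seed=0):
    """Estimate P(node = 1) for every node of a pairwise binary model.

    bias: {node: log-odds offset}; coupling: {(a, b): symmetric weight}.
    clamp: {node: 0 or 1} holds those nodes fixed for the whole run.
    """
    rng = random.Random(seed)
    clamp = clamp or {}
    nodes = list(bias)
    neighbors = {n: [] for n in nodes}
    for (a, b), w in coupling.items():
        neighbors[a].append((b, w))
        neighbors[b].append((a, w))
    state = {n: clamp.get(n, rng.randint(0, 1)) for n in nodes}
    counts = {n: 0 for n in nodes}
    for sweep in range(sweeps + burn_in):
        for n in nodes:
            if n in clamp:
                continue  # clamped nodes are never resampled
            field = bias[n] + sum(w * state[m] for m, w in neighbors[n])
            state[n] = 1 if rng.random() < sigmoid(field) else 0
        if sweep >= burn_in:
            for n in nodes:
                counts[n] += state[n]
    return {n: counts[n] / sweeps for n in nodes}


def biggest_shifts(bias, coupling, node, value):
    """Rank the other nodes by how far their marginal moves when `node` is clamped."""
    base = gibbs_marginals(bias, coupling)
    clamped = gibbs_marginals(bias, coupling, clamp={node: value})
    shifts = {n: clamped[n] - base[n] for n in bias if n != node}
    return sorted(shifts.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical network: X is strongly tied to A and weakly tied to B.
bias = {"X": -1.5, "A": -1.5, "B": 0.0}
coupling = {("X", "A"): 3.0, ("X", "B"): 0.3}
print(biggest_shifts(bias, coupling, "X", 1))
```

In this toy network, clamping X to TRUE should move A much more than B, which is the kind of ranking the counterfactual view reports.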

Evidence chain

Every probability update with full Bayesian provenance — chronological, latest first
LBP · 2026-05-10T02:00:02Z · 39.5% · -1.8pp
Network propagation: 41.3% → 39.5%
6-iter LBP, residual 0.00584 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run e5c18d29
LBP · 2026-05-03T02:00:01Z · 41.3% · -3.6pp
Network propagation: 44.8% → 41.3%
6-iter LBP, residual 0.00677 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run 1a683ac9
LBP · 2026-04-30T16:39:51Z · 44.8% · -1.9pp
Network propagation: 46.7% → 44.8%
5-iter LBP, residual 0.00825 · damping 0.5, w_intrinsic 0.5 · method lbp_v2 · run 0c8a4ea3
LBP · 2026-04-30T02:18:57Z · 46.7% · -3.3pp
Network propagation: 50.0% → 46.7%
5-iter LBP, residual 0.00825 · damping 0.5, w_intrinsic 0.5 · method lbp_v1 · run 592311ef
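
The per-step changes in the chain can be re-derived from the displayed probabilities. The short check below copies the values from the rows above; the variable and function names are illustrative. Recomputing from the rounded figures gives -3.5pp for the 44.8% → 41.3% step, where the page shows -3.6pp. That difference would be expected if the page computes deltas from unrounded probabilities, which is an assumption here.

```python
# (timestamp, probability in %) copied from the evidence chain, oldest first.
history = [
    ("prior", 50.0),
    ("2026-04-30T02:18:57Z", 46.7),
    ("2026-04-30T16:39:51Z", 44.8),
    ("2026-05-03T02:00:01Z", 41.3),
    ("2026-05-10T02:00:02Z", 39.5),
]


def step_deltas(rows):
    """Percentage-point change between consecutive rows, rounded to 0.1pp."""
    return [round(cur - prev, 1) for (_, prev), (_, cur) in zip(rows, rows[1:])]


deltas = step_deltas(history)
total = round(history[-1][1] - history[0][1], 1)

print(deltas)  # one entry per LBP run
print(total)   # cumulative move from the 50% prior
```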

Network propagation neighbors

Top edges sorted by latest LBP cross-impact

Top incoming (parents)

Edges that influence THIS node's belief

Kind · Node · Their prob · P(c|s=T) · P(c|s=F) · Δ implied
prereq · S_AGI_MID_2029 (AGI mid: Kurzweil 2029 path) · 35.0% · 0.500 · 0.050 · -0.187
killer · TK03 (AI Regulatory Moratorium (EU/US Capability Freeze)) · 10.0% · 0.050 · 0.500 · +0.060
killer · TK01 (AGI Capability Plateau (2026-27 Training Stall)) · 15.0% · 0.050 · 0.500 · +0.038
killer · TK14 (Superbubble Pop (S&P 500 -40%, Moonshot Capital Evaporates)) · 20.0% · 0.050 · 0.500 · +0.015
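
The Δ implied values above agree, to within rounding, with a simple reading: take the marginal each parent implies on its own through the law of total probability, P(c) = P(c|s=T)·P(s) + P(c|s=F)·(1 − P(s)), and subtract this node's current probability of 39.5%. That reading is inferred from the numbers and is not documented on the page; the function names below are illustrative.

```python
CURRENT_PROB = 0.395  # this node's current probability (39.5%)

# (node, parent probability P(s), P(c|s=T), P(c|s=F)) from the table above.
parents = [
    ("S_AGI_MID_2029", 0.35, 0.500, 0.050),
    ("TK03", 0.10, 0.050, 0.500),
    ("TK01", 0.15, 0.050, 0.500),
    ("TK14", 0.20, 0.050, 0.500),
]


def implied_marginal(p_s, p_c_given_true, p_c_given_false):
    """Law of total probability over the parent's two states."""
    return p_s * p_c_given_true + (1.0 - p_s) * p_c_given_false


def implied_delta(p_s, p_c_given_true, p_c_given_false, current=CURRENT_PROB):
    """ASSUMPTION: delta = single-parent implied marginal minus current probability."""
    return implied_marginal(p_s, p_c_given_true, p_c_given_false) - current


for node, p_s, p_t, p_f in parents:
    print(node, f"{implied_delta(p_s, p_t, p_f):+.4f}")
```

For example, the prerequisite edge gives 0.35 × 0.5 + 0.65 × 0.05 = 0.2075, which sits about 0.19 below the current 0.395 and so pulls the belief down. Each killer implies a marginal above 0.395 because the killers themselves are unlikely.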

Top outgoing (children)

Predictions THIS node influences

No outgoing edges.

Ticker exposure

33 ticker(s) linked

Beneficiaries (23)

SOUN, CRWV, SITM, NVDA, ARM, GTLB, BBAI, TSM, APLD, CEVA, AI, MSFT, MRVL, SFTBY, ORCL, QCOM, AVGO, BABA, AMD, GOOGL, IBM, AMZN, META

Adverse (6)

WNS, CHGG, CTSH, IBM, INFY, ACN

Prerequisites (7)

Predictions that must hit first
Type · Pred · Title · Domain · Lag
prereq · S_AGI_MID_2029 · AGI mid: Kurzweil 2029 path · agi_general_capability · -
correlate · S_AGI_FAST_2027 · AGI fast: drop-in remote worker by 2027-09 · agi_general_capability · -
correlate · S_AGI_SLOW_2031 · AGI slow: Schmidt/Hassabis 5-10 year path · agi_general_capability · -
correlate · S_AGI_WINTER_2036PLUS · AGI delayed: capability plateau or AI winter · agi_general_capability · -
killer · TK14 · Superbubble Pop (S&P 500 -40%, Moonshot Capital Evaporates) · - · -
killer · TK01 · AGI Capability Plateau (2026-27 Training Stall) · - · -
killer · TK03 · AI Regulatory Moratorium (EU/US Capability Freeze) · - · -

Dependents (0)

Predictions enabled by this
Type · Pred · Title · Domain · Lag
No dependents

Linked documents (9)

Auto-generated by cosine similarity from Polymarket / Manifold / EDGAR / GDELT
Sim · Source · Title · Market prob · Polarity · Reviewed · Published
0.647 · arxiv · Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters · - · mentions · pending · 2026-05-07
0.638 · arxiv · Practical validation of synthetic pre-crash scenarios · - · mentions · pending · 2026-05-06
0.635 · arxiv · It does what it says on the tin: safe synthetic data from coarsened margins · - · mentions · pending · 2026-06-01
0.613 · github_release · facebookresearch/AudioDec pretrain_models_v02 · - · mentions · pending · 2024-01-03
0.594 · gdelt · the era of the pilot is over the era of the agent is here google cloud wants you to unlock the power of your data · - · mentions · pending · 2026-04-30
0.576 · github_release · facebookresearch/balance 0.5.0 · - · mentions · pending · 2023-03-08
0.568 · github_release · facebookresearch/balance 0.3.1 · - · mentions · pending · 2023-02-01
0.567 · github_release · facebookresearch/Replica-Dataset v1.0 · - · mentions · pending · 2019-06-14
0.563 · github_release · facebookresearch/sound-spaces v0.1.1 · - · mentions · pending · 2021-02-22
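
The Sim column is described as a cosine similarity. A minimal sketch of that ranking step follows. The embedding vectors and document names are hypothetical, since the page does not say which embedding model or vector store is used, and `rank_documents` is an illustrative name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, docs, top_k=9):
    """Return the top_k (similarity, title) pairs, highest similarity first."""
    scored = [(cosine_similarity(query_vec, vec), title) for title, vec in docs.items()]
    return sorted(scored, reverse=True)[:top_k]


# Hypothetical 3-dimensional embeddings, for illustration only.
prediction_vec = [0.9, 0.1, 0.3]
docs = {
    "doc_close": [0.8, 0.2, 0.4],
    "doc_far": [0.0, 1.0, 0.0],
}
for sim, title in rank_documents(prediction_vec, docs):
    print(f"{sim:.3f} {title}")
```

Similarity of this kind measures textual closeness only, which may explain why several of the linked items above look topically unrelated to the prediction despite scores near 0.6.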

Raw metadata

From Thesis_Timeline_v1.0_FINAL workbook
{
  "nia": false,
  "url": "https://www.youtube.com/watch?v=d__HRChE2ZE",
  "mode": "THESIS",
  "role": "Host",
  "context": "There was no data ceiling. It was completely elucory. And I I think history will look back at this moment... Similarly, the internet which was collected by a bunch of fat fingers punching keyboards... was just the biological bootloadader for an era of synthetic data... We've reached escape velocity and now it's synthetic data from here on out.",
  "verbatim": "There was no data ceiling. It was completely elucory... We we've reached orbit. We've reached escape velocity and now it's synthetic data from here on out.",
  "conv_cues": "no data ceiling; escape velocity",
  "direction": "HAPPEN",
  "timeframe": "Ongoing",
  "conv_level": "HIGH",
  "milestones": [
    {
      "kind": "llm_pre_event",
      "label": "Microsoft Research SynthLLM and arXiv 2510.01631 establish synthetic-data scaling laws",
      "notes": "HIT — Microsoft Research published SynthLLM and 'Demystifying Synthetic Data in LLM Pre-training' establishing 33% synthetic / 67% natural as optimal mix.",
      "source": "https://arxiv.org/abs/2510.01631",
      "status": "hit",
      "weight": 0.4,
      "ordinal": -9,
      "source_id": null,
      "confidence": 0.95,
      "source_url": "https://arxiv.org/abs/2510.01631",
      "expected_date": "2025-10-31",
      "observed_date": "2025-10-31",
      "research_origin": "deep_research",
      "measurement_criterion": "Peer-reviewed/preprint research establishes synthetic-data scaling laws for LLM pre-training with quantified mix ratios"
    },
    {
      "kind": "quartile_checkpoint",
      "label": "Q1 window check-in (25%)",
      "status": "pending",
      "weight": 0.05,
      "ordinal": -8,
      "source_id": null,
      "expected_date": "2026-11-01",
      "observed_date": null
    },
    {
      "kind": "llm_pre_event",
      "label": "Frontier lab publicly confirms majority synthetic data in latest model pre-training",
      "notes": "Wissner-Gross's strong claim that pre-training is 'completely synthetic from here on out' — current research suggests 33% mix is near-optimal, not 100%.",
      "source": "https://invisibletech.ai/blog/ai-training-in-2026-anchoring-synthetic-data-in-human-truth",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -7,
      "source_id": null,
      "confidence": 0.45,
      "source_url": "https://invisibletech.ai/blog/ai-training-in-2026-anchoring-synthetic-data-in-human-truth",
      "expected_date": "2027-03-17",
      "research_origin": "deep_research",
      "expected_date_range": {
        "to": "2027-12-31",
        "from": "2026-06-01"
      },
      "measurement_criterion": "OpenAI/Anthropic/Google DeepMind technical report or paper states >=50% of pre-training tokens for flagship model are synthetic"
    },
    {
      "kind": "llm_pre_event",
      "label": "Epoch AI confirms exhaustion of high-quality public text corpus",
      "source": "https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -6,
      "source_id": null,
      "confidence": 0.7,
      "source_url": "https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data",
      "expected_date": "2027-03-17",
      "research_origin": "deep_research",
      "expected_date_range": {
        "to": "2027-12-31",
        "from": "2026-06-01"
      },
      "measurement_criterion": "Epoch AI or equivalent research org publishes update confirming utilization of stock of human-generated public text"
    },
    {
      "kind": "quartile_checkpoint",
      "label": "Q2 window check-in (50%)",
      "status": "pending",
      "weight": 0.05,
      "ordinal": -5,
      "source_id": null,
      "expected_date": "2027-05-05",
      "observed_date": null
    },
    {
      "kind": "llm_post_event",
      "label": "Frontier labs continue licensi
... (truncated)