238_022 · prediction · AI · AI-timing

From here forward, training data will be synthetic (pre-training era of human internet data is over)

Predictor: Alex Wissner-Gross · ep#238 "Meta Buys Moltbook, GPT 5.4, and Fruitfly Brain Upload | Moonshots Live at The Abundance Summit 238"

Prior probability

50.0%

Current probability

39.5%

evolves via intake + LBP

Conviction

4/5

Signal quality

Resolution

pending

Window

2026-04-30 – 2029-03-31

Edges in / out

7 / 0

Tickers exposed

Prediction text

From here forward, training data will be synthetic (pre-training era of human internet data is over) | There was no data ceiling. It was completely illusory... We we've reached orbit. We've reached escape velocity and now it's synthetic data from here on out.

Verbatim quote

From episode "Meta Buys Moltbook, GPT 5.4, and Fruitfly Brain Upload | Moonshots Live at The Abundance Summit 238"

There was no data ceiling. It was completely illusory... We we've reached orbit. We've reached escape velocity and now it's synthetic data from here on out.

Predictor: Alex Wissner-Gross

κ + Brier as of 2026-05-22

κ (discount)

0.844

Brier

0.0341

excellent

Hits / Misses

6 / 1

of 11 resolved

Hit rate

54.5%

Calibration plot (stated vs observed)
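
For reference, the Brier score reported above is the mean squared difference between stated probabilities and 0/1 outcomes. Lower is better: 0 is perfect, and always answering 50% scores 0.25. A minimal sketch follows; the forecast values are hypothetical and are not this predictor's actual record.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical resolved predictions (stated probability, observed outcome).
stated = [0.9, 0.8, 0.2, 0.95]
observed = [1, 1, 0, 1]

score = brier_score(stated, observed)
print(f"Brier = {score:.4f}")
```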

Evidence about this node from Alex Wissner-Gross is multiplied by κ in /api/intake. Lower κ = less weight; floors at 0.10 (effectively silenced) and caps at 1.00 (full weight).
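
A minimal sketch of the κ discount described above. Only the 0.10 floor and 1.00 cap come from the text. The function names, and the reading of "multiplied by κ" as scaling a log-likelihood ratio before a log-odds update, are illustrative assumptions and not the actual /api/intake implementation.

```python
import math

KAPPA_FLOOR = 0.10  # from the text: effectively silenced
KAPPA_CAP = 1.00    # from the text: full weight


def clamp_kappa(kappa: float) -> float:
    """Clamp a predictor's discount factor to [0.10, 1.00]."""
    return max(KAPPA_FLOOR, min(KAPPA_CAP, kappa))


def discounted_update(prior: float, log_lr: float, kappa: float) -> float:
    """Hypothetical update: scale the evidence log-likelihood ratio by kappa,
    then apply it to the prior in log-odds space."""
    logit = math.log(prior / (1.0 - prior))
    logit += clamp_kappa(kappa) * log_lr
    return 1.0 / (1.0 + math.exp(-logit))


# Evidence with a 3:1 likelihood ratio from a predictor with kappa = 0.844.
posterior = discounted_update(prior=0.50, log_lr=math.log(3.0), kappa=0.844)
print(f"posterior = {posterior:.3f}")
```

Under this assumption, a predictor at κ = 1.00 would move a 50% prior to 75% on the same evidence, while lower κ values leave the posterior closer to the prior.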

Reference class

Not linked

This node isn't linked to a reference class. The Bayesian update applies without outside-view blending.

Probability over time

4 prob_history rows

Legend: intake v2 · milestone miss sweep · lbp propagation · reference class assigned · legacy v1 · prior_prob (analyst seed) · current = 39.5%

Milestone chain

Pre-event signals (upstream prereqs + window checkpoints) → resolution event → downstream cascades. Status/dates update from linked nodes; re-derive nightly via scripts/ops/derive_milestones.py.

Leading chain: 1 fired ✓ · 8 pending

2025-10-31 · hit · Microsoft Research SynthLLM and arXiv 2510.01631 establish synthetic-data scaling laws
How: Peer-reviewed/preprint research establishes synthetic-data scaling laws for LLM pre-training with quantified mix ratios
Source: https://arxiv.org/abs/2510.01631 · conf 95%
Notes: HIT. Microsoft Research published SynthLLM and 'Demystifying Synthetic Data in LLM Pre-training', establishing 33% synthetic / 67% natural as optimal mix.

2026-11-01 · pending · Q1 window check-in (25%)

2026-06-01 → 2027-12-31 · pending · Frontier lab publicly confirms majority synthetic data in latest model pre-training
How: OpenAI/Anthropic/Google DeepMind technical report or paper states >=50% of pre-training tokens for flagship model are synthetic
Source: https://invisibletech.ai/blog/ai-training-in-2026-anchoring-synthetic-data-in-human-truth · conf 45%
Notes: Wissner-Gross's strong claim is that pre-training is 'completely synthetic from here on out'; current research suggests a 33% mix is near-optimal, not 100%.

2026-06-01 → 2027-12-31 · pending · Epoch AI confirms exhaustion of high-quality public text corpus
How: Epoch AI or equivalent research org publishes update confirming utilization of stock of human-generated public text
Source: https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data · conf 70%

2027-05-05 · pending · Q2 window check-in (50%)

2026-06-01 → 2028-12-31 · pending · Frontier labs continue licensing/scraping fresh human data through 2027+
How: Frontier labs continue announcing major data licensing deals (publishers, social, video) through 2027, implying human data is NOT obsolete
Source: https://odsc.medium.com/the-top-10-llm-training-datasets-for-2026-40578afa9f89 · conf 85%
Notes: Counter-evidence; would partially refute the 'pre-training era over' framing.

2027-09-30 · pending · Scenario fires: AGI fast: drop-in remote worker by 2027-09

2027-11-06 · pending · Q3 window check-in (75%)

2027-01-01 → 2028-12-31 · pending · Pure synthetic-data pre-training run produces frontier-class model
How: >=1 frontier-class model (top-5 leaderboard) trained with >90% synthetic pre-training tokens publicly released with technical report
Source: https://www.microsoft.com/en-us/research/articles/synthllm-breaking-the-ai-data-wall-with-scalable-synthetic-data/ · conf 30%

2028-05-10 · pending · From here forward, training data will be synthetic (pre-training era of human internet data is over)

No downstream cascades — this prediction is a leaf in the dependency graph.

What if this resolves?

Clamp this prediction TRUE or FALSE and run a counterfactual Gibbs sample, which surfaces the predictions whose marginals shift most under that assumption. Each run takes about 30 seconds and is suited to stress-testing "if X resolves, what else moves?"

(live posterior: 39%)
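
The clamp-and-resample idea can be illustrated on a toy network. The sketch below assumes a small binary pairwise model with made-up node names, biases, and couplings. The real graph, its parameterization, and its sampler settings are not described on this page, so every name and number here is hypothetical.

```python
import math
import random


def gibbs_marginals(bias, weights, clamp=None, sweeps=4000, burn_in=500, seed=0):
    """Estimate P(x_i = 1) in a binary pairwise model by Gibbs sampling.

    bias:    {node: float} unary log-odds terms
    weights: {(a, b): float} symmetric couplings (positive = move together)
    clamp:   {node: 0 or 1} nodes held fixed for the counterfactual
    """
    rng = random.Random(seed)
    clamp = clamp or {}
    nodes = list(bias)
    neighbors = {n: [] for n in nodes}
    for (a, b), w in weights.items():
        neighbors[a].append((b, w))
        neighbors[b].append((a, w))

    state = {n: clamp.get(n, rng.randint(0, 1)) for n in nodes}
    counts = {n: 0 for n in nodes}
    for sweep in range(sweeps):
        for n in nodes:
            if n in clamp:
                continue  # clamped nodes are never resampled
            # Conditional log-odds of n given the current state of its neighbors.
            field = bias[n] + sum(w * state[m] for m, w in neighbors[n])
            p_one = 1.0 / (1.0 + math.exp(-field))
            state[n] = 1 if rng.random() < p_one else 0
        if sweep >= burn_in:
            for n in nodes:
                counts[n] += state[n]
    kept = sweeps - burn_in
    return {n: counts[n] / kept for n in nodes}


# Hypothetical three-node network; X plays the role of the clamped prediction.
bias = {"X": -0.4, "Y": -0.2, "Z": 0.1}
weights = {("X", "Y"): 1.5, ("X", "Z"): -1.0}

baseline = gibbs_marginals(bias, weights)
if_true = gibbs_marginals(bias, weights, clamp={"X": 1})

# Rank the other nodes by how far their marginals move under the clamp.
shifts = sorted(
    ((n, if_true[n] - baseline[n]) for n in bias if n != "X"),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for node, delta in shifts:
    print(f"{node}: {delta:+.3f}")
```

In this toy setup, clamping X to TRUE raises the marginal of the positively coupled node Y and lowers that of the negatively coupled node Z, which is the kind of shift ranking the panel describes.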

Evidence chain

Every probability update with full Bayesian provenance — chronological, latest first

LBP · 2026-05-10T02:00:02Z · 39.5% · -1.8pp

Network propagation: 41.3% → 39.5%

6-iter LBP, residual 0.00584 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run e5c18d29

LBP · 2026-05-03T02:00:01Z · 41.3% · -3.6pp

Network propagation: 44.8% → 41.3%

6-iter LBP, residual 0.00677 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run 1a683ac9

LBP · 2026-04-30T16:39:51Z · 44.8% · -1.9pp

Network propagation: 46.7% → 44.8%

5-iter LBP, residual 0.00825 · damping 0.5, w_intrinsic 0.5 · method lbp_v2 · run 0c8a4ea3

LBP · 2026-04-30T02:18:57Z · 46.7% · -3.3pp

Network propagation: 50.0% → 46.7%

5-iter LBP, residual 0.00825 · damping 0.5, w_intrinsic 0.5 · method lbp_v1 · run 592311ef

Network propagation neighbors

Top edges sorted by latest LBP cross-impact

Top incoming (parents)

Edges that influence THIS node's belief

Kind	Node	Their prob	P(c|s=T)	P(c|s=F)	Δ implied
prereq	S_AGI_MID_2029 AGI mid: Kurzweil 2029 path	35.0%	0.500	0.050	-0.187
killer	TK03 AI Regulatory Moratorium (EU/US Capability Freeze)	10.0%	0.050	0.500	+0.060
killer	TK01 AGI Capability Plateau (2026-27 Training Stall)	15.0%	0.050	0.500	+0.038
killer	TK14 Superbubble Pop (S&P 500 -40%, Moonshot Capital Evaporates)	20.0%	0.050	0.500	+0.015
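
The Δ implied column appears consistent with a simple calculation: take the marginal for this node implied by each parent alone (law of total probability) and subtract the current probability of 39.5%. This is an inference from the numbers shown, not a documented formula. The sketch below reproduces the column to within rounding of the displayed values.

```python
def implied_marginal(p_source, p_c_given_true, p_c_given_false):
    """P(c) implied by one parent: P(s)P(c|s=T) + (1 - P(s))P(c|s=F)."""
    return p_source * p_c_given_true + (1.0 - p_source) * p_c_given_false


current = 0.395  # current probability of this node, as displayed

# (their prob, P(c|s=T), P(c|s=F)) copied from the table above.
edges = {
    "S_AGI_MID_2029": (0.35, 0.500, 0.050),
    "TK03": (0.10, 0.050, 0.500),
    "TK01": (0.15, 0.050, 0.500),
    "TK14": (0.20, 0.050, 0.500),
}

deltas = {name: implied_marginal(*row) - current for name, row in edges.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

Read this way, the prereq edge pulls the node down because its parent is unlikely (35.0%), while each killer edge pulls it slightly up because the killer scenario is unlikely to fire.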

Top outgoing (children)

Predictions THIS node influences

No outgoing edges.

Ticker exposure

33 ticker(s) linked

Beneficiaries (23)

SOUN CRWV SITM NVDA ARM GTLB BBAI TSM APLD CEVA AI MSFT MRVL SFTBY ORCL QCOM AVGO BABA AMD GOOGL IBM AMZN META

Adverse (6)

WNS CHGG CTSH IBM INFY ACN

Prerequisites (7)

Predictions that must hit first

Type	Pred	Title	Domain	Lag
prereq	S_AGI_MID_2029	AGI mid: Kurzweil 2029 path	agi_general_capability	—
correlate	S_AGI_FAST_2027	AGI fast: drop-in remote worker by 2027-09	agi_general_capability	—
correlate	S_AGI_SLOW_2031	AGI slow: Schmidt/Hassabis 5-10 year path	agi_general_capability	—
correlate	S_AGI_WINTER_2036PLUS	AGI delayed: capability plateau or AI winter	agi_general_capability	—
killer	TK14	Superbubble Pop (S&P 500 -40%, Moonshot Capital Evaporates)	—	—
killer	TK01	AGI Capability Plateau (2026-27 Training Stall)	—	—
killer	TK03	AI Regulatory Moratorium (EU/US Capability Freeze)	—	—

Dependents (0)

Predictions enabled by this

Type	Pred	Title	Domain	Lag
No dependents

Linked documents (9)

Auto-generated by cosine similarity from Polymarket / Manifold / EDGAR / GDELT

Sim	Source	Title	Market prob	Polarity	Reviewed	Published
0.647	arxiv	Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters	—	mentions	pending	2026-05-07
0.638	arxiv	Practical validation of synthetic pre-crash scenarios	—	mentions	pending	2026-05-06
0.635	arxiv	It does what it says on the tin: safe synthetic data from coarsened margins	—	mentions	pending	2026-06-01
0.613	github_release	facebookresearch/AudioDec pretrain_models_v02	—	mentions	pending	2024-01-03
0.594	gdelt	the era of the pilot is over the era of the agent is here google cloud wants you to unlock the power of your data	—	mentions	pending	2026-04-30
0.576	github_release	facebookresearch/balance 0.5.0	—	mentions	pending	2023-03-08
0.568	github_release	facebookresearch/balance 0.3.1	—	mentions	pending	2023-02-01
0.567	github_release	facebookresearch/Replica-Dataset v1.0	—	mentions	pending	2019-06-14
0.563	github_release	facebookresearch/sound-spaces v0.1.1	—	mentions	pending	2021-02-22
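
The Sim column is described as a cosine similarity. A minimal sketch of that ranking follows, assuming each prediction and document has already been turned into a numeric vector. The vectors and document names are hypothetical, and the embedding model used by the site is not stated.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical embeddings; real ones would have many more dimensions.
prediction_vec = [0.2, 0.7, 0.1]
documents = {
    "doc_a": [0.1, 0.8, 0.1],
    "doc_b": [0.9, 0.1, 0.0],
}

# Highest similarity first, mirroring the sort order of the table.
ranked = sorted(
    ((cosine_similarity(prediction_vec, vec), name) for name, vec in documents.items()),
    reverse=True,
)
for sim, name in ranked:
    print(f"{sim:.3f}\t{name}")
```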

Raw metadata

From Thesis_Timeline_v1.0_FINAL workbook

{
  "nia": false,
  "url": "https://www.youtube.com/watch?v=d__HRChE2ZE",
  "mode": "THESIS",
  "role": "Host",
  "context": "There was no data ceiling. It was completely elucory. And I I think history will look back at this moment... Similarly, the internet which was collected by a bunch of fat fingers punching keyboards... was just the biological bootloadader for an era of synthetic data... We've reached escape velocity and now it's synthetic data from here on out.",
  "verbatim": "There was no data ceiling. It was completely elucory... We we've reached orbit. We've reached escape velocity and now it's synthetic data from here on out.",
  "conv_cues": "no data ceiling; escape velocity",
  "direction": "HAPPEN",
  "timeframe": "Ongoing",
  "conv_level": "HIGH",
  "milestones": [
    {
      "kind": "llm_pre_event",
      "label": "Microsoft Research SynthLLM and arXiv 2510.01631 establish synthetic-data scaling laws",
      "notes": "HIT — Microsoft Research published SynthLLM and 'Demystifying Synthetic Data in LLM Pre-training' establishing 33% synthetic / 67% natural as optimal mix.",
      "source": "https://arxiv.org/abs/2510.01631",
      "status": "hit",
      "weight": 0.4,
      "ordinal": -9,
      "source_id": null,
      "confidence": 0.95,
      "source_url": "https://arxiv.org/abs/2510.01631",
      "expected_date": "2025-10-31",
      "observed_date": "2025-10-31",
      "research_origin": "deep_research",
      "measurement_criterion": "Peer-reviewed/preprint research establishes synthetic-data scaling laws for LLM pre-training with quantified mix ratios"
    },
    {
      "kind": "quartile_checkpoint",
      "label": "Q1 window check-in (25%)",
      "status": "pending",
      "weight": 0.05,
      "ordinal": -8,
      "source_id": null,
      "expected_date": "2026-11-01",
      "observed_date": null
    },
    {
      "kind": "llm_pre_event",
      "label": "Frontier lab publicly confirms majority synthetic data in latest model pre-training",
      "notes": "Wissner-Gross's strong claim that pre-training is 'completely synthetic from here on out' — current research suggests 33% mix is near-optimal, not 100%.",
      "source": "https://invisibletech.ai/blog/ai-training-in-2026-anchoring-synthetic-data-in-human-truth",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -7,
      "source_id": null,
      "confidence": 0.45,
      "source_url": "https://invisibletech.ai/blog/ai-training-in-2026-anchoring-synthetic-data-in-human-truth",
      "expected_date": "2027-03-17",
      "research_origin": "deep_research",
      "expected_date_range": {
        "to": "2027-12-31",
        "from": "2026-06-01"
      },
      "measurement_criterion": "OpenAI/Anthropic/Google DeepMind technical report or paper states >=50% of pre-training tokens for flagship model are synthetic"
    },
    {
      "kind": "llm_pre_event",
      "label": "Epoch AI confirms exhaustion of high-quality public text corpus",
      "source": "https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -6,
      "source_id": null,
      "confidence": 0.7,
      "source_url": "https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data",
      "expected_date": "2027-03-17",
      "research_origin": "deep_research",
      "expected_date_range": {
        "to": "2027-12-31",
        "from": "2026-06-01"
      },
      "measurement_criterion": "Epoch AI or equivalent research org publishes update confirming utilization of stock of human-generated public text"
    },
    {
      "kind": "quartile_checkpoint",
      "label": "Q2 window check-in (50%)",
      "status": "pending",
      "weight": 0.05,
      "ordinal": -5,
      "source_id": null,
      "expected_date": "2027-05-05",
      "observed_date": null
    },
    {
      "kind": "llm_post_event",
      "label": "Frontier labs continue licensi
... (truncated)