From here forward, training data will be synthetic (pre-training era of human internet data is over)
Predictor: Alex Wissner-Gross · ep. #238 "Meta Buys Moltbook, GPT 5.4, and Fruitfly Brain Upload | Moonshots Live at The Abundance Summit 238"
Prediction text
From here forward, training data will be synthetic (pre-training era of human internet data is over)
Verbatim quote
"There was no data ceiling. It was completely illusory... We've reached orbit. We've reached escape velocity and now it's synthetic data from here on out."
Predictor: Alex Wissner-Gross
Calibration plot (stated vs observed)
Evidence about this node from Alex Wissner-Gross is multiplied by κ in /api/intake. A lower κ gives the evidence less weight; κ is floored at 0.10 (effectively silenced) and capped at 1.00 (full weight).
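As a rough illustration of the κ weighting described above, the sketch below clamps κ to the stated [0.10, 1.00] range and uses it to scale a piece of evidence before a Bayesian update. The page does not say what form the evidence takes inside /api/intake, so treating it as a log-likelihood ratio applied in log-odds space is an assumption, and the function names `clamp_kappa` and `weighted_update` are hypothetical.

```python
import math

# Bounds stated on the page: kappa floors at 0.10 and caps at 1.00.
KAPPA_FLOOR = 0.10
KAPPA_CAP = 1.00


def clamp_kappa(kappa: float) -> float:
    """Restrict kappa to the [floor, cap] range given on the page."""
    return max(KAPPA_FLOOR, min(KAPPA_CAP, kappa))


def weighted_update(prior: float, likelihood_ratio: float, kappa: float) -> float:
    """Update a probability with evidence whose weight is scaled by kappa.

    ASSUMPTION: evidence is a likelihood ratio and kappa multiplies its
    logarithm. The real /api/intake logic is not shown in the source.
    """
    k = clamp_kappa(kappa)
    prior_log_odds = math.log(prior / (1.0 - prior))
    posterior_log_odds = prior_log_odds + k * math.log(likelihood_ratio)
    return 1.0 / (1.0 + math.exp(-posterior_log_odds))


if __name__ == "__main__":
    # Same evidence, different predictor weights.
    for kappa in (1.00, 0.50, 0.02):
        print(kappa, round(weighted_update(0.5, 4.0, kappa), 4))
```

Under this reading, a predictor at the floor still moves the estimate slightly, which matches the page's description of 0.10 as "effectively silenced" rather than ignored outright.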
Reference class
This node isn't linked to a reference class. The Bayesian update applies without outside-view blending.
Probability over time
Milestone chain
- 2025-10-31 · hit · Microsoft Research SynthLLM and arXiv 2510.01631 establish synthetic-data scaling laws
  - How: Peer-reviewed/preprint research establishes synthetic-data scaling laws for LLM pre-training with quantified mix ratios
  - Source: https://arxiv.org/abs/2510.01631
  - Confidence: 95%
  - Notes: HIT. Microsoft Research published SynthLLM and 'Demystifying Synthetic Data in LLM Pre-training', establishing 33% synthetic / 67% natural as the optimal mix.
- 2026-11-01 · pending · Q1 window check-in (25%)
- 2026-06-01 → 2027-12-31 · pending · Frontier lab publicly confirms majority synthetic data in latest model pre-training
  - How: OpenAI/Anthropic/Google DeepMind technical report or paper states >=50% of pre-training tokens for flagship model are synthetic
  - Source: https://invisibletech.ai/blog/ai-training-in-2026-anchoring-synthetic-data-in-human-truth
  - Confidence: 45%
  - Notes: Wissner-Gross makes the strong claim that pre-training is 'completely synthetic from here on out'; current research suggests a 33% mix is near-optimal, not 100%.
- 2026-06-01 → 2027-12-31 · pending · Epoch AI confirms exhaustion of high-quality public text corpus
  - How: Epoch AI or equivalent research org publishes update confirming utilization of stock of human-generated public text
  - Source: https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data
  - Confidence: 70%
- 2027-05-05 · pending · Q2 window check-in (50%)
- 2026-06-01 → 2028-12-31 · pending · Frontier labs continue licensing/scraping fresh human data through 2027+
  - How: Frontier labs continue announcing major data licensing deals (publishers, social, video) through 2027, implying human data is NOT obsolete
  - Source: https://odsc.medium.com/the-top-10-llm-training-datasets-for-2026-40578afa9f89
  - Confidence: 85%
  - Notes: Counter-evidence; would partially refute the 'pre-training era over' framing.
- 2027-11-06 · pending · Q3 window check-in (75%)
- 2027-01-01 → 2028-12-31 · pending · Pure synthetic-data pre-training run produces frontier-class model
  - How: >=1 frontier-class model (top-5 leaderboard) trained with >90% synthetic pre-training tokens publicly released with technical report
  - Source: https://www.microsoft.com/en-us/research/articles/synthllm-breaking-the-ai-data-wall-with-scalable-synthetic-data/
  - Confidence: 30%
No downstream cascades — this prediction is a leaf in the dependency graph.
What if this resolves?
Clamping this prediction to a resolved outcome and running a Gibbs sample returns the predictions whose marginals shift most. Each run takes about 30 s, which makes it suited to stress-testing "if X resolves, what else moves?"
Evidence chain
Network propagation neighbors
Top incoming (parents)
Edges that influence THIS node's belief
| Kind | Node | Their prob | P(c|s=T) | P(c|s=F) | Δ implied |
|---|---|---|---|---|---|
| prereq | S_AGI_MID_2029 AGI mid: Kurzweil 2029 path | 35.0% | 0.500 | 0.050 | -0.187 |
| killer | TK03 AI Regulatory Moratorium (EU/US Capability Freeze) | 10.0% | 0.050 | 0.500 | +0.060 |
| killer | TK01 AGI Capability Plateau (2026-27 Training Stall) | 15.0% | 0.050 | 0.500 | +0.038 |
| killer | TK14 Superbubble Pop (S&P 500 -40%, Moonshot Capital Evaporates) | 20.0% | 0.050 | 0.500 | +0.015 |
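The Δ implied column appears to follow from the law of total probability: each parent s implies a marginal for this node c of P(s)·P(c|s=T) + (1 − P(s))·P(c|s=F), and Δ is that value minus the node's current probability. The sketch below reproduces the four rows under that reading. The baseline of 0.395 is not stated anywhere on the page; it is inferred from the table, since it is the value that makes all four rows agree to within rounding. Function names are hypothetical.

```python
# Rows copied from the "Top incoming (parents)" table:
# (node, P(s), P(c|s=T), P(c|s=F), listed delta)
PARENTS = [
    ("S_AGI_MID_2029", 0.35, 0.500, 0.050, -0.187),
    ("TK03", 0.10, 0.050, 0.500, +0.060),
    ("TK01", 0.15, 0.050, 0.500, +0.038),
    ("TK14", 0.20, 0.050, 0.500, +0.015),
]

# ASSUMPTION: current probability of this node, inferred from the table
# rather than read from the page.
BASELINE = 0.395


def implied_marginal(p_s: float, p_c_given_t: float, p_c_given_f: float) -> float:
    """Law of total probability over a single binary parent."""
    return p_s * p_c_given_t + (1.0 - p_s) * p_c_given_f


def implied_deltas(rows, baseline: float) -> dict:
    """Implied marginal minus the baseline, keyed by parent node."""
    return {
        name: implied_marginal(p_s, p_t, p_f) - baseline
        for name, p_s, p_t, p_f, _listed in rows
    }


if __name__ == "__main__":
    deltas = implied_deltas(PARENTS, BASELINE)
    for name, _p_s, _p_t, _p_f, listed in PARENTS:
        print(f"{name}: computed {deltas[name]:+.4f}, listed {listed:+.3f}")
```

The implied marginals come out to 0.2075, 0.455, 0.4325, and 0.41 respectively. The prereq edge pulls the estimate down because the parent sits at only 35%, while the three killer edges push it up because each killer is itself unlikely.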
Top outgoing (children)
Predictions THIS node influences
No outgoing edges.
Ticker exposure
Beneficiaries (23)
Adverse (6)
Prerequisites (7)
| Type | Pred | Title | Domain | Lag |
|---|---|---|---|---|
| prereq | S_AGI_MID_2029 | AGI mid: Kurzweil 2029 path | agi_general_capability | — |
| correlate | S_AGI_FAST_2027 | AGI fast: drop-in remote worker by 2027-09 | agi_general_capability | — |
| correlate | S_AGI_SLOW_2031 | AGI slow: Schmidt/Hassabis 5-10 year path | agi_general_capability | — |
| correlate | S_AGI_WINTER_2036PLUS | AGI delayed: capability plateau or AI winter | agi_general_capability | — |
| killer | TK14 | Superbubble Pop (S&P 500 -40%, Moonshot Capital Evaporates) | — | — |
| killer | TK01 | AGI Capability Plateau (2026-27 Training Stall) | — | — |
| killer | TK03 | AI Regulatory Moratorium (EU/US Capability Freeze) | — | — |
Dependents (0)
| Type | Pred | Title | Domain | Lag |
|---|---|---|---|---|
| No dependents | | | | |
Linked documents (9)
| Sim | Source | Title | Market prob | Polarity | Reviewed | Published |
|---|---|---|---|---|---|---|
| 0.647 | arxiv | Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters | — | mentions | pending | 2026-05-07 |
| 0.638 | arxiv | Practical validation of synthetic pre-crash scenarios | — | mentions | pending | 2026-05-06 |
| 0.635 | arxiv | It does what it says on the tin: safe synthetic data from coarsened margins | — | mentions | pending | 2026-06-01 |
| 0.613 | github_release | facebookresearch/AudioDec pretrain_models_v02 | — | mentions | pending | 2024-01-03 |
| 0.594 | gdelt | the era of the pilot is over the era of the agent is here google cloud wants you to unlock the power of your data | — | mentions | pending | 2026-04-30 |
| 0.576 | github_release | facebookresearch/balance 0.5.0 | — | mentions | pending | 2023-03-08 |
| 0.568 | github_release | facebookresearch/balance 0.3.1 | — | mentions | pending | 2023-02-01 |
| 0.567 | github_release | facebookresearch/Replica-Dataset v1.0 | — | mentions | pending | 2019-06-14 |
| 0.563 | github_release | facebookresearch/sound-spaces v0.1.1 | — | mentions | pending | 2021-02-22 |
Raw metadata
{
"nia": false,
"url": "https://www.youtube.com/watch?v=d__HRChE2ZE",
"mode": "THESIS",
"role": "Host",
"context": "There was no data ceiling. It was completely elucory. And I I think history will look back at this moment... Similarly, the internet which was collected by a bunch of fat fingers punching keyboards... was just the biological bootloadader for an era of synthetic data... We've reached escape velocity and now it's synthetic data from here on out.",
"verbatim": "There was no data ceiling. It was completely elucory... We we've reached orbit. We've reached escape velocity and now it's synthetic data from here on out.",
"conv_cues": "no data ceiling; escape velocity",
"direction": "HAPPEN",
"timeframe": "Ongoing",
"conv_level": "HIGH",
"milestones": [
{
"kind": "llm_pre_event",
"label": "Microsoft Research SynthLLM and arXiv 2510.01631 establish synthetic-data scaling laws",
"notes": "HIT — Microsoft Research published SynthLLM and 'Demystifying Synthetic Data in LLM Pre-training' establishing 33% synthetic / 67% natural as optimal mix.",
"source": "https://arxiv.org/abs/2510.01631",
"status": "hit",
"weight": 0.4,
"ordinal": -9,
"source_id": null,
"confidence": 0.95,
"source_url": "https://arxiv.org/abs/2510.01631",
"expected_date": "2025-10-31",
"observed_date": "2025-10-31",
"research_origin": "deep_research",
"measurement_criterion": "Peer-reviewed/preprint research establishes synthetic-data scaling laws for LLM pre-training with quantified mix ratios"
},
{
"kind": "quartile_checkpoint",
"label": "Q1 window check-in (25%)",
"status": "pending",
"weight": 0.05,
"ordinal": -8,
"source_id": null,
"expected_date": "2026-11-01",
"observed_date": null
},
{
"kind": "llm_pre_event",
"label": "Frontier lab publicly confirms majority synthetic data in latest model pre-training",
"notes": "Wissner-Gross's strong claim that pre-training is 'completely synthetic from here on out' — current research suggests 33% mix is near-optimal, not 100%.",
"source": "https://invisibletech.ai/blog/ai-training-in-2026-anchoring-synthetic-data-in-human-truth",
"status": "pending",
"weight": 0.4,
"ordinal": -7,
"source_id": null,
"confidence": 0.45,
"source_url": "https://invisibletech.ai/blog/ai-training-in-2026-anchoring-synthetic-data-in-human-truth",
"expected_date": "2027-03-17",
"research_origin": "deep_research",
"expected_date_range": {
"to": "2027-12-31",
"from": "2026-06-01"
},
"measurement_criterion": "OpenAI/Anthropic/Google DeepMind technical report or paper states >=50% of pre-training tokens for flagship model are synthetic"
},
{
"kind": "llm_pre_event",
"label": "Epoch AI confirms exhaustion of high-quality public text corpus",
"source": "https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data",
"status": "pending",
"weight": 0.4,
"ordinal": -6,
"source_id": null,
"confidence": 0.7,
"source_url": "https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data",
"expected_date": "2027-03-17",
"research_origin": "deep_research",
"expected_date_range": {
"to": "2027-12-31",
"from": "2026-06-01"
},
"measurement_criterion": "Epoch AI or equivalent research org publishes update confirming utilization of stock of human-generated public text"
},
{
"kind": "quartile_checkpoint",
"label": "Q2 window check-in (50%)",
"status": "pending",
"weight": 0.05,
"ordinal": -5,
"source_id": null,
"expected_date": "2027-05-05",
"observed_date": null
},
{
"kind": "llm_post_event",
"label": "Frontier labs continue licensi
... (truncated)