238_025 · prediction · AI · AI-timing

AI computer-use benchmarks (OSWorld, Tbench) have broken through human level

Predictor: Emad Mostaque · ep#238 "Meta Buys Moltbook, GPT 5.4, and Fruitfly Brain Upload | Moonshots Live at The Abundance Summit 238" · source

Prior probability

55.0%

Current probability

44.7%

evolves via intake + LBP

Conviction

5/5

Signal quality

Resolution

pending

Window

2026-01-01 – 2026-11-30

Edges in / out

8 / 5

Tickers exposed

Prediction text

AI computer-use benchmarks (OSWorld, Tbench) have broken through human level | you've got the OS world verified and the Tathon benchmarks because OpenAI just bought OpenClaw. And now those benchmarks are actually just broken through human level. So AIs can use the computers better than humans.

Verbatim quote

From episode "Meta Buys Moltbook, GPT 5.4, and Fruitfly Brain Upload | Moonshots Live at The Abundance Summit 238"

you've got the OS world verified and the Tathon benchmarks because OpenAI just bought OpenClaw. And now those benchmarks are actually just broken through human level. So AIs can use the computers better than humans.

Predictor: Emad Mostaque

κ + Brier as of 2026-05-22

κ (discount)

0.722

Brier

0.0073

excellent

Hits / Misses

3 / 0

of 4 resolved

Hit rate

75.0%

Calibration plot (stated vs observed)

Evidence about this node from Emad Mostaque is multiplied by κ in /api/intake. A lower κ means less weight; κ is floored at 0.10 (effectively silenced) and capped at 1.00 (full weight).
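
The κ weighting described above can be sketched in a few lines. This is a minimal illustration, not the /api/intake implementation: the names `clamp_kappa` and `weight_evidence` are hypothetical, and treating a piece of evidence as a log-likelihood ratio that is scaled by κ is an assumption based on the description.

```python
KAPPA_FLOOR = 0.10  # stated floor: evidence is effectively silenced
KAPPA_CAP = 1.00    # stated cap: evidence counts at full weight


def clamp_kappa(kappa: float) -> float:
    """Clamp a predictor discount into the stated [0.10, 1.00] range."""
    return max(KAPPA_FLOOR, min(KAPPA_CAP, kappa))


def weight_evidence(llr: float, kappa: float) -> float:
    """Scale a log-likelihood ratio by the clamped discount (assumed form)."""
    return clamp_kappa(kappa) * llr


# Hypothetical inputs: a predictor at kappa = 0.722 and one far below the floor.
print(weight_evidence(-0.4055, 0.722))
print(weight_evidence(-0.4055, 0.01))
```

Under this sketch a predictor can never be discounted to zero: even a very low κ still passes one tenth of the evidence through.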

Reference class

Not linked

This node isn't linked to a reference class. The Bayesian update applies without outside-view blending.

Probability over time

4 prob_history rows

Legend: intake v2 · milestone miss sweep · lbp propagation · reference class assigned · legacy v1 · prior_prob (analyst seed) · current = 44.7%

Milestone chain

Pre-event signals (upstream prereqs + window checkpoints) → resolution event → downstream cascades. Status/dates update from linked nodes; re-derive nightly via scripts/ops/derive_milestones.py.

Leading chain: 7 fired ✓ · 1 overdue ⏱

2026-02-28 · hit · Claude Opus 4.6 narrowly exceeds OSWorld human baseline at 72.7%
How: Claude Opus 4.6 scores 72.7% on OSWorld vs 72.4% human-expert baseline, validating thesis crossover
Source: https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents — Claude Opus 4.6 72.7% · conf 95%
2026-03-05 · hit · GPT-5.4 hits 75.0% on OSWorld — first clear superhuman margin
How: GPT-5.4 scores >=75% on OSWorld, exceeding human baseline by clear margin (>=2pp)
Source: https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents — GPT-5.4 75.0% · conf 95%
2026-04-15 · overdue · Stanford AI Index 2026 documents OSWorld 12% to 66.3% accuracy jump
How: Stanford AI Index 2026 published showing OSWorld accuracy gain from ~12% to >=66% in 12 months
Source: https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance — Stanford AI Index · conf 92%
2026-04-29 · hit · Nvidia quadrupled chip production output while only doubling human headcount — achieved by deploying AI coding tools (Cursor, Claude Code) a
2026-04-29 · hit · Training runs costing $10 billion for a single model will commence sometime in 2025.
2026-04-29 · hit · 2025 will be the definitive year that agentic systems finally hit the mainstream.
2026-04-29 · hit · Recursive self-improvement is already happening now (no longer three years out)
2026-04-29 · hit · Holo3-35B-A3B leads OSWorld-Verified at 82.6%
How: Holo3-35B-A3B scores 82.6% on OSWorld-Verified leaderboard — >10pp above human baseline
Source: https://benchlm.ai/benchmarks/osWorldVerified — Holo3 leaderboard · conf 95%
2026-08-05 · pending · AI computer-use benchmarks (OSWorld, Tbench) have broken through human level
2026-06-01 → 2026-12-31 · pending · Computer-use agent on OSWorld reaches 90% human-task efficiency
How: Frontier computer-use agent achieves <=1.1x human-step count (currently 1.4x) at >=80% OSWorld accuracy
Source: https://arxiv.org/abs/2506.16042 — OSWorld-Human efficiency benchmark · conf 55%
2026-08-01 → 2027-06-30 · pending · Cascade: Major enterprise SaaS deploys OSWorld-grade computer-use agent in production
How: At least 3 Fortune-500 enterprises announce production-grade computer-use agent automating >=10% of knowledge worker tasks
Source: Cascade from SOTA OSWorld scores driving enterprise rollout · conf 55%
2027-06-26 · pending · Math is cooked (will be solved), physics cooked, biology char broiled.
2028-06-25 · pending · We're exiting the industrial age permanently as recursive self-improvement unfolds.
2028-09-07 · pending · By 2028, AI systems will reach 'independent researcher' level — driving autonomous scientific discoveries without human intervention.
2033-07-30 · pending · Ray Kurzweil predicts Longevity Escape Velocity (LEV) by 2033.
2033-08-10 · pending · ASI will arrive within 2 years to 5 years to this next decade

What if this resolves?

Clamp this prediction TRUE or FALSE and run a counterfactual Gibbs sample. Surfaces the predictions whose marginals shift most under that assumption.

(live posterior: 45%)

Each run takes about 30 seconds and is suited to stress-testing "if X resolves, what else moves?"

Evidence chain

Every probability update with full Bayesian provenance — chronological, latest first

LBP · 2026-05-03T02:00:01Z · 44.7% · +1.9pp

Network propagation: 42.8% → 44.7%

6-iter LBP, residual 0.00677 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run 1a683ac9

metadata_milestone_miss_sweep · 2026-05-02T22:07:21Z · 42.8% · -6.7pp

metadata_milestone_miss_sweep bayesian_v2 n=1 inside=0.428 blend=0.428 LLR=-0.269 κ=0.72 no_blend

Raw metadata

{
  "trf": 0.6338685427014515,
  "kappa": 0.7222,
  "base_rate": null,
  "predictor": "Emad Mostaque",
  "total_llr": -0.4054651081081644,
  "grace_days": 7,
  "bayesian_v2": true,
  "prior_logit": -0.0201207213393815,
  "bayes_factor": "1.3:1 against",
  "blend_reason": "no reference_class linked",
  "inside_prior": 0.4949699893612384,
  "kappa_source": "predictor_table",
  "n_milestones": 1,
  "blend_applied": false,
  "contributions": [
    {
      "llr": -0.4054651081081644,
      "kind": "llm_pre_event",
      "kappa": 0.664424,
      "label": "Stanford AI Index 2026 documents OSWorld 12% to 66.3% accuracy jump",
      "weight": 0.4,
      "strength": "weak",
      "confidence": 0.92,
      "source_url": "https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance",
      "adjusted_llr": -0.26940074898965904,
      "expected_date": "2026-04-15",
      "measurement_criterion": "Stanford AI Index 2026 published showing OSWorld accuracy gain from ~12% to >=66% in 12 months"
    }
  ],
  "evidence_kind": "metadata_milestone_miss_sweep",
  "inside_source": "history_v2",
  "inside_weight": 0.5562920201089838,
  "outside_weight": 0.4437079798910162,
  "posterior_prob": 0.4281210230889299,
  "posterior_logit": -0.2895214703290405,
  "predictor_brier": 0.0073,
  "inside_posterior": 0.4281210230889299,
  "blended_posterior": 0.4281210230889299,
  "reference_class_id": null,
  "total_adjusted_llr": -0.26940074898965904,
  "predictor_n_resolved": 4
}
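
The numbers in this metadata block can be reproduced with a short logit-space update. The sketch below is a reading of the fields above, not the system's own code: it assumes the per-contribution `kappa` (0.664424) is the predictor κ (0.7222) times the milestone `confidence` (0.92), and that the posterior logit is the prior logit plus the adjusted LLR. Both assumptions agree with the stored values.

```python
import math


def sigmoid(x: float) -> float:
    """Map a logit back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Values copied from the raw metadata above.
prior_logit = -0.0201207213393815
llr = -0.4054651081081644
predictor_kappa = 0.7222
confidence = 0.92

# Assumption: the contribution-level kappa is predictor kappa x confidence.
effective_kappa = predictor_kappa * confidence

# Discount the evidence, then add it to the prior in logit space.
adjusted_llr = effective_kappa * llr
posterior_logit = prior_logit + adjusted_llr
posterior_prob = sigmoid(posterior_logit)

# Odds against implied by the discounted evidence alone.
bayes_factor_against = math.exp(-adjusted_llr)

print(f"inside_prior    = {sigmoid(prior_logit):.4f}")
print(f"adjusted_llr    = {adjusted_llr:.4f}")
print(f"posterior_prob  = {posterior_prob:.4f}")
print(f"bayes_factor    = {bayes_factor_against:.1f}:1 against")
```

Because no reference class is linked, `blend_applied` is false and the blended posterior equals this inside posterior.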

LBP · 2026-04-30T16:39:51Z · 49.5% · -2.1pp

Network propagation: 51.6% → 49.5%

5-iter LBP, residual 0.00825 · damping 0.5, w_intrinsic 0.5 · method lbp_v2 · run 0c8a4ea3

LBP · 2026-04-30T02:18:57Z · 51.6% · -3.4pp

Network propagation: 55.0% → 51.6%

5-iter LBP, residual 0.00825 · damping 0.5, w_intrinsic 0.5 · method lbp_v1 · run 592311ef

Network propagation neighbors

Top edges sorted by latest LBP cross-impact

Top incoming (parents)

Edges that influence THIS node's belief

Kind	Node	Their prob	P(c|s=T)	P(c|s=F)	Δ implied
killer	TK03 AI Regulatory Moratorium (EU/US Capability Freeze)	10.0%	0.050	0.550	+0.053
killer	TK02 AI Compute Supply Shock (TSMC/Taiwan Disruption)	12.0%	0.050	0.550	+0.043
prereq	SEM_042 2025 will be the definitive year that agentic systems finall — Kevin Weil	73.8%	0.550	0.050	-0.033
killer	TK01 AGI Capability Plateau (2026-27 Training Stall)	15.0%	0.050	0.550	+0.028
prereq	SEM_012 Nvidia quadrupled chip production output while only doubling — Jensen Huang	75.0%	0.550	0.050	-0.026
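
One way to read the Δ implied column is as a law-of-total-probability calculation: combine the parent's probability with the two conditionals to get an implied belief for this node, then subtract the current 44.7%. This is an inference from the table, not something the page states, and `implied_child_prob` is a hypothetical name. The three killer rows reproduce under this reading; the two prereq rows come out a few thousandths away from the displayed values, so the actual computation may differ.

```python
def implied_child_prob(p_parent: float,
                       p_child_given_true: float,
                       p_child_given_false: float) -> float:
    """P(c) = P(s) * P(c|s=T) + (1 - P(s)) * P(c|s=F)."""
    return p_parent * p_child_given_true + (1.0 - p_parent) * p_child_given_false


CURRENT_PROB = 0.447  # this node's current probability from the page

# (parent probability, P(c|s=T), P(c|s=F)) for the three killer rows above.
killer_rows = {
    "TK03": (0.10, 0.050, 0.550),
    "TK02": (0.12, 0.050, 0.550),
    "TK01": (0.15, 0.050, 0.550),
}

deltas = {
    name: implied_child_prob(*row) - CURRENT_PROB
    for name, row in killer_rows.items()
}

for name, delta in deltas.items():
    print(f"{name}: implied delta = {delta:+.3f}")
```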

Top outgoing (children)

Predictions THIS node influences

Kind	Node	Their prob	P(c|s=T)	P(c|s=F)	Δ implied
prereq	231_013 Math is cooked (will be solved), physics cooked, biology cha — Alex Wissner-Gross	35.4%	0.620	0.050	-0.045
prereq	CMQ_002 By 2028, AI systems will reach 'independent researcher' leve — Sam Altman	31.4%	0.550	0.050	-0.037
prereq	241_043 ASI will arrive within 2 years to 5 years to this next decad — Peter Diamandis	35.9%	0.650	0.050	-0.036
prereq	235_030 Ray Kurzweil predicts Longevity Escape Velocity (LEV) by 203 — Ray Kurzweil	39.2%	0.750	0.050	-0.025
prereq	232_055 We're exiting the industrial age permanently as recursive se — Peter Diamandis	35.5%	0.700	0.050	-0.010

Ticker exposure

33 ticker(s) linked

Beneficiaries (23)

SOUN CRWV SITM NVDA ARM GTLB BBAI TSM APLD CEVA AI MSFT MRVL SFTBY ORCL QCOM AVGO BABA AMD GOOGL IBM AMZN META

Adverse (6)

WNS CHGG CTSH IBM INFY ACN

Prerequisites (8)

Predictions that must hit first

Type	Pred	Title	Domain	Lag
prereq	SEM_008	Training runs costing $10 billion for a single model will commence sometime in 2025.	AI	—
prereq	238_009	Recursive self-improvement is already happening now (no longer three years out)	AI	—
prereq	SEM_012	Nvidia quadrupled chip production output while only doubling human headcount — achieved by deploying AI coding tools (Cursor, Claude Code) across engineering.	AI/Manufacturing	—
prereq	SEM_042	2025 will be the definitive year that agentic systems finally hit the mainstream.	AI/Agents	—
killer	TK14	Superbubble Pop (S&P 500 -40%, Moonshot Capital Evaporates)	—	—
killer	TK01	AGI Capability Plateau (2026-27 Training Stall)	—	—
killer	TK02	AI Compute Supply Shock (TSMC/Taiwan Disruption)	—	—
killer	TK03	AI Regulatory Moratorium (EU/US Capability Freeze)	—	—

Dependents (5)

Predictions enabled by this

Type	Pred	Title	Domain	Lag
prereq	235_030	Ray Kurzweil predicts Longevity Escape Velocity (LEV) by 2033.	Biotech/Longevity	—
prereq	232_055	We're exiting the industrial age permanently as recursive self-improvement unfolds.	AI	—
prereq	241_043	ASI will arrive within 2 years to 5 years to this next decade	AI	—
prereq	231_013	Math is cooked (will be solved), physics cooked, biology char broiled.	AI	—
prereq	CMQ_002	By 2028, AI systems will reach 'independent researcher' level — driving autonomous scientific discoveries without human intervention.	AI	—

Linked documents (10)

Auto-generated by cosine similarity from Polymarket / Manifold / EDGAR / GDELT

Sim	Source	Title	Market prob	Polarity	Reviewed	Published
0.686	arxiv	AcademiClaw: When Students Set Challenges for AI Agents	—	mentions	pending	2026-05-04
0.677	arxiv	Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies	—	mentions	pending	2026-05-05
0.666	arxiv	Offloading Score: Measuring AI Reliance Through Counterfactual Workflows	—	mentions	pending	2026-05-28
0.645	github_release	facebookresearch/neuroai v0.2.2	—	mentions	pending	2026-05-26
0.644	arxiv	Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection	—	mentions	pending	2026-06-04
0.635	github_release	openai/openai-python v2.26.0	—	mentions	pending	2026-03-05
0.633	github_release	tensorflow/tensorflow v2.16.1	—	mentions	pending	2024-03-07
0.627	arxiv	PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing	—	mentions	pending	2026-05-28
0.625	github_release	tensorflow/tensorflow v2.16.0-rc0	—	mentions	pending	2024-02-26
0.624	arxiv	A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation	—	mentions	pending	2026-06-01

Raw metadata

From Thesis_Timeline_v1.0_FINAL workbook

{
  "nia": false,
  "url": "https://www.youtube.com/watch?v=d__HRChE2ZE",
  "mode": "THESIS",
  "role": "Host",
  "context": "you've got the OS world verified and the Tathon benchmarks because OpenAI just bought OpenClaw. And now those benchmarks are actually just broken through human level. So AIs can use the computers better than humans.",
  "to_year": 2026,
  "verbatim": "you've got the OS world verified and the Tathon benchmarks because OpenAI just bought OpenClaw. And now those benchmarks are actually just broken through human level. So AIs can use the computers better than humans.",
  "conv_cues": "broken through; better than humans",
  "direction": "HAPPEN",
  "from_year": 2026,
  "timeframe": "Now",
  "conv_level": "HIGH",
  "milestones": [
    {
      "kind": "llm_pre_event",
      "label": "Claude Opus 4.6 narrowly exceeds OSWorld human baseline at 72.7%",
      "source": "https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents — Claude Opus 4.6 72.7%",
      "status": "hit",
      "weight": 0.4,
      "ordinal": -8,
      "source_id": null,
      "confidence": 0.95,
      "source_url": "https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents",
      "expected_date": "2026-02-28",
      "observed_date": "2026-02-28",
      "research_origin": "deep_research",
      "measurement_criterion": "Claude Opus 4.6 scores 72.7% on OSWorld vs 72.4% human-expert baseline, validating thesis crossover"
    },
    {
      "kind": "llm_pre_event",
      "label": "GPT-5.4 hits 75.0% on OSWorld — first clear superhuman margin",
      "source": "https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents — GPT-5.4 75.0%",
      "status": "hit",
      "weight": 0.4,
      "ordinal": -7,
      "source_id": null,
      "confidence": 0.95,
      "source_url": "https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents",
      "expected_date": "2026-03-05",
      "observed_date": "2026-03-05",
      "research_origin": "deep_research",
      "measurement_criterion": "GPT-5.4 scores >=75% on OSWorld, exceeding human baseline by clear margin (>=2pp)"
    },
    {
      "kind": "llm_pre_event",
      "label": "Stanford AI Index 2026 documents OSWorld 12% to 66.3% accuracy jump",
      "source": "https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance — Stanford AI Index",
      "status": "overdue",
      "weight": 0.4,
      "ordinal": -6,
      "source_id": null,
      "confidence": 0.92,
      "source_url": "https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance",
      "expected_date": "2026-04-15",
      "miss_emitted_at": "2026-05-02T22:07:21.384228+00:00",
      "miss_emitted_by": "metadata_milestone_sweep",
      "research_origin": "deep_research",
      "measurement_criterion": "Stanford AI Index 2026 published showing OSWorld accuracy gain from ~12% to >=66% in 12 months"
    },
    {
      "kind": "prereq",
      "label": "Nvidia quadrupled chip production output while only doubling human headcount — achieved by deploying AI coding tools (Cursor, Claude Code) a",
      "status": "hit",
      "weight": 0.5,
      "ordinal": -5,
      "source_id": "SEM_012",
      "expected_date": "2026-04-29",
      "observed_date": "2026-04-29"
    },
    {
      "kind": "prereq",
      "label": "Training runs costing $10 billion for a single model will commence sometime in 2025.",
      "status": "hit",
      "weight": 0.5,
      "ordinal": -4,
      "source_id": "SEM_008",
      "expected_date": "2026-04-29",
      "observed_date": "2026-04-29"
    },
    {
      "kind": "prereq",
      "label": "2025 will be the definitive year that agentic systems finally hit the mainstream.",
      "status": "hit",
      "weight": 0.5,
      "ordinal": -3,
      "source_id": "SEM_042",
      "expected_date": "2026-04-29",
      "observed_date": "2026-04-29"
    },
    {
      "kind": "prereq",
      "label": "Recursive self-improvement is already happening now (no longer t
... (truncated)