238_025 · prediction · AI · AI-timing

AI computer-use benchmarks (OSWorld, Tbench) have broken through human level

Predictor: Emad Mostaque · ep#238 "Meta Buys Moltbook, GPT 5.4, and Fruitfly Brain Upload | Moonshots Live at The Abundance Summit 238"

Prior probability: 55.0%
Current probability: 44.7% (evolves via intake + LBP)
Conviction: 5/5
Signal quality: C
Resolution: pending
Window: 2026-01-01 – 2026-11-30
Edges in / out: 8 / 5
Tickers exposed: 33

Prediction text

AI computer-use benchmarks (OSWorld, Tbench) have broken through human level | you've got the OS world verified and the Tathon benchmarks because OpenAI just bought OpenClaw. And now those benchmarks are actually just broken through human level. So AIs can use the computers better than humans.

Verbatim quote

From episode "Meta Buys Moltbook, GPT 5.4, and Fruitfly Brain Upload | Moonshots Live at The Abundance Summit 238"
you've got the OS world verified and the Tathon benchmarks because OpenAI just bought OpenClaw. And now those benchmarks are actually just broken through human level. So AIs can use the computers better than humans.

Predictor: Emad Mostaque

κ + Brier as of 2026-05-22
κ (discount): 0.722
Brier: 0.0073 (excellent)
Hits / Misses: 3 / 0 of 4 resolved
Hit rate: 75.0%
Calibration plot (stated vs observed)

Evidence about this node from Emad Mostaque is multiplied by κ in /api/intake. Lower κ = less weight; floors at 0.10 (effectively silenced) and caps at 1.00 (full weight).
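
The κ discount described above can be sketched as follows. This is a minimal illustration, not the actual /api/intake code: it assumes evidence is expressed as a log-likelihood ratio, and the names `clamp_kappa` and `discounted_evidence` are hypothetical.

```python
# Bounds stated on this page: kappa floors at 0.10 and caps at 1.00.
KAPPA_FLOOR = 0.10
KAPPA_CAP = 1.00


def clamp_kappa(kappa: float) -> float:
    """Keep a predictor's discount factor inside [0.10, 1.00]."""
    return max(KAPPA_FLOOR, min(KAPPA_CAP, kappa))


def discounted_evidence(llr: float, kappa: float) -> float:
    """Scale a log-likelihood ratio by the clamped kappa (assumed evidence form)."""
    return clamp_kappa(kappa) * llr


if __name__ == "__main__":
    # With kappa = 0.722, evidence keeps roughly 72% of its weight.
    print(discounted_evidence(-0.4, 0.722))
```

A predictor at the floor still contributes one tenth of the raw evidence, which is why the page describes 0.10 as "effectively silenced" rather than ignored.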

Reference class

Not linked

This node isn't linked to a reference class. The Bayesian update applies without outside-view blending.

Probability over time

4 prob_history rows
[Chart: probability over time. Y-axis 0% to 100%; x-axis 2026-04-30 to 2026-05-03; prior 55%; current = 44.7%. Legend: intake v2, milestone miss sweep, lbp propagation, reference class assigned, legacy v1, prior_prob (analyst seed).]

Milestone chain

Pre-event signals (upstream prereqs + window checkpoints) → resolution event → downstream cascades. Status/dates update from linked nodes; re-derive nightly via scripts/ops/derive_milestones.py.
Leading chain: 7 fired ✓ · 1 overdue ⏱
  1. 2026-02-28 · hit · Claude Opus 4.6 narrowly exceeds OSWorld human baseline at 72.7%
    How: Claude Opus 4.6 scores 72.7% on OSWorld vs 72.4% human-expert baseline, validating thesis crossover
    Source: https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents — Claude Opus 4.6 72.7% · conf 95%
  2. 2026-03-05 · hit · GPT-5.4 hits 75.0% on OSWorld — first clear superhuman margin
    How: GPT-5.4 scores >=75% on OSWorld, exceeding human baseline by clear margin (>=2pp)
    Source: https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents — GPT-5.4 75.0% · conf 95%
  3. 2026-04-15 · overdue · Stanford AI Index 2026 documents OSWorld 12% to 66.3% accuracy jump
    How: Stanford AI Index 2026 published showing OSWorld accuracy gain from ~12% to >=66% in 12 months
    Source: https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance — Stanford AI Index · conf 92%
  4. 2026-04-29 · hit · Holo3-35B-A3B leads OSWorld-Verified at 82.6%
    How: Holo3-35B-A3B scores 82.6% on OSWorld-Verified leaderboard — >10pp above human baseline
    Source: https://benchlm.ai/benchmarks/osWorldVerified — Holo3 leaderboard · conf 95%
  5. 2026-06-01 → 2026-12-31 · pending · Computer-use agent on OSWorld reaches 90% human-task efficiency
    How: Frontier computer-use agent achieves <=1.1x human-step count (currently 1.4x) at >=80% OSWorld accuracy
    Source: https://arxiv.org/abs/2506.16042 — OSWorld-Human efficiency benchmark · conf 55%
  6. 2026-08-01 → 2027-06-30 · pending · Cascade: Major enterprise SaaS deploys OSWorld-grade computer-use agent in production
    How: At least 3 Fortune-500 enterprises announce production-grade computer-use agent automating >=10% of knowledge worker tasks
    Source: Cascade from SOTA OSWorld scores driving enterprise rollout · conf 55%

What if this resolves?

Clamp this prediction TRUE or FALSE and run a counterfactual Gibbs sample. Surfaces the predictions whose marginals shift most under that assumption.
(live posterior: 45%)

Evidence chain

Every probability update with full Bayesian provenance — chronological, latest first
LBP · 2026-05-03T02:00:01Z · 44.7% · +1.9pp
Network propagation: 42.8% → 44.7%
6-iter LBP, residual 0.00677 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run 1a683ac9
metadata_milestone_miss_sweep · 2026-05-02T22:07:21Z · 42.8% · -6.7pp
metadata_milestone_miss_sweep bayesian_v2 n=1 inside=0.428 blend=0.428 LLR=-0.269 κ=0.72 no_blend
Raw metadata
{
  "trf": 0.6338685427014515,
  "kappa": 0.7222,
  "base_rate": null,
  "predictor": "Emad Mostaque",
  "total_llr": -0.4054651081081644,
  "grace_days": 7,
  "bayesian_v2": true,
  "prior_logit": -0.0201207213393815,
  "bayes_factor": "1.3:1 against",
  "blend_reason": "no reference_class linked",
  "inside_prior": 0.4949699893612384,
  "kappa_source": "predictor_table",
  "n_milestones": 1,
  "blend_applied": false,
  "contributions": [
    {
      "llr": -0.4054651081081644,
      "kind": "llm_pre_event",
      "kappa": 0.664424,
      "label": "Stanford AI Index 2026 documents OSWorld 12% to 66.3% accuracy jump",
      "weight": 0.4,
      "strength": "weak",
      "confidence": 0.92,
      "source_url": "https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance",
      "adjusted_llr": -0.26940074898965904,
      "expected_date": "2026-04-15",
      "measurement_criterion": "Stanford AI Index 2026 published showing OSWorld accuracy gain from ~12% to >=66% in 12 months"
    }
  ],
  "evidence_kind": "metadata_milestone_miss_sweep",
  "inside_source": "history_v2",
  "inside_weight": 0.5562920201089838,
  "outside_weight": 0.4437079798910162,
  "posterior_prob": 0.4281210230889299,
  "posterior_logit": -0.2895214703290405,
  "predictor_brier": 0.0073,
  "inside_posterior": 0.4281210230889299,
  "blended_posterior": 0.4281210230889299,
  "reference_class_id": null,
  "total_adjusted_llr": -0.26940074898965904,
  "predictor_n_resolved": 4
}
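
The figures in this record can be reproduced with a short log-odds update. The sketch below is a reading of the JSON fields, not the pipeline's actual code: it assumes the per-contribution `kappa` (0.664424) is the predictor κ (0.7222) times the milestone `confidence` (0.92), and that the adjusted LLR is added to the logit of `inside_prior`. Fields such as `weight` and `trf` are not needed to match these numbers, so their role is left open here.

```python
import math


def logit(p: float) -> float:
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Values copied from the raw metadata above.
inside_prior = 0.4949699893612384
llr = -0.4054651081081644        # equals ln(2/3): a missed pre-event milestone
predictor_kappa = 0.7222
confidence = 0.92

# Assumed composition of the per-contribution discount.
effective_kappa = predictor_kappa * confidence
adjusted_llr = effective_kappa * llr

posterior_logit = logit(inside_prior) + adjusted_llr
posterior_prob = sigmoid(posterior_logit)

if __name__ == "__main__":
    print(f"effective kappa = {effective_kappa:.6f}")
    print(f"adjusted LLR    = {adjusted_llr:.6f}")
    print(f"posterior logit = {posterior_logit:.6f}")
    print(f"posterior prob  = {posterior_prob:.6f}")
```

Under these assumptions the Bayes factor is exp(0.2694), about 1.3 to 1 against, which agrees with the `bayes_factor` field, and the posterior of about 0.428 agrees with the 42.8% shown for this sweep.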
LBP · 2026-04-30T16:39:51Z · 49.5% · -2.1pp
Network propagation: 51.6% → 49.5%
5-iter LBP, residual 0.00825 · damping 0.5, w_intrinsic 0.5 · method lbp_v2 · run 0c8a4ea3
LBP · 2026-04-30T02:18:57Z · 51.6% · -3.4pp
Network propagation: 55.0% → 51.6%
5-iter LBP, residual 0.00825 · damping 0.5, w_intrinsic 0.5 · method lbp_v1 · run 592311ef

Network propagation neighbors

Top edges sorted by latest LBP cross-impact

Top incoming (parents)

Edges that influence THIS node's belief

Kind · Node · Title · Predictor · Their prob · P(c|s=T) · P(c|s=F) · Δ implied
killer · TK03 · AI Regulatory Moratorium (EU/US Capability Freeze) · - · 10.0% · 0.050 · 0.550 · +0.053
killer · TK02 · AI Compute Supply Shock (TSMC/Taiwan Disruption) · - · 12.0% · 0.050 · 0.550 · +0.043
prereq · SEM_042 · 2025 will be the definitive year that agentic systems finall · Kevin Weil · 73.8% · 0.550 · 0.050 · -0.033
killer · TK01 · AGI Capability Plateau (2026-27 Training Stall) · - · 15.0% · 0.050 · 0.550 · +0.028
prereq · SEM_012 · Nvidia quadrupled chip production output while only doubling · Jensen Huang · 75.0% · 0.550 · 0.050 · -0.026
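
One way to read the "Δ implied" column is as the gap between this node's current probability and the probability implied by marginalizing over a single parent. The sketch below assumes P(c|s=T) and P(c|s=F) are the conditional probabilities of this node given the parent resolving true or false; the function name `implied_delta` is hypothetical. With the current probability of 44.7%, this reproduces the three killer rows to three decimals. The prereq rows differ by a few thousandths, so the page evidently applies something beyond this simple formula.

```python
def implied_delta(parent_prob: float, p_true: float, p_false: float,
                  child_prob: float) -> float:
    """Single-parent implied probability of the child minus its current value."""
    implied = parent_prob * p_true + (1.0 - parent_prob) * p_false
    return implied - child_prob


if __name__ == "__main__":
    current = 0.447  # this node's current probability
    rows = {
        "TK03": (0.10, 0.050, 0.550),
        "TK02": (0.12, 0.050, 0.550),
        "TK01": (0.15, 0.050, 0.550),
    }
    for node, (parent_prob, p_true, p_false) in rows.items():
        delta = implied_delta(parent_prob, p_true, p_false, current)
        print(f"{node}: {delta:+.3f}")
```

Because each killer is unlikely (10% to 15%), most of the weight falls on the P(c|s=F) = 0.550 branch, which pulls the implied value above the current 44.7% and gives the positive deltas.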

Top outgoing (children)

Predictions THIS node influences

Kind · Node · Title · Predictor · Their prob · P(c|s=T) · P(c|s=F) · Δ implied
prereq · 231_013 · Math is cooked (will be solved), physics cooked, biology cha · Alex Wissner-Gross · 35.4% · 0.620 · 0.050 · -0.045
prereq · CMQ_002 · By 2028, AI systems will reach 'independent researcher' leve · Sam Altman · 31.4% · 0.550 · 0.050 · -0.037
prereq · 241_043 · ASI will arrive within 2 years to 5 years to this next decad · Peter Diamandis · 35.9% · 0.650 · 0.050 · -0.036
prereq · 235_030 · Ray Kurzweil predicts Longevity Escape Velocity (LEV) by 203 · Ray Kurzweil · 39.2% · 0.750 · 0.050 · -0.025
prereq · 232_055 · We're exiting the industrial age permanently as recursive se · Peter Diamandis · 35.5% · 0.700 · 0.050 · -0.010

Ticker exposure

33 ticker(s) linked

Beneficiaries (23)

SOUN, CRWV, SITM, NVDA, ARM, GTLB, BBAI, TSM, APLD, CEVA, AI, MSFT, MRVL, SFTBY, ORCL, QCOM, AVGO, BABA, AMD, GOOGL, IBM, AMZN, META

Adverse (6)

WNS, CHGG, CTSH, IBM, INFY, ACN

Prerequisites (8)

Predictions that must hit first
Type · Pred · Title · Domain · Lag
prereq · SEM_008 · Training runs costing $10 billion for a single model will commence sometime in 2025. · AI · -
prereq · 238_009 · Recursive self-improvement is already happening now (no longer three years out) · AI · -
prereq · SEM_012 · Nvidia quadrupled chip production output while only doubling human headcount — achieved by deploying AI coding tools (Cursor, Claude Code) across engineering. · AI/Manufacturing · -
prereq · SEM_042 · 2025 will be the definitive year that agentic systems finally hit the mainstream. · AI/Agents · -
killer · TK14 · Superbubble Pop (S&P 500 -40%, Moonshot Capital Evaporates) · - · -
killer · TK01 · AGI Capability Plateau (2026-27 Training Stall) · - · -
killer · TK02 · AI Compute Supply Shock (TSMC/Taiwan Disruption) · - · -
killer · TK03 · AI Regulatory Moratorium (EU/US Capability Freeze) · - · -

Dependents (5)

Predictions enabled by this
Type · Pred · Title · Domain · Lag
prereq · 235_030 · Ray Kurzweil predicts Longevity Escape Velocity (LEV) by 2033. · Biotech/Longevity · -
prereq · 232_055 · We're exiting the industrial age permanently as recursive self-improvement unfolds. · AI · -
prereq · 241_043 · ASI will arrive within 2 years to 5 years to this next decade · AI · -
prereq · 231_013 · Math is cooked (will be solved), physics cooked, biology char broiled. · AI · -
prereq · CMQ_002 · By 2028, AI systems will reach 'independent researcher' level — driving autonomous scientific discoveries without human intervention. · AI · -

Linked documents (10)

Auto-generated by cosine similarity from Polymarket / Manifold / EDGAR / GDELT
Sim · Source · Title · Market prob · Polarity · Reviewed · Published
0.686 · arxiv · AcademiClaw: When Students Set Challenges for AI Agents · - · mentions · pending · 2026-05-04
0.677 · arxiv · Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies · - · mentions · pending · 2026-05-05
0.666 · arxiv · Offloading Score: Measuring AI Reliance Through Counterfactual Workflows · - · mentions · pending · 2026-05-28
0.645 · github_release · facebookresearch/neuroai v0.2.2 · - · mentions · pending · 2026-05-26
0.644 · arxiv · Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection · - · mentions · pending · 2026-06-04
0.635 · github_release · openai/openai-python v2.26.0 · - · mentions · pending · 2026-03-05
0.633 · github_release · tensorflow/tensorflow v2.16.1 · - · mentions · pending · 2024-03-07
0.627 · arxiv · PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing · - · mentions · pending · 2026-05-28
0.625 · github_release · tensorflow/tensorflow v2.16.0-rc0 · - · mentions · pending · 2024-02-26
0.624 · arxiv · A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation · - · mentions · pending · 2026-06-01

Raw metadata

From Thesis_Timeline_v1.0_FINAL workbook
{
  "nia": false,
  "url": "https://www.youtube.com/watch?v=d__HRChE2ZE",
  "mode": "THESIS",
  "role": "Host",
  "context": "you've got the OS world verified and the Tathon benchmarks because OpenAI just bought OpenClaw. And now those benchmarks are actually just broken through human level. So AIs can use the computers better than humans.",
  "to_year": 2026,
  "verbatim": "you've got the OS world verified and the Tathon benchmarks because OpenAI just bought OpenClaw. And now those benchmarks are actually just broken through human level. So AIs can use the computers better than humans.",
  "conv_cues": "broken through; better than humans",
  "direction": "HAPPEN",
  "from_year": 2026,
  "timeframe": "Now",
  "conv_level": "HIGH",
  "milestones": [
    {
      "kind": "llm_pre_event",
      "label": "Claude Opus 4.6 narrowly exceeds OSWorld human baseline at 72.7%",
      "source": "https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents — Claude Opus 4.6 72.7%",
      "status": "hit",
      "weight": 0.4,
      "ordinal": -8,
      "source_id": null,
      "confidence": 0.95,
      "source_url": "https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents",
      "expected_date": "2026-02-28",
      "observed_date": "2026-02-28",
      "research_origin": "deep_research",
      "measurement_criterion": "Claude Opus 4.6 scores 72.7% on OSWorld vs 72.4% human-expert baseline, validating thesis crossover"
    },
    {
      "kind": "llm_pre_event",
      "label": "GPT-5.4 hits 75.0% on OSWorld — first clear superhuman margin",
      "source": "https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents — GPT-5.4 75.0%",
      "status": "hit",
      "weight": 0.4,
      "ordinal": -7,
      "source_id": null,
      "confidence": 0.95,
      "source_url": "https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents",
      "expected_date": "2026-03-05",
      "observed_date": "2026-03-05",
      "research_origin": "deep_research",
      "measurement_criterion": "GPT-5.4 scores >=75% on OSWorld, exceeding human baseline by clear margin (>=2pp)"
    },
    {
      "kind": "llm_pre_event",
      "label": "Stanford AI Index 2026 documents OSWorld 12% to 66.3% accuracy jump",
      "source": "https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance — Stanford AI Index",
      "status": "overdue",
      "weight": 0.4,
      "ordinal": -6,
      "source_id": null,
      "confidence": 0.92,
      "source_url": "https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance",
      "expected_date": "2026-04-15",
      "miss_emitted_at": "2026-05-02T22:07:21.384228+00:00",
      "miss_emitted_by": "metadata_milestone_sweep",
      "research_origin": "deep_research",
      "measurement_criterion": "Stanford AI Index 2026 published showing OSWorld accuracy gain from ~12% to >=66% in 12 months"
    },
    {
      "kind": "prereq",
      "label": "Nvidia quadrupled chip production output while only doubling human headcount — achieved by deploying AI coding tools (Cursor, Claude Code) a",
      "status": "hit",
      "weight": 0.5,
      "ordinal": -5,
      "source_id": "SEM_012",
      "expected_date": "2026-04-29",
      "observed_date": "2026-04-29"
    },
    {
      "kind": "prereq",
      "label": "Training runs costing $10 billion for a single model will commence sometime in 2025.",
      "status": "hit",
      "weight": 0.5,
      "ordinal": -4,
      "source_id": "SEM_008",
      "expected_date": "2026-04-29",
      "observed_date": "2026-04-29"
    },
    {
      "kind": "prereq",
      "label": "2025 will be the definitive year that agentic systems finally hit the mainstream.",
      "status": "hit",
      "weight": 0.5,
      "ordinal": -3,
      "source_id": "SEM_042",
      "expected_date": "2026-04-29",
      "observed_date": "2026-04-29"
    },
    {
      "kind": "prereq",
      "label": "Recursive self-improvement is already happening now (no longer t
... (truncated)