AI computer-use benchmarks (OSWorld, Tbench) have broken through human level
Predictor: Emad Mostaque · ep#238 "Meta Buys Moltbook, GPT 5.4, and Fruitfly Brain Upload | Moonshots Live at The Abundance Summit 238"
Verbatim quote
you've got the OS world verified and the Tathon benchmarks because OpenAI just bought OpenClaw. And now those benchmarks are actually just broken through human level. So AIs can use the computers better than humans.
Predictor: Emad Mostaque
Calibration plot (stated vs observed)
Evidence about this node from Emad Mostaque is multiplied by κ in /api/intake. A lower κ means less weight: κ is floored at 0.10 (effectively silenced) and capped at 1.00 (full weight).
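A minimal Python sketch of the κ weighting described above. The floor and cap come from the text; the function names, and the assumption that κ scales a log-likelihood ratio (LLR), are illustrative and not taken from /api/intake. The sample values are copied from the "contributions" entry in this node's raw metadata, where the per-contribution κ of 0.664424 equals the predictor κ (0.7222) times the milestone confidence (0.92).

```python
KAPPA_FLOOR = 0.10  # "effectively silenced"
KAPPA_CAP = 1.00    # "full weight"


def clamp_kappa(kappa: float) -> float:
    """Keep kappa inside the [0.10, 1.00] range stated in the text."""
    return max(KAPPA_FLOOR, min(KAPPA_CAP, kappa))


def weight_evidence(llr: float, kappa: float) -> float:
    """Scale a log-likelihood ratio by the clamped kappa (assumed mechanism)."""
    return clamp_kappa(kappa) * llr


# Values from the raw metadata for this node.
predictor_kappa = 0.7222
milestone_confidence = 0.92
contribution_kappa = predictor_kappa * milestone_confidence  # about 0.664424
raw_llr = -0.4054651081081644

adjusted_llr = weight_evidence(raw_llr, contribution_kappa)
# The metadata reports adjusted_llr = -0.26940074898965904 for this contribution.
```

Clamping before multiplying means a predictor with a very poor track record still contributes a tenth of the raw evidence rather than none.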
Reference class
This node isn't linked to a reference class. The Bayesian update applies without outside-view blending.
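With no reference class linked, the figures in this node's raw metadata are consistent with a plain log-odds update: the posterior logit is the prior logit plus the sum of κ-adjusted log-likelihood ratios. The sketch below works under that assumption; the function names are hypothetical, and only the numeric inputs come from the metadata.

```python
import math


def sigmoid(x: float) -> float:
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def bayes_update(prior_logit: float, adjusted_llrs: list[float]) -> tuple[float, float]:
    """Add adjusted LLRs to the prior log-odds; return (posterior_logit, posterior_prob)."""
    posterior_logit = prior_logit + sum(adjusted_llrs)
    return posterior_logit, sigmoid(posterior_logit)


# Inputs copied from the raw metadata ("prior_logit", "total_adjusted_llr").
prior_logit = -0.0201207213393815
adjusted_llrs = [-0.26940074898965904]

posterior_logit, posterior_prob = bayes_update(prior_logit, adjusted_llrs)
inside_prior = sigmoid(prior_logit)

# Odds against implied by the adjusted evidence; the metadata reports "1.3:1 against".
odds_against = math.exp(-sum(adjusted_llrs))

# Reported values for comparison:
#   inside_prior    0.4949699893612384
#   posterior_logit -0.2895214703290405
#   posterior_prob  0.4281210230889299
```

Because "blend_applied" is false, the blended posterior in the metadata equals this inside-view posterior.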
Probability over time
Milestone chain
- 2026-02-28 · hit · Claude Opus 4.6 narrowly exceeds OSWorld human baseline at 72.7%
  - How: Claude Opus 4.6 scores 72.7% on OSWorld vs 72.4% human-expert baseline, validating thesis crossover
  - Source: https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents (Claude Opus 4.6 72.7%)
  - Confidence: 95%
- 2026-03-05 · hit · GPT-5.4 hits 75.0% on OSWorld, the first clear superhuman margin
  - How: GPT-5.4 scores >=75% on OSWorld, exceeding human baseline by clear margin (>=2pp)
  - Source: https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents (GPT-5.4 75.0%)
  - Confidence: 95%
- 2026-04-15 · overdue · Stanford AI Index 2026 documents OSWorld 12% to 66.3% accuracy jump
  - How: Stanford AI Index 2026 published showing OSWorld accuracy gain from ~12% to >=66% in 12 months
  - Source: https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance (Stanford AI Index)
  - Confidence: 92%
- 2026-04-29 · hit · Holo3-35B-A3B leads OSWorld-Verified at 82.6%
  - How: Holo3-35B-A3B scores 82.6% on OSWorld-Verified leaderboard, >10pp above human baseline
  - Source: https://benchlm.ai/benchmarks/osWorldVerified (Holo3 leaderboard)
  - Confidence: 95%
- 2026-06-01 → 2026-12-31 · pending · Computer-use agent on OSWorld reaches 90% human-task efficiency
  - How: Frontier computer-use agent achieves <=1.1x human-step count (currently 1.4x) at >=80% OSWorld accuracy
  - Source: https://arxiv.org/abs/2506.16042 (OSWorld-Human efficiency benchmark)
  - Confidence: 55%
- 2026-08-01 → 2027-06-30 · pending · Cascade: Major enterprise SaaS deploys OSWorld-grade computer-use agent in production
  - How: At least 3 Fortune-500 enterprises announce production-grade computer-use agent automating >=10% of knowledge worker tasks
  - Source: Cascade from SOTA OSWorld scores driving enterprise rollout
  - Confidence: 55%
Evidence chain
Raw metadata
{
"trf": 0.6338685427014515,
"kappa": 0.7222,
"base_rate": null,
"predictor": "Emad Mostaque",
"total_llr": -0.4054651081081644,
"grace_days": 7,
"bayesian_v2": true,
"prior_logit": -0.0201207213393815,
"bayes_factor": "1.3:1 against",
"blend_reason": "no reference_class linked",
"inside_prior": 0.4949699893612384,
"kappa_source": "predictor_table",
"n_milestones": 1,
"blend_applied": false,
"contributions": [
{
"llr": -0.4054651081081644,
"kind": "llm_pre_event",
"kappa": 0.664424,
"label": "Stanford AI Index 2026 documents OSWorld 12% to 66.3% accuracy jump",
"weight": 0.4,
"strength": "weak",
"confidence": 0.92,
"source_url": "https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance",
"adjusted_llr": -0.26940074898965904,
"expected_date": "2026-04-15",
"measurement_criterion": "Stanford AI Index 2026 published showing OSWorld accuracy gain from ~12% to >=66% in 12 months"
}
],
"evidence_kind": "metadata_milestone_miss_sweep",
"inside_source": "history_v2",
"inside_weight": 0.5562920201089838,
"outside_weight": 0.4437079798910162,
"posterior_prob": 0.4281210230889299,
"posterior_logit": -0.2895214703290405,
"predictor_brier": 0.0073,
"inside_posterior": 0.4281210230889299,
"blended_posterior": 0.4281210230889299,
"reference_class_id": null,
"total_adjusted_llr": -0.26940074898965904,
"predictor_n_resolved": 4
}
Network propagation neighbors
Top incoming (parents)
Edges that influence THIS node's belief
| Kind | Node | Their prob | P(c|s=T) | P(c|s=F) | Δ implied |
|---|---|---|---|---|---|
| killer | TK03 AI Regulatory Moratorium (EU/US Capability Freeze) | 10.0% | 0.050 | 0.550 | +0.053 |
| killer | TK02 AI Compute Supply Shock (TSMC/Taiwan Disruption) | 12.0% | 0.050 | 0.550 | +0.043 |
| prereq | SEM_042 2025 will be the definitive year that agentic systems finall — Kevin Weil | 73.8% | 0.550 | 0.050 | -0.033 |
| killer | TK01 AGI Capability Plateau (2026-27 Training Stall) | 15.0% | 0.050 | 0.550 | +0.028 |
| prereq | SEM_012 Nvidia quadrupled chip production output while only doubling — Jensen Huang | 75.0% | 0.550 | 0.050 | -0.026 |
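One way to read a row of this table, assuming P(c|s=T) and P(c|s=F) are the probabilities of this node (c) given that the listed parent (s) is true or false, is through the law of total probability. The sketch below computes the marginal that a single edge would imply on its own. It is an illustration of that reading only: it does not reproduce the "Δ implied" column, whose formula is not stated on this page, and the function name is hypothetical.

```python
def edge_marginal(p_parent: float, p_c_given_true: float, p_c_given_false: float) -> float:
    """P(c) = P(c|s=T) * P(s) + P(c|s=F) * (1 - P(s)) for one parent edge."""
    return p_c_given_true * p_parent + p_c_given_false * (1.0 - p_parent)


# Rows copied from the table above: (kind, parent probability, P(c|s=T), P(c|s=F)).
rows = {
    "TK03": ("killer", 0.100, 0.050, 0.550),
    "TK02": ("killer", 0.120, 0.050, 0.550),
    "SEM_042": ("prereq", 0.738, 0.550, 0.050),
    "TK01": ("killer", 0.150, 0.050, 0.550),
    "SEM_012": ("prereq", 0.750, 0.550, 0.050),
}

marginals = {
    node: edge_marginal(p_s, p_true, p_false)
    for node, (_kind, p_s, p_true, p_false) in rows.items()
}

for node, value in marginals.items():
    print(f"{node}: {value:.3f}")
```

Under this reading, a "killer" edge lowers the node's probability when the parent is true, while a "prereq" edge raises it, which matches the direction of the conditional values in the table.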
Top outgoing (children)
Predictions THIS node influences
| Kind | Node | Their prob | P(c|s=T) | P(c|s=F) | Δ implied |
|---|---|---|---|---|---|
| prereq | 231_013 Math is cooked (will be solved), physics cooked, biology cha — Alex Wissner-Gross | 35.4% | 0.620 | 0.050 | -0.045 |
| prereq | CMQ_002 By 2028, AI systems will reach 'independent researcher' leve — Sam Altman | 31.4% | 0.550 | 0.050 | -0.037 |
| prereq | 241_043 ASI will arrive within 2 years to 5 years to this next decad — Peter Diamandis | 35.9% | 0.650 | 0.050 | -0.036 |
| prereq | 235_030 Ray Kurzweil predicts Longevity Escape Velocity (LEV) by 203 — Ray Kurzweil | 39.2% | 0.750 | 0.050 | -0.025 |
| prereq | 232_055 We're exiting the industrial age permanently as recursive se — Peter Diamandis | 35.5% | 0.700 | 0.050 | -0.010 |
Ticker exposure
Beneficiaries (23)
Adverse (6)
Prerequisites (8)
| Type | Pred | Title | Domain | Lag |
|---|---|---|---|---|
| prereq | SEM_008 | Training runs costing $10 billion for a single model will commence sometime in 2025. | AI | — |
| prereq | 238_009 | Recursive self-improvement is already happening now (no longer three years out) | AI | — |
| prereq | SEM_012 | Nvidia quadrupled chip production output while only doubling human headcount — achieved by deploying AI coding tools (Cursor, Claude Code) across engineering. | AI/Manufacturing | — |
| prereq | SEM_042 | 2025 will be the definitive year that agentic systems finally hit the mainstream. | AI/Agents | — |
| killer | TK14 | Superbubble Pop (S&P 500 -40%, Moonshot Capital Evaporates) | — | — |
| killer | TK01 | AGI Capability Plateau (2026-27 Training Stall) | — | — |
| killer | TK02 | AI Compute Supply Shock (TSMC/Taiwan Disruption) | — | — |
| killer | TK03 | AI Regulatory Moratorium (EU/US Capability Freeze) | — | — |
Dependents (5)
| Type | Pred | Title | Domain | Lag |
|---|---|---|---|---|
| prereq | 235_030 | Ray Kurzweil predicts Longevity Escape Velocity (LEV) by 2033. | Biotech/Longevity | — |
| prereq | 232_055 | We're exiting the industrial age permanently as recursive self-improvement unfolds. | AI | — |
| prereq | 241_043 | ASI will arrive within 2 years to 5 years to this next decade | AI | — |
| prereq | 231_013 | Math is cooked (will be solved), physics cooked, biology char broiled. | AI | — |
| prereq | CMQ_002 | By 2028, AI systems will reach 'independent researcher' level — driving autonomous scientific discoveries without human intervention. | AI | — |
Linked documents (10)
Raw metadata
{
"nia": false,
"url": "https://www.youtube.com/watch?v=d__HRChE2ZE",
"mode": "THESIS",
"role": "Host",
"context": "you've got the OS world verified and the Tathon benchmarks because OpenAI just bought OpenClaw. And now those benchmarks are actually just broken through human level. So AIs can use the computers better than humans.",
"to_year": 2026,
"verbatim": "you've got the OS world verified and the Tathon benchmarks because OpenAI just bought OpenClaw. And now those benchmarks are actually just broken through human level. So AIs can use the computers better than humans.",
"conv_cues": "broken through; better than humans",
"direction": "HAPPEN",
"from_year": 2026,
"timeframe": "Now",
"conv_level": "HIGH",
"milestones": [
{
"kind": "llm_pre_event",
"label": "Claude Opus 4.6 narrowly exceeds OSWorld human baseline at 72.7%",
"source": "https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents — Claude Opus 4.6 72.7%",
"status": "hit",
"weight": 0.4,
"ordinal": -8,
"source_id": null,
"confidence": 0.95,
"source_url": "https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents",
"expected_date": "2026-02-28",
"observed_date": "2026-02-28",
"research_origin": "deep_research",
"measurement_criterion": "Claude Opus 4.6 scores 72.7% on OSWorld vs 72.4% human-expert baseline, validating thesis crossover"
},
{
"kind": "llm_pre_event",
"label": "GPT-5.4 hits 75.0% on OSWorld — first clear superhuman margin",
"source": "https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents — GPT-5.4 75.0%",
"status": "hit",
"weight": 0.4,
"ordinal": -7,
"source_id": null,
"confidence": 0.95,
"source_url": "https://nerdleveltech.com/gpt-5-4-beats-humans-computer-use-ai-agents",
"expected_date": "2026-03-05",
"observed_date": "2026-03-05",
"research_origin": "deep_research",
"measurement_criterion": "GPT-5.4 scores >=75% on OSWorld, exceeding human baseline by clear margin (>=2pp)"
},
{
"kind": "llm_pre_event",
"label": "Stanford AI Index 2026 documents OSWorld 12% to 66.3% accuracy jump",
"source": "https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance — Stanford AI Index",
"status": "overdue",
"weight": 0.4,
"ordinal": -6,
"source_id": null,
"confidence": 0.92,
"source_url": "https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance",
"expected_date": "2026-04-15",
"miss_emitted_at": "2026-05-02T22:07:21.384228+00:00",
"miss_emitted_by": "metadata_milestone_sweep",
"research_origin": "deep_research",
"measurement_criterion": "Stanford AI Index 2026 published showing OSWorld accuracy gain from ~12% to >=66% in 12 months"
},
{
"kind": "prereq",
"label": "Nvidia quadrupled chip production output while only doubling human headcount — achieved by deploying AI coding tools (Cursor, Claude Code) a",
"status": "hit",
"weight": 0.5,
"ordinal": -5,
"source_id": "SEM_012",
"expected_date": "2026-04-29",
"observed_date": "2026-04-29"
},
{
"kind": "prereq",
"label": "Training runs costing $10 billion for a single model will commence sometime in 2025.",
"status": "hit",
"weight": 0.5,
"ordinal": -4,
"source_id": "SEM_008",
"expected_date": "2026-04-29",
"observed_date": "2026-04-29"
},
{
"kind": "prereq",
"label": "2025 will be the definitive year that agentic systems finally hit the mainstream.",
"status": "hit",
"weight": 0.5,
"ordinal": -3,
"source_id": "SEM_042",
"expected_date": "2026-04-29",
"observed_date": "2026-04-29"
},
{
"kind": "prereq",
"label": "Recursive self-improvement is already happening now (no longer t
... (truncated)