COD_AI_004 · prediction · AI · autonomous-agents

Frontier agents reach one-workday autonomous task horizon by end 2027

Predictor: Codex Research Pack

Prior probability
55.0%
Current probability
36.0%
evolves via intake + LBP
Conviction
4/5
Signal quality
Resolution
pending
Window
2026-12-01 – 2027-12-31
Edges in / out
4 / 0
Tickers exposed
10

Prediction text

Frontier agents reach one-workday autonomous task horizon by end 2027


κ + Brier as of 2026-05-22
κ (discount)
0.850
Brier
Hits / Misses
0 / 0
Hit rate

Evidence about this node from Codex Research Pack is multiplied by κ in /api/intake. Lower κ = less weight; floors at 0.10 (effectively silenced) and caps at 1.00 (full weight).
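A minimal sketch of how such a discount could work. The floor (0.10) and cap (1.00) come from the description above; applying κ as a multiplier on the log-likelihood ratio is an assumption, since the page does not say how /api/intake combines κ with the evidence. The function names are hypothetical.

```python
import math

KAPPA_FLOOR = 0.10  # from the page: effectively silenced
KAPPA_CAP = 1.00    # from the page: full weight


def clamp_kappa(kappa: float) -> float:
    """Keep kappa inside the documented [0.10, 1.00] range."""
    return min(max(kappa, KAPPA_FLOOR), KAPPA_CAP)


def discounted_update(prior: float, likelihood_ratio: float, kappa: float) -> float:
    """Bayesian update in log-odds space with the evidence scaled by kappa.

    ASSUMPTION: the discount multiplies the log-likelihood ratio.
    The real intake endpoint may weight evidence differently.
    """
    log_odds = math.log(prior / (1.0 - prior))
    log_odds += clamp_kappa(kappa) * math.log(likelihood_ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


full = discounted_update(0.55, 2.0, 1.00)        # undiscounted update
discounted = discounted_update(0.55, 2.0, 0.85)  # this node's kappa
silenced = discounted_update(0.55, 2.0, 0.0)     # clamped up to the 0.10 floor
```

Under this assumption, a κ of 0.850 moves the posterior slightly less than full-weight evidence would, and a κ below the floor still contributes a small amount rather than nothing.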

Reference class

Not linked

This node isn't linked to a reference class. The Bayesian update applies without outside-view blending.

Probability over time

4 prob_history rows
[Chart: probability (0% to 100%) by date, 2026-05-03 to 2026-05-24; prior 55%, current = 36.0%. Legend: intake v2, milestone miss sweep, lbp propagation, reference class assigned, legacy v1, prior_prob (analyst seed).]

Milestone chain

Pre-event signals (upstream prereqs + window checkpoints) → resolution event → downstream cascades. Statuses and dates update from linked nodes and are re-derived nightly via scripts/ops/derive_milestones.py.
Leading chain: 1 fired ✓ · 6 pending
  1. 2026-01-29 · hit · METR Time Horizon 1.1 framework released January 2026
    How: METR publishes Time Horizon 1.1 evaluation expanding suite by 34% (228/170 tasks) and doubling 8+ hour tasks (31/14)
    Source: https://metr.org/blog/2026-1-29-time-horizon-1-1/ (METR Time Horizon 1.1 release) · conf 99%
    Notes: HIT. METR upgraded benchmark suite to handle longer-horizon evals before this prediction's window opens.
  2. 2026-04-01 → 2026-10-31 · pending · GPT-5.2 or successor leads METR Time Horizon test
    How: OpenAI GPT-5.2/6 or equivalent (Claude/Gemini) sets new SOTA on METR Time Horizon benchmark with ≥6h 50% horizon
    Source: https://medium.com/coding-nexus/gpt-5-2-autonomy-leading-the-metrs-time-horizon-test-83d132b92c10 (GPT-5.2 leading METR test) · conf 70%
  3. 2026-04-01 → 2026-12-31 · pending · Frontier model achieves ≥4-hour 50% time horizon
    How: METR-published 50% time horizon for any frontier generalist agent reaches ≥4 hours (current doubles every 7 months from ~14h on coding subset)
    Source: https://metr.org/time-horizons/ (METR exponential doubling pattern) · conf 75%
  4. 2026-06-01 → 2027-06-30 · pending · METR researchers simulate 200-hour time horizon AIs
    How: METR publishes tabletop exercise or simulation report for 200-hour-horizon agents (precondition for 8-hour reliable horizon)
    Source: https://metr.org/time-horizons/ (Thomas Kwa describes 200h tabletop) · conf 55%
  5. 2027-01-27 · pending · Q1 window check-in (25%)
  6. 2027-03-26 · pending · Q2 window check-in (50%)
  7. 2027-05-23 · pending · Q3 window check-in (75%)
  8. 2027-06-01 → 2027-12-31 · pending · Frontier agent demonstrates ≥8h task with ≥50% reliability (resolution)
    How: METR-style public eval shows generalist frontier agent completes 8+ hour expert tasks at ≥50% success
    Source: https://metr.org/time-horizons/ (7-month doubling implies 8h horizon in late 2027) · conf 55%
    Notes: Direct resolution criterion per Codex pack. Doubling cadence supports late-2027 plausibility.
  9. 2027-06-01 → 2028-06-30 · pending · Real-world agent deployment (one-workday autonomous loops)
    How: ≥1 enterprise (Anthropic, OpenAI, Cognition Devin, etc.) discloses production agent running ≥8h continuous tasks with measurable reliability
    Source: Anthropic/OpenAI product blogs, Cognition releases · conf 45%

What if this resolves?

Clamp this prediction TRUE or FALSE and run a counterfactual Gibbs sample to surface the predictions whose marginals shift most under that assumption (live posterior: 36%). Each run takes about 30 s; it is suited to stress-testing "if X resolves, what else moves?"
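To illustrate what clamping plus Gibbs sampling means, here is a toy sketch on a hypothetical three-node chain A → B → C. The network, every probability in it, and the function names are illustrative assumptions, not this system's graph or sampler. C is clamped to a value, A and B are resampled in turn from their conditionals, and the estimated marginals are compared with exact enumeration.

```python
import random

# Hypothetical chain A -> B -> C; all numbers are illustrative only.
P_A = 0.35
P_B_GIVEN_A = {True: 0.55, False: 0.05}
P_C_GIVEN_B = {True: 0.70, False: 0.20}


def bern(p_true: float, value: bool) -> float:
    """Probability of `value` under a Bernoulli with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def gibbs_clamped(c_value: bool, n_samples: int = 20000,
                  burn_in: int = 1000, seed: int = 0):
    """Estimate P(A | C=c_value) and P(B | C=c_value) by Gibbs sampling."""
    rng = random.Random(seed)
    a, b = False, False
    count_a = count_b = 0
    for step in range(burn_in + n_samples):
        # Resample A from P(A | B=b), proportional to P(A) * P(b | A).
        w_t = P_A * bern(P_B_GIVEN_A[True], b)
        w_f = (1.0 - P_A) * bern(P_B_GIVEN_A[False], b)
        a = rng.random() < w_t / (w_t + w_f)
        # Resample B from P(B | A=a, C=c), proportional to P(B | a) * P(c | B).
        w_t = P_B_GIVEN_A[a] * bern(P_C_GIVEN_B[True], c_value)
        w_f = (1.0 - P_B_GIVEN_A[a]) * bern(P_C_GIVEN_B[False], c_value)
        b = rng.random() < w_t / (w_t + w_f)
        if step >= burn_in:
            count_a += a
            count_b += b
    return count_a / n_samples, count_b / n_samples


def exact_clamped(c_value: bool):
    """Exact P(A | C=c_value) and P(B | C=c_value) by enumerating A and B."""
    total = mass_a = mass_b = 0.0
    for a in (True, False):
        for b in (True, False):
            w = (bern(P_A, a) * bern(P_B_GIVEN_A[a], b)
                 * bern(P_C_GIVEN_B[b], c_value))
            total += w
            mass_a += w if a else 0.0
            mass_b += w if b else 0.0
    return mass_a / total, mass_b / total


est_a, est_b = gibbs_clamped(True)
exact_a, exact_b = exact_clamped(True)
```

In this toy network, clamping C to TRUE raises the marginal of A above its 0.35 prior; that kind of shift is what the counterfactual run ranks across predictions.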

Evidence chain

Every probability update with full Bayesian provenance — chronological, latest first
LBP · 2026-05-24T02:00:02Z · 36.0% · -1.2pp
Network propagation: 37.2% → 36.0%
4-iter LBP, residual 0.01000 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run 806b02f8
LBP · 2026-05-17T02:00:01Z · 37.2% · -2.5pp
Network propagation: 39.7% → 37.2%
5-iter LBP, residual 0.00689 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run e607fa96
LBP · 2026-05-10T02:00:02Z · 39.7% · -5.1pp
Network propagation: 44.7% → 39.7%
6-iter LBP, residual 0.00584 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run e5c18d29
LBP · 2026-05-03T02:00:01Z · 44.7% · -10.3pp
Network propagation: 55.0% → 44.7%
6-iter LBP, residual 0.00677 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run 1a683ac9
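The logged steps can be cross-checked with a few lines of arithmetic. The sketch below recomputes each weekly change from the rounded before/after values listed above and takes the ratio of successive changes. The 44.7% → 39.7% step recomputes to -5.0pp rather than the logged -5.1pp, presumably because the log works from unrounded probabilities (an assumption). The ratios sit near 0.5, which is consistent with the reported damping of 0.5, though the page does not state the update rule.

```python
# (before %, after %) pairs copied from the LBP entries above, oldest first.
steps = [(55.0, 44.7), (44.7, 39.7), (39.7, 37.2), (37.2, 36.0)]

# Change per run in percentage points, recomputed from the rounded values.
deltas = [round(after - before, 1) for before, after in steps]

# Ratio of each change to the previous one; values near 0.5 mean each
# weekly run moved the node about half as far as the run before it.
ratios = [round(deltas[i + 1] / deltas[i], 2) for i in range(len(deltas) - 1)]

print(deltas)
print(ratios)
```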

Network propagation neighbors

Top edges sorted by latest LBP cross-impact

Top incoming (parents)

Edges that influence THIS node's belief

prereq · S_AGI_MID_2029 · AGI mid: Kurzweil 2029 path
Their prob: 35.0% · P(c|s=T): 0.550 · P(c|s=F): 0.050 · Δ implied: -0.135
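The Δ implied figure can be reproduced from the other columns. This sketch assumes it is the child probability implied by the parent through the law of total probability, minus the node's current probability of 36.0%. The page does not state the formula, but the arithmetic matches the displayed -0.135. The function name is hypothetical.

```python
def implied_child_prob(p_parent: float, p_c_given_t: float, p_c_given_f: float) -> float:
    """Marginal P(c) implied by one parent s: P(s)P(c|s=T) + (1 - P(s))P(c|s=F)."""
    return p_parent * p_c_given_t + (1.0 - p_parent) * p_c_given_f


current = 0.36  # this node's current probability
implied = implied_child_prob(0.35, 0.550, 0.050)  # values from the edge row
delta_implied = implied - current

print(round(implied, 3), round(delta_implied, 3))  # 0.225 -0.135
```

On this reading, the single prereq edge on its own would pull the node toward 22.5%, well below the current 36.0%.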

Top outgoing (children)

Predictions THIS node influences

No outgoing edges.

Ticker exposure

10 ticker(s) linked

Beneficiaries (5)

AMZN, META, MSFT, NVDA, GOOGL

Adverse (5)

WNS, CTSH, EPAM, INFY, ACN

Prerequisites (4)

Predictions that must hit first
Type | Pred | Title | Domain | Lag
prereq | S_AGI_MID_2029 | AGI mid: Kurzweil 2029 path | agi_general_capability |
correlate | S_AGI_FAST_2027 | AGI fast: drop-in remote worker by 2027-09 | agi_general_capability |
correlate | S_ROBOTAXI_MASS_2030 | Robotaxi >10% urban miles by Nov 2030 | robotaxi_deployment |
correlate | S_AI_PAUSE_2026 | Major-country AI pause beginning 2026 | ai_regulatory_pause |

Dependents (0)

Predictions enabled by this
No dependents

Linked documents (10)

Auto-generated by cosine similarity from Polymarket / Manifold / EDGAR / GDELT
Sim | Source | Title | Market prob | Polarity | Reviewed | Published
0.708 | manifold | Will any frontier lab be near-fully automated before 2029? | 30% | mentions | pending | 2026-05-10
0.700 | codex_research_pack | METR - Measuring AI Ability to Complete Long Tasks | | mentions | pending | 2025-03-19
0.700 | codex_research_pack | OECD - Exploring Possible AI Trajectories Through 2030 | | mentions | pending | 2026-04-26
0.681 | arxiv | Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks? | | mentions | pending | 2026-05-04
0.656 | arxiv | OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories | | mentions | pending | 2026-05-05
0.618 | arxiv | SAGA: A Robust Self-Attention and Goal-Aware Anchor-based Planner for Safe UAV Autonomous Navigation | | mentions | pending | 2026-05-04
0.603 | arxiv | Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation | | mentions | pending | 2026-05-07
0.599 | manifold | Best METR 80% Time Horizon before October 2026 | | mentions | pending | 2026-06-04
0.579 | manifold | Claude Opus 4.7 METR 50% time horizon | | mentions | pending | 2026-05-05
0.568 | manifold | What goals will I achieve this week? | | mentions | pending | 2026-05-10

Raw metadata

From Thesis_Timeline_v1.0_FINAL workbook
{
  "pack_id": "codex_research_event_pack_2026_04_30",
  "milestones": [
    {
      "kind": "llm_pre_event",
      "label": "METR Time Horizon 1.1 framework released January 2026",
      "notes": "HIT — METR upgraded benchmark suite to handle longer-horizon evals before this prediction's window opens.",
      "source": "https://metr.org/blog/2026-1-29-time-horizon-1-1/ — METR Time Horizon 1.1 release",
      "status": "hit",
      "weight": 0.4,
      "ordinal": -7,
      "source_id": null,
      "confidence": 0.99,
      "source_url": "https://metr.org/blog/2026-1-29-time-horizon-1-1/",
      "expected_date": "2026-01-29",
      "observed_date": "2026-01-29",
      "research_origin": "deep_research",
      "measurement_criterion": "METR publishes Time Horizon 1.1 evaluation expanding suite by 34% (228/170 tasks) and doubling 8+ hour tasks (31/14)"
    },
    {
      "kind": "llm_pre_event",
      "label": "GPT-5.2 or successor leads METR Time Horizon test",
      "source": "https://medium.com/coding-nexus/gpt-5-2-autonomy-leading-the-metrs-time-horizon-test-83d132b92c10 — GPT-5.2 leading METR test",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -6,
      "source_id": null,
      "confidence": 0.7,
      "source_url": "https://medium.com/coding-nexus/gpt-5-2-autonomy-leading-the-metrs-time-horizon-test-83d132b92c10",
      "expected_date": "2026-07-16",
      "research_origin": "deep_research",
      "expected_date_range": {
        "to": "2026-10-31",
        "from": "2026-04-01"
      },
      "measurement_criterion": "OpenAI GPT-5.2/6 or equivalent (Claude/Gemini) sets new SOTA on METR Time Horizon benchmark with ≥6h 50% horizon"
    },
    {
      "kind": "llm_pre_event",
      "label": "Frontier model achieves ≥4-hour 50% time horizon",
      "source": "https://metr.org/time-horizons/ — METR exponential doubling pattern",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -5,
      "source_id": null,
      "confidence": 0.75,
      "source_url": "https://metr.org/time-horizons/",
      "expected_date": "2026-08-16",
      "research_origin": "deep_research",
      "expected_date_range": {
        "to": "2026-12-31",
        "from": "2026-04-01"
      },
      "measurement_criterion": "METR-published 50% time horizon for any frontier generalist agent reaches ≥4 hours (current doubles every 7 months from ~14h on coding subset)"
    },
    {
      "kind": "llm_pre_event",
      "label": "METR researchers simulate 200-hour time horizon AIs",
      "source": "https://metr.org/time-horizons/ — Thomas Kwa describes 200h tabletop",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -4,
      "source_id": null,
      "confidence": 0.55,
      "source_url": "https://metr.org/time-horizons/",
      "expected_date": "2026-12-15",
      "research_origin": "deep_research",
      "expected_date_range": {
        "to": "2027-06-30",
        "from": "2026-06-01"
      },
      "measurement_criterion": "METR publishes tabletop exercise or simulation report for 200-hour-horizon agents (precondition for 8-hour reliable horizon)"
    },
    {
      "kind": "quartile_checkpoint",
      "label": "Q1 window check-in (25%)",
      "status": "pending",
      "weight": 0.05,
      "ordinal": -3,
      "source_id": null,
      "expected_date": "2027-01-27",
      "observed_date": null
    },
    {
      "kind": "quartile_checkpoint",
      "label": "Q2 window check-in (50%)",
      "status": "pending",
      "weight": 0.05,
      "ordinal": -2,
      "source_id": null,
      "expected_date": "2027-03-26",
      "observed_date": null
    },
    {
      "kind": "quartile_checkpoint",
      "label": "Q3 window check-in (75%)",
      "status": "pending",
      "weight": 0.05,
      "ordinal": -1,
      "source_id": null,
      "expected_date": "2027-05-23",
      "observed_date": null
    },
    {
      "kind": "event",
      "label": "Frontier agents reach one-workday autonomous task horizon by end 2027",

... (truncated)