COD_AI_004 · prediction · AI · autonomous-agents

Frontier agents reach one-workday autonomous task horizon by end 2027

Predictor: Codex Research Pack

Prior probability

55.0%

Current probability

36.0%

evolves via intake + LBP

Conviction

4/5

Signal quality

—

Resolution

pending

Window

2026-12-01 – 2027-12-31

Edges in / out

4 / 0

Tickers exposed

Prediction text

Frontier agents reach one-workday autonomous task horizon by end 2027

κ + Brier as of 2026-05-22

κ (discount)

0.850

Brier

—

Hits / Misses

0 / 0

Hit rate

—

Evidence about this node from Codex Research Pack is multiplied by κ in /api/intake. Lower κ = less weight; floors at 0.10 (effectively silenced) and caps at 1.00 (full weight).
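A minimal sketch of the discount described above. The floor (0.10) and cap (1.00) come from the text. The function names, the example likelihood ratio, and the choice to apply κ to the log-likelihood ratio are assumptions for illustration, since the page only says evidence is "multiplied by κ" in /api/intake.

```python
import math

KAPPA_FLOOR = 0.10  # from the page: effectively silenced
KAPPA_CAP = 1.00    # from the page: full weight


def clamp_kappa(kappa: float) -> float:
    """Clamp a predictor discount into [KAPPA_FLOOR, KAPPA_CAP]."""
    return max(KAPPA_FLOOR, min(KAPPA_CAP, kappa))


def discounted_update(prior: float, likelihood_ratio: float, kappa: float) -> float:
    """Odds-form Bayesian update with the evidence weight scaled by kappa.

    Assumption: the discount multiplies the log-likelihood ratio, so
    kappa = 1 is a full update and smaller kappa shrinks the move.
    """
    k = clamp_kappa(kappa)
    log_odds = math.log(prior / (1.0 - prior)) + k * math.log(likelihood_ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical evidence with likelihood ratio 2.0 against the 55% prior.
full = discounted_update(0.55, 2.0, 1.00)
discounted = discounted_update(0.55, 2.0, 0.85)
```

With κ = 0.850, as shown for this predictor, the same evidence moves the probability less than it would at full weight.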

Reference class

Not linked

This node isn't linked to a reference class. The Bayesian update applies without outside-view blending.

Probability over time

4 prob_history rows

Chart legend: intake v2 · milestone miss sweep · lbp propagation · reference class assigned · legacy v1 · prior_prob (analyst seed) · current = 36.0%

Milestone chain

Pre-event signals (upstream prereqs + window checkpoints) → resolution event → downstream cascades. Status/dates update from linked nodes; re-derive nightly via scripts/ops/derive_milestones.py.

Leading chain: 1 fired ✓ · 6 pending

2026-01-29 · hit · METR Time Horizon 1.1 framework released January 2026
How: METR publishes Time Horizon 1.1 evaluation expanding suite by 34% (228/170 tasks) and doubling 8+ hour tasks (31/14)
Source: https://metr.org/blog/2026-1-29-time-horizon-1-1/ (METR Time Horizon 1.1 release) · conf 99%
Notes: HIT. METR upgraded benchmark suite to handle longer-horizon evals before this prediction's window opens.
2026-04-01 → 2026-10-31 · pending · GPT-5.2 or successor leads METR Time Horizon test
How: OpenAI GPT-5.2/6 or equivalent (Claude/Gemini) sets new SOTA on METR Time Horizon benchmark with ≥6h 50% horizon
Source: https://medium.com/coding-nexus/gpt-5-2-autonomy-leading-the-metrs-time-horizon-test-83d132b92c10 (GPT-5.2 leading METR test) · conf 70%
2026-04-01 → 2026-12-31 · pending · Frontier model achieves ≥4-hour 50% time horizon
How: METR-published 50% time horizon for any frontier generalist agent reaches ≥4 hours (current doubles every 7 months from ~14h on coding subset)
Source: https://metr.org/time-horizons/ (METR exponential doubling pattern) · conf 75%
2026-06-01 → 2027-06-30 · pending · METR researchers simulate 200-hour time horizon AIs
How: METR publishes tabletop exercise or simulation report for 200-hour-horizon agents (precondition for 8-hour reliable horizon)
Source: https://metr.org/time-horizons/ (Thomas Kwa describes 200h tabletop) · conf 55%
2027-01-27 · pending · Q1 window check-in (25%)
2027-03-26 · pending · Q2 window check-in (50%)
2027-05-23 · pending · Q3 window check-in (75%)
2027-07-20 · pending · Frontier agents reach one-workday autonomous task horizon by end 2027
2027-06-01 → 2027-12-31 · pending · Frontier agent demonstrates ≥8h task with ≥50% reliability (resolution)
How: METR-style public eval shows generalist frontier agent completes 8+ hour expert tasks at ≥50% success
Source: https://metr.org/time-horizons/ (7-month doubling implies 8h horizon in late 2027) · conf 55%
Notes: Direct resolution criterion per Codex pack. Doubling cadence supports late-2027 plausibility.
2027-06-01 → 2028-06-30 · pending · Real-world agent deployment (one-workday autonomous loops)
How: ≥1 enterprise (Anthropic, OpenAI, Cognition Devin, etc.) discloses production agent running ≥8h continuous tasks with measurable reliability
Source: Anthropic/OpenAI product blogs, Cognition releases · conf 45%
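The three quartile check-ins and the dated event entry in the chain are evenly spaced, which the short check below makes explicit. The dates are copied from the chain. Reading them as quarter marks of a span that ends at the 2027-07-20 event date is an interpretation, not something the page states, and the variable names are illustrative.

```python
from datetime import date

# Q1, Q2, Q3 check-ins followed by the dated event entry, as listed in the chain.
checkpoints = [
    date(2027, 1, 27),
    date(2027, 3, 26),
    date(2027, 5, 23),
    date(2027, 7, 20),
]

# Day gaps between consecutive entries.
gaps = [(later - earlier).days for earlier, later in zip(checkpoints, checkpoints[1:])]

# Assumption: the same spacing also separates the span start from Q1.
implied_start = checkpoints[0] - (checkpoints[1] - checkpoints[0])

print(gaps, implied_start.isoformat())
```

Under that assumption the implied start falls one day before the listed window start of 2026-12-01, so the check-ins appear to quarter the run-up to the expected event date, not the full window through 2027-12-31.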

What if this resolves?

Clamp this prediction TRUE or FALSE and run a counterfactual Gibbs sample, which returns the predictions whose marginals shift most under that assumption. Each run takes about 30 seconds and is suited to stress-testing "if X resolves, what else moves?"

(live posterior: 36%)
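To make the clamping idea concrete, here is a toy sketch on a two-node network built from the single parent edge listed on this page (S_AGI_MID_2029 at 35.0%, with P(c|s=T) = 0.550 and P(c|s=F) = 0.050). The network shape, function names, sweep count, and seed are assumptions. With only one free node, each Gibbs sweep reduces to drawing the parent from its full conditional; a real run would cycle through every unclamped node.

```python
import random

# Values copied from the parent row on this page; the two-node network is a toy.
P_PARENT = 0.35                              # S_AGI_MID_2029 marginal
P_CHILD_GIVEN = {True: 0.550, False: 0.050}  # P(c|s=T), P(c|s=F)


def parent_conditional(child_value: bool) -> float:
    """P(parent = True | child = child_value), by Bayes' rule."""
    def likelihood(parent_value: bool) -> float:
        p = P_CHILD_GIVEN[parent_value]
        return p if child_value else 1.0 - p

    w_true = P_PARENT * likelihood(True)
    w_false = (1.0 - P_PARENT) * likelihood(False)
    return w_true / (w_true + w_false)


def clamped_gibbs(child_value: bool, sweeps: int = 20_000, seed: int = 0) -> float:
    """Estimate the parent's marginal while the child stays clamped."""
    rng = random.Random(seed)
    p = parent_conditional(child_value)
    hits = 0
    for _ in range(sweeps):
        # One sweep: resample the only free node from its full conditional.
        if rng.random() < p:
            hits += 1
    return hits / sweeps


if_true = clamped_gibbs(True)
if_false = clamped_gibbs(False)
shift = if_true - P_PARENT  # how far the parent moves when this node is clamped TRUE
```

Under these toy assumptions, clamping TRUE pushes the parent estimate well above its 35% marginal and clamping FALSE pulls it below.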

Evidence chain

Every probability update with full Bayesian provenance — chronological, latest first

LBP · 2026-05-24T02:00:02Z · 36.0% · -1.2pp

Network propagation: 37.2% → 36.0%

4-iter LBP, residual 0.01000 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run 806b02f8

LBP · 2026-05-17T02:00:01Z · 37.2% · -2.5pp

Network propagation: 39.7% → 37.2%

5-iter LBP, residual 0.00689 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run e607fa96

LBP · 2026-05-10T02:00:02Z · 39.7% · -5.1pp

Network propagation: 44.7% → 39.7%

6-iter LBP, residual 0.00584 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run e5c18d29

LBP · 2026-05-03T02:00:01Z · 44.7% · -10.3pp

Network propagation: 55.0% → 44.7%

6-iter LBP, residual 0.00677 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run 1a683ac9
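The four LBP steps above shrink by roughly half each run (-10.3, -5.1, -2.5, -1.2 pp), which is the pattern a damping factor of 0.5 toward a fixed target would produce. The sketch below back-solves that target from each pair of consecutive values. The update rule `new = old + damping * (target - old)` is an assumption; the page reports only the damping value and the method name lbp_v3, and `w_intrinsic` is not modelled here.

```python
DAMPING = 0.5  # reported on each LBP row

# Probabilities (%) from the evidence chain, oldest first.
history = [55.0, 44.7, 39.7, 37.2, 36.0]


def implied_target(old: float, new: float, damping: float = DAMPING) -> float:
    """Invert new = old + damping * (target - old) to recover the target."""
    return old + (new - old) / damping


targets = [round(implied_target(a, b), 1) for a, b in zip(history, history[1:])]
print(targets)
```

If the assumed rule is close to what the propagation does, every step points at a similar target in the mid-30s, so further runs would be expected to move the node only slightly unless neighboring beliefs change.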

Network propagation neighbors

Top edges sorted by latest LBP cross-impact

Top incoming (parents)

Edges that influence THIS node's belief

Kind	Node	Their prob	P(c|s=T)	P(c|s=F)	Δ implied
prereq	S_AGI_MID_2029 AGI mid: Kurzweil 2029 path	35.0%	0.550	0.050	-0.135
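The Δ implied value in the row above is reproduced by the law of total probability, comparing the result with this node's current 36.0%. This is a reading that matches the displayed numbers, not a confirmed description of how the page computes the column; the function name is illustrative.

```python
def implied_child_prob(p_parent: float, p_c_given_true: float, p_c_given_false: float) -> float:
    """P(c) = P(s) * P(c|s=T) + (1 - P(s)) * P(c|s=F)."""
    return p_parent * p_c_given_true + (1.0 - p_parent) * p_c_given_false


# Numbers from the parent row and from this node's current probability.
implied = implied_child_prob(0.35, 0.550, 0.050)  # 0.1925 + 0.0325
delta = implied - 0.36                            # implied minus current

print(f"implied={implied:.3f} delta={delta:+.3f}")
```

On this reading the parent edge alone implies about 22.5% for this node, which is 13.5 points below the current value.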

Top outgoing (children)

Predictions THIS node influences

No outgoing edges.

Ticker exposure

10 tickers linked

Beneficiaries (5)

AMZN META MSFT NVDA GOOGL

Adverse (5)

WNS CTSH EPAM INFY ACN

Prerequisites (4)

Predictions that must hit first

Type	Pred	Title	Domain	Lag
prereq	S_AGI_MID_2029	AGI mid: Kurzweil 2029 path	agi_general_capability	—
correlate	S_AGI_FAST_2027	AGI fast: drop-in remote worker by 2027-09	agi_general_capability	—
correlate	S_ROBOTAXI_MASS_2030	Robotaxi >10% urban miles by Nov 2030	robotaxi_deployment	—
correlate	S_AI_PAUSE_2026	Major-country AI pause beginning 2026	ai_regulatory_pause	—

Dependents (0)

Predictions enabled by this

Type	Pred	Title	Domain	Lag
No dependents

Linked documents (10)

Auto-generated by cosine similarity from Polymarket / Manifold / EDGAR / GDELT

Sim	Source	Title	Market prob	Polarity	Reviewed	Published
0.708	manifold	Will any frontier lab be near-fully automated before 2029?	30%	mentions	pending	2026-05-10
0.700	codex_research_pack	METR - Measuring AI Ability to Complete Long Tasks	—	mentions	pending	2025-03-19
0.700	codex_research_pack	OECD - Exploring Possible AI Trajectories Through 2030	—	mentions	pending	2026-04-26
0.681	arxiv	Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?	—	mentions	pending	2026-05-04
0.656	arxiv	OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories	—	mentions	pending	2026-05-05
0.618	arxiv	SAGA: A Robust Self-Attention and Goal-Aware Anchor-based Planner for Safe UAV Autonomous Navigation	—	mentions	pending	2026-05-04
0.603	arxiv	Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation	—	mentions	pending	2026-05-07
0.599	manifold	Best METR 80% Time Horizon before October 2026	—	mentions	pending	2026-06-04
0.579	manifold	Claude Opus 4.7 METR 50% time horizon	—	mentions	pending	2026-05-05
0.568	manifold	What goals will I achieve this week?	—	mentions	pending	2026-05-10
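A minimal sketch of how a similarity column like Sim could be produced by cosine similarity between embedding vectors. The embedding model, the vector contents, and the function names are all assumptions; the page states only that links are auto-generated by cosine similarity.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query: List[float], docs: Dict[str, List[float]], top_k: int = 10) -> List[Tuple[float, str]]:
    """Return the top_k (similarity, title) pairs, highest similarity first."""
    scored = [(cosine_similarity(query, vec), title) for title, vec in docs.items()]
    return sorted(scored, reverse=True)[:top_k]


# Hypothetical 3-dimensional embeddings, for illustration only.
prediction_vec = [0.9, 0.4, 0.1]
docs = {
    "long-task benchmark report": [0.8, 0.5, 0.1],
    "UAV navigation planner": [0.2, 0.3, 0.9],
}
ranked = rank_documents(prediction_vec, docs)
```

Because the score depends only on vector direction, loosely related documents can still rank in the list, which is one reason a manual Reviewed column is useful.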

Raw metadata

From Thesis_Timeline_v1.0_FINAL workbook

{
  "pack_id": "codex_research_event_pack_2026_04_30",
  "milestones": [
    {
      "kind": "llm_pre_event",
      "label": "METR Time Horizon 1.1 framework released January 2026",
      "notes": "HIT — METR upgraded benchmark suite to handle longer-horizon evals before this prediction's window opens.",
      "source": "https://metr.org/blog/2026-1-29-time-horizon-1-1/ — METR Time Horizon 1.1 release",
      "status": "hit",
      "weight": 0.4,
      "ordinal": -7,
      "source_id": null,
      "confidence": 0.99,
      "source_url": "https://metr.org/blog/2026-1-29-time-horizon-1-1/",
      "expected_date": "2026-01-29",
      "observed_date": "2026-01-29",
      "research_origin": "deep_research",
      "measurement_criterion": "METR publishes Time Horizon 1.1 evaluation expanding suite by 34% (228/170 tasks) and doubling 8+ hour tasks (31/14)"
    },
    {
      "kind": "llm_pre_event",
      "label": "GPT-5.2 or successor leads METR Time Horizon test",
      "source": "https://medium.com/coding-nexus/gpt-5-2-autonomy-leading-the-metrs-time-horizon-test-83d132b92c10 — GPT-5.2 leading METR test",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -6,
      "source_id": null,
      "confidence": 0.7,
      "source_url": "https://medium.com/coding-nexus/gpt-5-2-autonomy-leading-the-metrs-time-horizon-test-83d132b92c10",
      "expected_date": "2026-07-16",
      "research_origin": "deep_research",
      "expected_date_range": {
        "to": "2026-10-31",
        "from": "2026-04-01"
      },
      "measurement_criterion": "OpenAI GPT-5.2/6 or equivalent (Claude/Gemini) sets new SOTA on METR Time Horizon benchmark with ≥6h 50% horizon"
    },
    {
      "kind": "llm_pre_event",
      "label": "Frontier model achieves ≥4-hour 50% time horizon",
      "source": "https://metr.org/time-horizons/ — METR exponential doubling pattern",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -5,
      "source_id": null,
      "confidence": 0.75,
      "source_url": "https://metr.org/time-horizons/",
      "expected_date": "2026-08-16",
      "research_origin": "deep_research",
      "expected_date_range": {
        "to": "2026-12-31",
        "from": "2026-04-01"
      },
      "measurement_criterion": "METR-published 50% time horizon for any frontier generalist agent reaches ≥4 hours (current doubles every 7 months from ~14h on coding subset)"
    },
    {
      "kind": "llm_pre_event",
      "label": "METR researchers simulate 200-hour time horizon AIs",
      "source": "https://metr.org/time-horizons/ — Thomas Kwa describes 200h tabletop",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -4,
      "source_id": null,
      "confidence": 0.55,
      "source_url": "https://metr.org/time-horizons/",
      "expected_date": "2026-12-15",
      "research_origin": "deep_research",
      "expected_date_range": {
        "to": "2027-06-30",
        "from": "2026-06-01"
      },
      "measurement_criterion": "METR publishes tabletop exercise or simulation report for 200-hour-horizon agents (precondition for 8-hour reliable horizon)"
    },
    {
      "kind": "quartile_checkpoint",
      "label": "Q1 window check-in (25%)",
      "status": "pending",
      "weight": 0.05,
      "ordinal": -3,
      "source_id": null,
      "expected_date": "2027-01-27",
      "observed_date": null
    },
    {
      "kind": "quartile_checkpoint",
      "label": "Q2 window check-in (50%)",
      "status": "pending",
      "weight": 0.05,
      "ordinal": -2,
      "source_id": null,
      "expected_date": "2027-03-26",
      "observed_date": null
    },
    {
      "kind": "quartile_checkpoint",
      "label": "Q3 window check-in (75%)",
      "status": "pending",
      "weight": 0.05,
      "ordinal": -1,
      "source_id": null,
      "expected_date": "2027-05-23",
      "observed_date": null
    },
    {
      "kind": "event",
      "label": "Frontier agents reach one-workday autonomous task horizon by end 2027",

... (truncated)