COD_AI_004 · prediction · AI · autonomous-agents
Frontier agents reach one-workday autonomous task horizon by end 2027
Predictor: Codex Research Pack
Prior probability
55.0%
Current probability
36.0%
evolves via intake + LBP
Conviction
4/5
Signal quality
—
Resolution
pending
Window
2026-12-01 – 2027-12-31
Edges in / out
4 / 0
Tickers exposed
10
Prediction text
Frontier agents reach one-workday autonomous task horizon by end 2027
κ + Brier as of 2026-05-22
κ (discount)
0.850
Brier
—
Hits / Misses
0 / 0
Hit rate
—
Evidence about this node from Codex Research Pack is multiplied by κ in /api/intake. Lower κ = less weight; floors at 0.10 (effectively silenced) and caps at 1.00 (full weight).
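The floor and cap described above can be sketched as a clamp followed by a discounted update. This is a minimal illustration, not the code behind /api/intake: the function names are hypothetical, and applying κ to the log-likelihood ratio is an assumption, since the page only says the evidence is "multiplied by κ".

```python
import math

KAPPA_FLOOR = 0.10  # "effectively silenced", per the page
KAPPA_CAP = 1.00    # "full weight", per the page


def clamp_kappa(kappa: float) -> float:
    """Keep kappa inside the [floor, cap] range stated on the page."""
    return max(KAPPA_FLOOR, min(KAPPA_CAP, kappa))


def discounted_update(prior: float, likelihood_ratio: float, kappa: float) -> float:
    """Bayesian update in log-odds space with the evidence scaled by kappa.

    ASSUMPTION: kappa multiplies the log-likelihood ratio. The page does not
    say which quantity is multiplied, so treat this as one plausible reading.
    """
    k = clamp_kappa(kappa)
    prior_logit = math.log(prior / (1.0 - prior))
    posterior_logit = prior_logit + k * math.log(likelihood_ratio)
    return 1.0 / (1.0 + math.exp(-posterior_logit))


# With kappa = 0.850, evidence moves the belief less than it would at full weight.
full = discounted_update(0.36, 2.0, 1.00)
discounted = discounted_update(0.36, 2.0, 0.850)
print(f"full weight: {full:.3f}, kappa 0.850: {discounted:.3f}")
```

Under this reading, κ = 1.00 reproduces the ordinary Bayes update, and a likelihood ratio of 1 leaves the prior unchanged at any κ.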
Reference class
Not linked
This node isn't linked to a reference class. The Bayesian update applies without outside-view blending.
Probability over time
4 prob_history rows
Chart legend: intake v2 · milestone miss sweep · lbp propagation · reference class assigned · legacy v1 · prior_prob (analyst seed) · current = 36.0%
Milestone chain
Pre-event signals (upstream prereqs + window checkpoints) → resolution event → downstream cascades. Status/dates update from linked nodes; re-derive nightly via scripts/ops/derive_milestones.py.
Leading chain: 1 fired ✓ · 6 pending
- 2026-01-29 · hit · METR Time Horizon 1.1 framework released January 2026 · How: METR publishes Time Horizon 1.1 evaluation expanding suite by 34% (228/170 tasks) and doubling 8+ hour tasks (31/14) · Source: https://metr.org/blog/2026-1-29-time-horizon-1-1/ (METR Time Horizon 1.1 release) · conf 99% · Notes: HIT. METR upgraded benchmark suite to handle longer-horizon evals before this prediction's window opens.
- 2026-04-01 → 2026-10-31 · pending · GPT-5.2 or successor leads METR Time Horizon test · How: OpenAI GPT-5.2/6 or equivalent (Claude/Gemini) sets new SOTA on METR Time Horizon benchmark with ≥6h 50% horizon · Source: https://medium.com/coding-nexus/gpt-5-2-autonomy-leading-the-metrs-time-horizon-test-83d132b92c10 (GPT-5.2 leading METR test) · conf 70%
- 2026-04-01 → 2026-12-31 · pending · Frontier model achieves ≥4-hour 50% time horizon · How: METR-published 50% time horizon for any frontier generalist agent reaches ≥4 hours (current doubles every 7 months from ~14h on coding subset) · Source: https://metr.org/time-horizons/ (METR exponential doubling pattern) · conf 75%
- 2026-06-01 → 2027-06-30 · pending · METR researchers simulate 200-hour time horizon AIs · How: METR publishes tabletop exercise or simulation report for 200-hour-horizon agents (precondition for 8-hour reliable horizon) · Source: https://metr.org/time-horizons/ (Thomas Kwa describes 200h tabletop) · conf 55%
- 2027-01-27 · pending · Q1 window check-in (25%)
- 2027-03-26 · pending · Q2 window check-in (50%)
- 2027-05-23 · pending · Q3 window check-in (75%)
- 2027-06-01 → 2027-12-31 · pending · Frontier agent demonstrates ≥8h task with ≥50% reliability (resolution) · How: METR-style public eval shows generalist frontier agent completes 8+ hour expert tasks at ≥50% success · Source: https://metr.org/time-horizons/ (7-month doubling implies 8h horizon in late 2027) · conf 55% · Notes: Direct resolution criterion per Codex pack. Doubling cadence supports late-2027 plausibility.
- 2027-06-01 → 2028-06-30 · pending · Real-world agent deployment (one-workday autonomous loops) · How: ≥1 enterprise (Anthropic, OpenAI, Cognition Devin, etc.) discloses production agent running ≥8h continuous tasks with measurable reliability · Source: Anthropic/OpenAI product blogs, Cognition releases · conf 45%
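The doubling arithmetic behind the milestone chain can be checked with a short calculation. This is a sketch assuming a clean exponential trend with the 7-month doubling time cited in the milestones; the function name is hypothetical, and the starting horizons are the milestone thresholds (≥4h and ≥6h), not measured values.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling cadence cited in the milestone sources


def months_to_reach(start_hours: float, target_hours: float,
                    doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months for the horizon to grow from start to target on a pure
    exponential trend: doubling_months * log2(target / start)."""
    return doubling_months * math.log2(target_hours / start_hours)


# From the >=4h milestone to the 8h resolution threshold: exactly one doubling.
print(months_to_reach(4, 8))
# From the >=6h milestone to the 8h resolution threshold.
print(round(months_to_reach(6, 8), 1))
```

On this trend, a 4-hour horizon reached by the end of 2026 would put 8 hours about seven months later, in mid-2027, which falls inside the 2027-06-01 → 2027-12-31 resolution range listed above.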
What if this resolves?
Clamp this prediction TRUE or FALSE and run a counterfactual Gibbs sample. The run returns the predictions whose marginals shift most under that assumption.
(live posterior: 36%)
Each run takes about 30 s; it is intended for stress-testing "if X resolves, what else moves?"
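A clamped Gibbs run can be illustrated on a toy three-node network. This is a sketch under stated assumptions, not the site's sampler: the parent prior (35%) and the conditionals 0.550 / 0.050 come from the prereq edge listed under "Top incoming (parents)", while the second child `D`, its conditional table, and all function names are hypothetical additions so that there is more than one free variable to sample.

```python
import random

P_S = 0.35                                # parent prior (from the prereq edge)
P_C_GIVEN_S = {True: 0.55, False: 0.05}   # this node given the parent (from the edge)
P_D_GIVEN_S = {True: 0.60, False: 0.20}   # HYPOTHETICAL sibling node, invented here


def weight(p_true: float, value: bool) -> float:
    """Probability of a Bernoulli outcome given P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def gibbs_clamped(c_value: bool, n_samples: int = 20000,
                  burn_in: int = 1000, seed: int = 0):
    """Clamp C to c_value, then alternately resample S and D from their
    full conditionals. Returns estimated P(S=T | C) and P(D=T | C)."""
    rng = random.Random(seed)
    s, d = True, True
    s_hits = d_hits = 0
    for i in range(n_samples + burn_in):
        # S depends on its prior and on both children (C clamped, D current).
        w_t = P_S * weight(P_C_GIVEN_S[True], c_value) * weight(P_D_GIVEN_S[True], d)
        w_f = (1.0 - P_S) * weight(P_C_GIVEN_S[False], c_value) * weight(P_D_GIVEN_S[False], d)
        s = rng.random() < w_t / (w_t + w_f)
        # D depends only on its parent S.
        d = rng.random() < P_D_GIVEN_S[s]
        if i >= burn_in:
            s_hits += s
            d_hits += d
    return s_hits / n_samples, d_hits / n_samples


p_s_true, p_d_true = gibbs_clamped(True)
p_s_false, p_d_false = gibbs_clamped(False)
print(f"clamp TRUE : P(S)={p_s_true:.3f}  P(D)={p_d_true:.3f}")
print(f"clamp FALSE: P(S)={p_s_false:.3f}  P(D)={p_d_false:.3f}")
```

In this toy network the exact answers are available by Bayes' rule (P(S=T | C=T) = 0.1925 / 0.225 ≈ 0.856), so the sampler's estimates can be checked against them; the difference between the TRUE and FALSE runs is the kind of marginal shift the tool reports.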
Evidence chain
Every probability update with full Bayesian provenance — chronological, latest first
LBP · 2026-05-24T02:00:02Z · 36.0% · -1.2pp
Network propagation: 37.2% → 36.0%
4-iter LBP, residual 0.01000 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run 806b02f8
LBP · 2026-05-17T02:00:01Z · 37.2% · -2.5pp
Network propagation: 39.7% → 37.2%
5-iter LBP, residual 0.00689 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run e607fa96
LBP · 2026-05-10T02:00:02Z · 39.7% · -5.1pp
Network propagation: 44.7% → 39.7%
6-iter LBP, residual 0.00584 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run e5c18d29
LBP · 2026-05-03T02:00:01Z · 44.7% · -10.3pp
Network propagation: 55.0% → 44.7%
6-iter LBP, residual 0.00677 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run 1a683ac9
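The four LBP steps roughly halve in size each week (-10.3, -5.1, -2.5, -1.2 pp). One reading, which is an assumption and not something the page confirms, is that each run moves the stored probability a fraction `damping` of the way toward a network-implied target. The sketch below inverts that hypothetical rule to see what target each step would imply; the function name and the update rule itself are illustrative only.

```python
DAMPING = 0.5  # value reported on every LBP row

# (before, after) in percent, oldest first, copied from the evidence chain.
HISTORY = [(55.0, 44.7), (44.7, 39.7), (39.7, 37.2), (37.2, 36.0)]


def implied_target(before: float, after: float, damping: float = DAMPING) -> float:
    """Invert the ASSUMED rule: after = before + damping * (target - before)."""
    return before + (after - before) / damping


targets = [round(implied_target(b, a), 1) for b, a in HISTORY]
print(targets)
```

If the assumed rule held, the implied targets would cluster in a narrow band in the mid-30s, which would suggest the node is converging toward a roughly fixed network-implied value instead of reacting to new evidence each week.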
Network propagation neighbors
Top edges sorted by latest LBP cross-impact
Top incoming (parents)
Edges that influence THIS node's belief
| Kind | Node | Their prob | P(c|s=T) | P(c|s=F) | Δ implied |
|---|---|---|---|---|---|
| prereq | S_AGI_MID_2029 AGI mid: Kurzweil 2029 path | 35.0% | 0.550 | 0.050 | -0.135 |
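The "Δ implied" column appears to be the marginal implied by the parent edge minus this node's current probability. That definition is inferred from the numbers in the row, not stated on the page, and the function name is hypothetical. A quick check by the law of total probability:

```python
def implied_marginal(p_parent: float, p_child_if_true: float,
                     p_child_if_false: float) -> float:
    """P(c) = P(s) * P(c|s=T) + (1 - P(s)) * P(c|s=F)."""
    return p_parent * p_child_if_true + (1.0 - p_parent) * p_child_if_false


p_parent = 0.35   # "Their prob" for S_AGI_MID_2029
current = 0.36    # this node's current probability

implied = implied_marginal(p_parent, 0.550, 0.050)
delta = implied - current
print(f"implied = {implied:.4f}, delta = {delta:+.3f}")
# → implied = 0.2250, delta = -0.135
```

The result matches the -0.135 shown in the table: taken alone, this edge implies a 22.5% marginal, below the current 36.0%.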
Top outgoing (children)
Predictions THIS node influences
No outgoing edges.
Ticker exposure
10 ticker(s) linked
Beneficiaries (5)
Adverse (5)
Prerequisites (4)
Predictions that must hit first
| Type | Pred | Title | Domain | Lag |
|---|---|---|---|---|
| prereq | S_AGI_MID_2029 | AGI mid: Kurzweil 2029 path | agi_general_capability | — |
| correlate | S_AGI_FAST_2027 | AGI fast: drop-in remote worker by 2027-09 | agi_general_capability | — |
| correlate | S_ROBOTAXI_MASS_2030 | Robotaxi >10% urban miles by Nov 2030 | robotaxi_deployment | — |
| correlate | S_AI_PAUSE_2026 | Major-country AI pause beginning 2026 | ai_regulatory_pause | — |
Dependents (0)
Predictions enabled by this
| Type | Pred | Title | Domain | Lag |
|---|---|---|---|---|
| No dependents | | | | |
Linked documents (10)
Auto-generated by cosine similarity from Polymarket / Manifold / EDGAR / GDELT
Raw metadata
From Thesis_Timeline_v1.0_FINAL workbook
{
"pack_id": "codex_research_event_pack_2026_04_30",
"milestones": [
{
"kind": "llm_pre_event",
"label": "METR Time Horizon 1.1 framework released January 2026",
"notes": "HIT — METR upgraded benchmark suite to handle longer-horizon evals before this prediction's window opens.",
"source": "https://metr.org/blog/2026-1-29-time-horizon-1-1/ — METR Time Horizon 1.1 release",
"status": "hit",
"weight": 0.4,
"ordinal": -7,
"source_id": null,
"confidence": 0.99,
"source_url": "https://metr.org/blog/2026-1-29-time-horizon-1-1/",
"expected_date": "2026-01-29",
"observed_date": "2026-01-29",
"research_origin": "deep_research",
"measurement_criterion": "METR publishes Time Horizon 1.1 evaluation expanding suite by 34% (228/170 tasks) and doubling 8+ hour tasks (31/14)"
},
{
"kind": "llm_pre_event",
"label": "GPT-5.2 or successor leads METR Time Horizon test",
"source": "https://medium.com/coding-nexus/gpt-5-2-autonomy-leading-the-metrs-time-horizon-test-83d132b92c10 — GPT-5.2 leading METR test",
"status": "pending",
"weight": 0.4,
"ordinal": -6,
"source_id": null,
"confidence": 0.7,
"source_url": "https://medium.com/coding-nexus/gpt-5-2-autonomy-leading-the-metrs-time-horizon-test-83d132b92c10",
"expected_date": "2026-07-16",
"research_origin": "deep_research",
"expected_date_range": {
"to": "2026-10-31",
"from": "2026-04-01"
},
"measurement_criterion": "OpenAI GPT-5.2/6 or equivalent (Claude/Gemini) sets new SOTA on METR Time Horizon benchmark with ≥6h 50% horizon"
},
{
"kind": "llm_pre_event",
"label": "Frontier model achieves ≥4-hour 50% time horizon",
"source": "https://metr.org/time-horizons/ — METR exponential doubling pattern",
"status": "pending",
"weight": 0.4,
"ordinal": -5,
"source_id": null,
"confidence": 0.75,
"source_url": "https://metr.org/time-horizons/",
"expected_date": "2026-08-16",
"research_origin": "deep_research",
"expected_date_range": {
"to": "2026-12-31",
"from": "2026-04-01"
},
"measurement_criterion": "METR-published 50% time horizon for any frontier generalist agent reaches ≥4 hours (current doubles every 7 months from ~14h on coding subset)"
},
{
"kind": "llm_pre_event",
"label": "METR researchers simulate 200-hour time horizon AIs",
"source": "https://metr.org/time-horizons/ — Thomas Kwa describes 200h tabletop",
"status": "pending",
"weight": 0.4,
"ordinal": -4,
"source_id": null,
"confidence": 0.55,
"source_url": "https://metr.org/time-horizons/",
"expected_date": "2026-12-15",
"research_origin": "deep_research",
"expected_date_range": {
"to": "2027-06-30",
"from": "2026-06-01"
},
"measurement_criterion": "METR publishes tabletop exercise or simulation report for 200-hour-horizon agents (precondition for 8-hour reliable horizon)"
},
{
"kind": "quartile_checkpoint",
"label": "Q1 window check-in (25%)",
"status": "pending",
"weight": 0.05,
"ordinal": -3,
"source_id": null,
"expected_date": "2027-01-27",
"observed_date": null
},
{
"kind": "quartile_checkpoint",
"label": "Q2 window check-in (50%)",
"status": "pending",
"weight": 0.05,
"ordinal": -2,
"source_id": null,
"expected_date": "2027-03-26",
"observed_date": null
},
{
"kind": "quartile_checkpoint",
"label": "Q3 window check-in (75%)",
"status": "pending",
"weight": 0.05,
"ordinal": -1,
"source_id": null,
"expected_date": "2027-05-23",
"observed_date": null
},
{
"kind": "event",
"label": "Frontier agents reach one-workday autonomous task horizon by end 2027",
... (truncated)