SEM_049 · prediction · AI/Software · jobs

AI will soon fully automate software engineering, achieving massive cost reductions via iterative self-improvement.

Predictor: Alex Wissner-Gross

Prior probability

60.0%

Current probability

55.7%

evolves via intake + LBP

Conviction

4/5

Signal quality

Resolution

partial

Window

2026-01-01 – 2028-12-31

Edges in / out

8 / 1

Tickers exposed

Prediction text

AI will soon fully automate software engineering, achieving massive cost reductions via iterative self-improvement. | Long-horizon SWE benchmark results

Key catalyst: Long-horizon SWE benchmark results

Watch events: SWE-Bench + METR long-horizon eval results; enterprise code-gen market share

Resolution evidence

Status: partial

Claude Opus 4.7 + Claude Code 42-54% share of code-gen market. Cursor $1B ARR. Anthropic 'Programmer Equivalent' benchmarks showing continued capability growth.

Predictor: Alex Wissner-Gross

κ + Brier as of 2026-05-22


κ (discount)

0.844

Brier

0.0341

excellent

Hits / Misses

6 / 1

of 11 resolved

Hit rate

54.5%

Calibration plot (stated vs observed)

Evidence about this node from Alex Wissner-Gross is multiplied by κ in /api/intake. Lower κ = less weight; floors at 0.10 (effectively silenced) and caps at 1.00 (full weight).
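The weighting rule above can be sketched as follows. This is a minimal illustration, not the actual /api/intake code: the function names `clamp_kappa` and `discount_llr` are hypothetical, and applying κ as a multiplier on the log-likelihood ratio is an inference from the `llr`, `kappa`, and `adjusted_llr` fields in the raw metadata of the evidence chain.

```python
KAPPA_FLOOR = 0.10  # stated floor: evidence is effectively silenced
KAPPA_CAP = 1.00    # stated cap: evidence counts at full weight


def clamp_kappa(kappa: float) -> float:
    """Keep the predictor discount inside the stated [0.10, 1.00] range."""
    return max(KAPPA_FLOOR, min(KAPPA_CAP, kappa))


def discount_llr(llr: float, kappa: float) -> float:
    """Scale a log-likelihood ratio by the clamped discount (assumed form)."""
    return llr * clamp_kappa(kappa)


# Values taken from the intake_event_update metadata for this node.
adjusted = discount_llr(0.6931471805599453, 0.8438)
print(round(adjusted, 4))
```

With the logged values (llr = ln 2, κ = 0.8438) the product agrees with the logged `adjusted_llr` of about 0.5849.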

Reference class: ai_capability_milestone_2y

Linked


AI reaches specific named capability (intern-level / world-class programmer / etc) within 2y of stated target

Base rate

—

5/15 historical

Inside weight

—

Outside weight

—

no pull

inside 55.7% → blend 55.7% (Δ 0.0pp)

Tetlock-style outside view: at TRF=1 (just predicted), outside view dominates (w_in=0.3). At TRF=0 (deadline), inside view dominates (w_in=1.0). The blend regularizes overconfident inside views toward the historical base rate.
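A rough sketch of the blend described above, under stated assumptions: the page gives only the two endpoint weights (w_in = 0.3 at TRF = 1 and w_in = 1.0 at TRF = 0), so the linear interpolation between them, the probability-space averaging, and the function names are all hypothetical. Returning the inside view unchanged when no base rate is set mirrors the "no pull" state shown for this node.

```python
from typing import Optional

W_IN_AT_PREDICTION = 0.3  # stated inside weight at TRF = 1
W_IN_AT_DEADLINE = 1.0    # stated inside weight at TRF = 0


def inside_weight(trf: float) -> float:
    """Interpolate w_in between the two stated endpoints (linear form is assumed)."""
    trf = min(1.0, max(0.0, trf))
    return W_IN_AT_DEADLINE - (W_IN_AT_DEADLINE - W_IN_AT_PREDICTION) * trf


def blend(inside: float, base_rate: Optional[float], trf: float) -> float:
    """Pull the inside view toward the base rate; no base rate means no pull."""
    if base_rate is None:
        return inside
    w_in = inside_weight(trf)
    return w_in * inside + (1.0 - w_in) * base_rate


# No base rate is set for this node, so the blend equals the inside view.
print(blend(0.557, None, 0.87))
```

If the 5/15 historical count were used as a base rate, the same function would pull the inside view toward 33% early in the window and leave it untouched at the deadline.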

Probability over time

7 prob_history rows

Legend: intake v2 · milestone miss sweep · lbp propagation · reference class assigned · legacy v1 · prior_prob (analyst seed) · current = 55.7%

Milestone chain

Pre-event signals (upstream prereqs + window checkpoints) → resolution event → downstream cascades. Status/dates update from linked nodes; re-derive nightly via scripts/ops/derive_milestones.py.

Leading chain: 7 fired ✓

2026-04-16 · hit · Claude Opus 4.7 hits 87.6% on SWE-Bench Verified
How: Anthropic Claude Opus 4.7 releases with >=85% SWE-Bench Verified score
Source: https://tokenmix.ai/blog/swe-bench-2026-claude-opus-4-7-wins · conf 95%
Notes: HIT: Claude Opus 4.7 leads SWE-Bench Verified at 87.6% with 1M context.
2026-04-30 · hit · OpenAI deprecates SWE-Bench Verified as too contaminated
How: OpenAI publishes statement that SWE-Bench Verified is no longer the frontier coding benchmark of record
Source: https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/ · conf 95%
Notes: HIT: OpenAI explicitly retired SWE-Bench Verified, recommending SWE-Bench Pro.
2026-04-29 · hit · Nvidia quadrupled chip production output while only doubling human headcount, achieved by deploying AI coding tools (Cursor, Claude Code) a
2026-04-29 · hit · Training runs costing $10 billion for a single model will commence sometime in 2025.
2026-04-29 · hit · 2025 will be the definitive year that agentic systems finally hit the mainstream.
2026-04-29 · hit · Jury selection begins April 27, 2026 for Musk v OpenAI trial
2026-04-29 · hit · Recursive self-improvement is already happening now (no longer three years out)
2026-05-01 · partial · AI will soon fully automate software engineering, achieving massive cost reductions via iterative self-improvement.
2026-05-01 · hit · Claude Mythos Preview crosses 90% on SWE-Bench Verified
How: Any frontier model crosses 90% accuracy on SWE-Bench Verified leaderboard
Source: https://www.marc0.dev/en/leaderboard · conf 90%
Notes: HIT: Claude Mythos Preview at 93.9% as of May 2026.
2026-09-01 → 2027-12-31 · pending · Long-horizon multi-day software-engineering benchmark crossed by Claude/GPT
How: A frontier model completes a multi-day (>=8 hour) end-to-end real-world dev task on a public benchmark with >=70% pass rate
Source: METR / Anthropic / OpenAI benchmark releases · conf 55%
Notes: Cascade: METR currently documents a <8-hour task horizon; a 5-10x extension is needed.
2026-09-01 → 2028-12-31 · pending · Iterative self-improvement loop demonstrated on coding agent
How: Published paper or product launch showing AI coding agent improves its own benchmark score across iterations without human gradient updates
Source: arXiv, Anthropic/OpenAI/DeepMind research blogs · conf 40%
2036-07-03 · pending · Job displacement will be issue 6-10 not top 5 in 10 years; AI discoveries will dominate

What if this resolves?

Clamp this prediction TRUE or FALSE and run a counterfactual Gibbs sample. The run returns the predictions whose marginals shift most under that assumption. Each run takes about 30 seconds, which makes it suited to stress-testing "if X resolves, what else moves?"

(live posterior: 56%)

Evidence chain

Every probability update with full Bayesian provenance — chronological, latest first

LBP · 2026-05-24T02:00:02Z · 55.7% · -7.5pp

Network propagation: 63.2% → 55.7%

4-iter LBP, residual 0.01000 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run 806b02f8

intake_event_update · 2026-05-21T23:15:16Z · 63.2% · +14.3pp

intake:7afeeb9a-f217-4dd2-b910-24ff14bdfc39 bayesian_v2 inside=0.632 blend=0.632 LLR=0.585 κ=0.84 no_blend

Raw metadata

{
  "trf": 0.8712612462476746,
  "kappa": 0.8438,
  "base_rate": null,
  "predictor": "Alex Wissner-Gross",
  "total_llr": 0.6931471805599453,
  "bayesian_v2": true,
  "prior_logit": -0.04500009648673667,
  "bayes_factor": "1.8:1 favoring",
  "blend_reason": "no reference_class linked",
  "inside_prior": 0.4887518739436685,
  "kappa_source": "predictor_table",
  "blend_applied": false,
  "contributions": [
    {
      "llr": 0.6931471805599453,
      "kappa": 0.8438,
      "label": "Multi-week autonomous task completion approaches 'soon fully automate' threshold.",
      "adjusted_llr": 0.5848775909564818
    }
  ],
  "evidence_kind": "intake_event_update",
  "inside_source": "history_v2",
  "inside_weight": 1,
  "outside_weight": 0,
  "posterior_prob": 0.6317839193673745,
  "evidence_origin": "daily_intake",
  "llm_suggestions": [
    {
      "polarity": "corroborates",
      "status_change": "unchanged",
      "evidence_strength": "moderate",
      "delta_prob_suggestion": 0.04
    }
  ],
  "posterior_logit": 0.5398774944697452,
  "predictor_brier": 0.03413,
  "evidence_doc_ids": [],
  "inside_posterior": 0.6317839193673745,
  "blended_posterior": 0.6317839193673745,
  "reference_class_id": null,
  "total_adjusted_llr": 0.5848775909564818,
  "predictor_n_resolved": 11
}
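The numbers in this metadata block are consistent with a log-odds update, which the sketch below reproduces. It is a reconstruction from the logged fields only, not the bayesian_v2 source: the helper names are hypothetical, and the assumption is that each contribution's LLR is multiplied by κ and added to the prior logit.

```python
import math


def logit(p: float) -> float:
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Inputs copied from the raw metadata above.
inside_prior = 0.4887518739436685
contributions = [{"llr": 0.6931471805599453, "kappa": 0.8438}]

prior_logit = logit(inside_prior)
total_adjusted_llr = sum(c["llr"] * c["kappa"] for c in contributions)
posterior_logit = prior_logit + total_adjusted_llr
posterior_prob = sigmoid(posterior_logit)
bayes_factor = math.exp(total_adjusted_llr)

print(f"prior_logit={prior_logit:.4f}")
print(f"posterior_logit={posterior_logit:.4f}")
print(f"posterior_prob={posterior_prob:.4f}")
print(f"bayes_factor={bayes_factor:.1f}:1")
```

Under these assumptions the computed values agree with the logged `prior_logit`, `posterior_logit`, and `posterior_prob` to several decimal places, and the exponentiated adjusted LLR rounds to the logged "1.8:1 favoring" Bayes factor.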

LBP · 2026-05-10T02:00:02Z · 48.9% · -1.5pp

Network propagation: 50.3% → 48.9%

6-iter LBP, residual 0.00584 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run e5c18d29

LBP · 2026-05-03T02:00:01Z · 50.3% · -2.3pp

Network propagation: 52.6% → 50.4%

6-iter LBP, residual 0.00677 · damping 0.5, w_intrinsic 0.5 · method lbp_v3 · run 1a683ac9

resolution_terminal · 2026-05-01T00:00:00Z · 50.0% · -2.6pp

resolution_terminal partial outcome=0.5 pre_resolution=0.526

Raw metadata

{
  "source": "backfill_resolution_history.py",
  "status": "partial",
  "bayesian_v2": false,
  "outcome_prob": 0.5,
  "evidence_kind": "resolution_terminal",
  "posterior_prob": 0.5,
  "delta_to_outcome": -0.026029999999999998,
  "inside_posterior": 0.52603,
  "validation_notes": "Claude Opus 4.7 + Claude Code 42-54% share of code-gen market. Cursor $1B ARR. Anthropic 'Programmer Equivalent' benchmarks showing continued capability growth.",
  "validation_status": "hit",
  "pre_resolution_prob": 0.52603,
  "resolution_evidence": "Claude Opus 4.7 + Claude Code 42-54% share of code-gen market. Cursor $1B ARR. Anthropic 'Programmer Equivalent' benchmarks showing continued capability growth.",
  "does_not_update_current_prob": true
}

LBP · 2026-04-30T16:39:51Z · 52.6% · -3.2pp

Network propagation: 55.8% → 52.6%

5-iter LBP, residual 0.00825 · damping 0.5, w_intrinsic 0.5 · method lbp_v2 · run 0c8a4ea3

LBP · 2026-04-30T02:18:57Z · 55.8% · -4.2pp

Network propagation: 60.0% → 55.8%

5-iter LBP, residual 0.00825 · damping 0.5, w_intrinsic 0.5 · method lbp_v1 · run 592311ef

Network propagation neighbors

Top edges sorted by latest LBP cross-impact


Top incoming (parents)

Edges that influence THIS node's belief

Kind	Node	Their prob	P(c|s=T)	P(c|s=F)	Δ implied
prereq	247_058 Jury selection begins April 27, 2026 for Musk v OpenAI trial — Peter Diamandis	71.4%	0.600	0.050	-0.119
prereq	SEM_042 2025 will be the definitive year that agentic systems finall — Kevin Weil	73.8%	0.600	0.050	-0.106
prereq	SEM_012 Nvidia quadrupled chip production output while only doubling — Jensen Huang	75.0%	0.600	0.050	-0.098
killer	TK04 Macro Recession 2026-27 (Structural Deleveraging)	25.0%	0.050	0.600	-0.094
prereq	SEM_008 Training runs costing $10 billion for a single model will co — Dario Amodei	76.9%	0.600	0.050	-0.089
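To read the conditional columns in the table above, it can help to see what a single parent implies for this node on its own. The sketch below applies only the law of total probability; the function name is hypothetical, and the page does not state how the "Δ implied" column is derived, so this is not a reproduction of that column or of the LBP message passing.

```python
def implied_marginal(parent_prob: float,
                     p_child_given_true: float,
                     p_child_given_false: float) -> float:
    """P(c) = P(s) * P(c|s=T) + (1 - P(s)) * P(c|s=F) for one parent s."""
    return (parent_prob * p_child_given_true
            + (1.0 - parent_prob) * p_child_given_false)


# SEM_012 row: their prob 75.0%, P(c|s=T) = 0.600, P(c|s=F) = 0.050.
print(implied_marginal(0.75, 0.600, 0.050))

# TK04 (killer) row: their prob 25.0%, P(c|s=T) = 0.050, P(c|s=F) = 0.600.
print(implied_marginal(0.25, 0.050, 0.600))
```

Taken alone, each of these two parents implies a marginal of about 46% for this node, below its current 55.7%, which is consistent with the negative sign on the listed deltas.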

Top outgoing (children)

Predictions THIS node influences

Kind	Node	Their prob	P(c|s=T)	P(c|s=F)	Δ implied
prereq	234_036 Job displacement will be issue 6-10 not top 5 in 10 years; A — Alex Wissner-Gross	28.8%	0.450	0.050	-0.033

Ticker exposure

32 ticker(s) linked

Beneficiaries (23)

ADUS COUR DOCN FROG GTLB INOD PL ROL SPIR SRFM UDMY TEAM NFLX PLTR RDDT UBER AMZN BABA SPOT GDDY GOOGL META MSFT

Adverse (6)

RHI BXP SLG MAN KFY TNET

Prerequisites (8)

Predictions that must hit first

Type	Pred	Title	Domain	Lag
prereq	238_009	Recursive self-improvement is already happening now (no longer three years out)	AI	—
prereq	SEM_008	Training runs costing $10 billion for a single model will commence sometime in 2025.	AI	—
prereq	247_058	Jury selection begins April 27, 2026 for Musk v OpenAI trial	AI	—
prereq	SEM_012	Nvidia quadrupled chip production output while only doubling human headcount — achieved by deploying AI coding tools (Cursor, Claude Code) across engineering.	AI/Manufacturing	—
prereq	SEM_042	2025 will be the definitive year that agentic systems finally hit the mainstream.	AI/Agents	—
correlate	S_NO_AI_PAUSE_5Y	No major AI pause through 2031	ai_regulatory_pause	—
killer	TK04	Macro Recession 2026-27 (Structural Deleveraging)	—	—
killer	TK07	Labor Political Backlash (UBI Mandate / AI Tax)	—	—

Dependents (1)

Predictions enabled by this

Type	Pred	Title	Domain	Lag
prereq	234_036	Job displacement will be issue 6-10 not top 5 in 10 years; AI discoveries will dominate	Labor/Jobs	—

Expected milestones (1)

From Sheet 17 Monitoring Triggers

Expected by	Description	Status
2027-09-30	[Capability 2027-09] de / Cursor enterprise adoption metrics [SEM_049] SWE-Bench + METR long-horizon eval results; enterprise code-gen market share	pending

Validations (1)

Resolution events

Observed at	Status	By	Notes
2026-04-29	hit	thesis_timeline_v1.0_import	Claude Opus 4.7 + Claude Code 42-54% share of code-gen market. Cursor $1B ARR. Anthropic 'Programmer Equivalent' benchmarks showing continued capability growth.

Linked documents (10)

Auto-generated by cosine similarity from Polymarket / Manifold / EDGAR / GDELT

Sim	Source	Title	Market prob	Polarity	Reviewed	Published
0.730	manifold	Is ai going to be self repairing?	10%	mentions	pending	2026-05-25
0.730	arxiv	The End of Software Engineering: How AI Agents Are Fundamentally Restructuring the Software Paradigm	—	mentions	pending	2026-06-04
0.725	arxiv	AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development	—	mentions	pending	2026-05-04
0.702	arxiv	Human Oversight and Overload: Two Hidden and Costly Burdens of AI-Assisted Software Engineering	—	mentions	pending	2026-06-04
0.692	arxiv	Enhancing Software Engineering Through Closed-Loop Memory Optimization	—	mentions	pending	2026-06-04
0.684	arxiv	ProgramBench: Can Language Models Rebuild Programs From Scratch?	—	mentions	pending	2026-05-05
0.678	arxiv	Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies	—	mentions	pending	2026-05-05
0.670	arxiv	Optimizing the Cost-Quality Tradeoff of Agentic Theorem Provers in Lean	—	mentions	pending	2026-06-03
0.669	arxiv	Microskill Architecture: A Modular Skill-Driven Framework for AI-Native Code Generation	—	mentions	pending	2026-06-04
0.668	arxiv	Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency	—	mentions	pending	2026-05-28

Raw metadata

From Thesis_Timeline_v1.0_FINAL workbook

{
  "nia": false,
  "mode": "FORECAST",
  "role": "Guest-VC/Physicist",
  "context": "Wissner-Gross extrapolates hardware-software recursion (Nvidia 4x/2x) into full software-engineering automation within near-term horizon.",
  "to_year": 2028,
  "conv_cues": "will soon; iterative self-improvement",
  "direction": "HAPPEN",
  "from_year": 2026,
  "timeframe": "near-term",
  "conv_level": "HIGH",
  "milestones": [
    {
      "kind": "llm_pre_event",
      "label": "Claude Opus 4.7 hits 87.6% on SWE-Bench Verified",
      "notes": "HIT — Claude Opus 4.7 leads SWE-Bench Verified at 87.6% with 1M context.",
      "source": "https://tokenmix.ai/blog/swe-bench-2026-claude-opus-4-7-wins",
      "status": "hit",
      "weight": 0.4,
      "ordinal": -7,
      "source_id": null,
      "confidence": 0.95,
      "source_url": "https://tokenmix.ai/blog/swe-bench-2026-claude-opus-4-7-wins",
      "expected_date": "2026-04-16",
      "observed_date": "2026-04-16",
      "research_origin": "deep_research",
      "measurement_criterion": "Anthropic Claude Opus 4.7 releases with >=85% SWE-Bench Verified score"
    },
    {
      "kind": "llm_pre_event",
      "label": "OpenAI deprecates SWE-Bench Verified as too contaminated",
      "notes": "HIT — OpenAI explicitly retired SWE-Bench Verified, recommending SWE-Bench Pro.",
      "source": "https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/",
      "status": "hit",
      "weight": 0.4,
      "ordinal": -6,
      "source_id": null,
      "confidence": 0.95,
      "source_url": "https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/",
      "expected_date": "2026-04-16",
      "observed_date": "2026-04-30",
      "research_origin": "deep_research",
      "expected_date_range": {
        "to": "2026-06-30",
        "from": "2026-02-01"
      },
      "measurement_criterion": "OpenAI publishes statement that SWE-Bench Verified is no longer the frontier coding benchmark of record"
    },
    {
      "kind": "prereq",
      "label": "Nvidia quadrupled chip production output while only doubling human headcount — achieved by deploying AI coding tools (Cursor, Claude Code) a",
      "status": "hit",
      "weight": 0.5,
      "ordinal": -5,
      "source_id": "SEM_012",
      "expected_date": "2026-04-29",
      "observed_date": "2026-04-29"
    },
    {
      "kind": "prereq",
      "label": "Training runs costing $10 billion for a single model will commence sometime in 2025.",
      "status": "hit",
      "weight": 0.5,
      "ordinal": -4,
      "source_id": "SEM_008",
      "expected_date": "2026-04-29",
      "observed_date": "2026-04-29"
    },
    {
      "kind": "prereq",
      "label": "2025 will be the definitive year that agentic systems finally hit the mainstream.",
      "status": "hit",
      "weight": 0.5,
      "ordinal": -3,
      "source_id": "SEM_042",
      "expected_date": "2026-04-29",
      "observed_date": "2026-04-29"
    },
    {
      "kind": "prereq",
      "label": "Jury selection begins April 27, 2026 for Musk v OpenAI trial",
      "status": "hit",
      "weight": 0.5,
      "ordinal": -2,
      "source_id": "247_058",
      "expected_date": "2026-04-29",
      "observed_date": "2026-04-29"
    },
    {
      "kind": "prereq",
      "label": "Recursive self-improvement is already happening now (no longer three years out)",
      "status": "hit",
      "weight": 0.5,
      "ordinal": -1,
      "source_id": "238_009",
      "expected_date": "2026-04-29",
      "observed_date": "2026-04-29"
    },
    {
      "kind": "event",
      "label": "AI will soon fully automate software engineering, achieving massive cost reductions via iterative self-improvement.",
      "status": "partial",
      "weight": 1,
      "ordinal": 0,
      "source_id": "SEM_049",
      "expected_date": "2026-05-01",
      "observed_date": "2026-05-01"
    },
    {
      "kind": "llm_pre_event",
      "label": "Claude Mythos Preview crosses 90% on SWE-Bench Verified",

... (truncated)