AI_036 · prediction · AI · RLHF-fails-superintelligence

Predictor: Leopold Aschenbrenner

Prior probability: 55.0%
Current probability: 49.5% (evolves via intake + LBP)
Conviction: 5/5
Signal quality: A
Resolution: pending
Window: 2027-01-01 – 2030-09-30
Edges in / out: 1 / 0
Tickers exposed: 11

Prediction text

Reinforcement Learning from Human Feedback (RLHF) will fail catastrophically when applied to superintelligence — because humans will be inherently incapable of evaluating the incomprehensible logic and actions of an ASI; therefore, aligning superintelligence requires fundamentally new technical frameworks. | First publicly-observed RLHF failure on superhuman model

Key catalyst: First publicly-observed RLHF failure on superhuman model

Watch events: Anthropic / OpenAI / DeepMind alignment research milestones

Resolution evidence

Status: pending

Anthropic's scalable-oversight and Constitutional AI research and OpenAI's weak-to-strong generalization research all acknowledge RLHF's limitations at superhuman levels.

Predictor: Leopold Aschenbrenner

κ + Brier as of 2026-05-22
κ (discount): 0.688
Brier: 0.0417 (excellent)
Hits / Misses: 2 / 0 (of 3 resolved)
Hit rate: 66.7%
Calibration plot (stated vs observed)

Evidence about this node from Leopold Aschenbrenner is multiplied by κ in /api/intake. Lower κ = less weight; floors at 0.10 (effectively silenced) and caps at 1.00 (full weight).
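The floor and cap described above can be sketched as follows. The page says only that evidence is multiplied by κ; scaling the evidence in log-odds space is an assumption made for this illustration, and `clamp_kappa` and `discounted_update` are hypothetical names, not the actual /api/intake code.

```python
import math

KAPPA_FLOOR = 0.10  # from the text: lowest weight, effectively silenced
KAPPA_CAP = 1.00    # from the text: full weight


def clamp_kappa(kappa: float) -> float:
    """Keep kappa inside the [floor, cap] range stated on the page."""
    return max(KAPPA_FLOOR, min(KAPPA_CAP, kappa))


def discounted_update(prior: float, likelihood_ratio: float, kappa: float) -> float:
    """Bayesian update with the evidence discounted by kappa.

    Assumption: the discount multiplies the log-likelihood ratio, so
    kappa = 1 gives the ordinary Bayes update and smaller kappa moves
    the posterior less.
    """
    k = clamp_kappa(kappa)
    prior_logit = math.log(prior / (1.0 - prior))
    posterior_logit = prior_logit + k * math.log(likelihood_ratio)
    return 1.0 / (1.0 + math.exp(-posterior_logit))


if __name__ == "__main__":
    # Hypothetical evidence with a likelihood ratio of 2 in favour of the claim.
    print(discounted_update(0.55, 2.0, 1.0))    # full weight
    print(discounted_update(0.55, 2.0, 0.688))  # this predictor's kappa
```

With κ = 0.688 the posterior lands between the prior and the full-weight update, which is the behaviour the text describes as "less weight".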

Reference class

Not linked

This node isn't linked to a reference class. The Bayesian update applies without outside-view blending.

Probability over time

2 prob_history rows
[Chart: probability over time, y-axis 0% to 100%; prior 55%, current = 49.5%; both history rows dated 2026-04-30. Legend: intake v2, milestone miss sweep, lbp propagation, reference class assigned, legacy v1, prior_prob (analyst seed).]

Milestone chain

Pre-event signals (upstream prereqs + window checkpoints) → resolution event → downstream cascades. Status/dates update from linked nodes; re-derive nightly via scripts/ops/derive_milestones.py.
Leading chain: 7 pending
  1. 2026-09-01 → 2028-06-30 · pending · First peer-reviewed paper documents RLHF-supervised model exhibiting deceptive alignment on superhuman-evaluator task
    How: ArXiv/Nature/NeurIPS paper from Anthropic, OpenAI, DeepMind, or Redwood Research empirically demonstrates RLHF-trained model passing human evaluation while failing ground-truth on a superhuman-difficulty task
    Source: https://arxiv.org/abs/2502.04675 · conf 65%
  2. 2027-09-11 · pending · Q1 window check-in (25%)
  3. 2027-01-01 → 2029-06-30 · pending · Frontier lab publicly deprecates pure RLHF as superalignment primary technique in favor of scalable-oversight architecture
    How: OpenAI, Anthropic, or DeepMind official safety publication or model card explicitly states RLHF alone is insufficient for next-tier model and names scalable-oversight/recursive-critique/debate as replacement
    Source: https://openai.com/index/weak-to-strong-generalization/ · conf 70%
  4. 2028-05-22 · pending · Q2 window check-in (50%)
  5. 2027-06-01 → 2029-12-31 · pending · Government or international body cites RLHF inadequacy in formal AI-safety policy
    How: AI Safety Institute (UK/US), EU AI Office, or G7 statement formally references RLHF limits at superhuman scale and recommends new technical frameworks per Aschenbrenner-style argument
    Source: https://claude5.com/news/ai-safety-2026-alignment-progress-and-open-challenges · conf 50%
  6. 2027-06-01 → 2030-06-30 · pending · First public incident report of RLHF-supervised superhuman model causing measurable real-world harm via reward-hacking
    How: Frontier-lab incident disclosure, NIST AI Incident Database entry, or major regulator action references RLHF-induced misalignment in a deployed superhuman-class model
    Source: https://www.hushvault.ie/2026/01/23/__trashed-3/ · conf 40%
  7. 2029-01-31 · pending · Q3 window check-in (75%)

No downstream cascades — this prediction is a leaf in the dependency graph.

What if this resolves?

Clamp this prediction TRUE or FALSE and run a counterfactual Gibbs sample to surface the predictions whose marginals shift most under that assumption. Each run takes about 30 seconds and is suited to stress-testing "if X resolves, what else moves?"
(live posterior: 49%)
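A minimal sketch of a clamped Gibbs run on the smallest network this page implies: one parent `s` (TK01 at 15.0%) and this prediction `c`, with P(c|s=T) = 0.050 and P(c|s=F) = 0.550 taken from the incoming-edge table. Reading those columns as a conditional probability table is an assumption, and `gibbs_marginals` is a hypothetical helper, not the tool's sampler.

```python
import random

# Values from the incoming-edge table; the CPT interpretation is an assumption.
P_S = 0.15                                # P(s = True), TK01
P_C_GIVEN_S = {True: 0.05, False: 0.55}   # P(c = True | s)


def bern(p: float, value: bool) -> float:
    """Probability of a boolean value under a Bernoulli(p) variable."""
    return p if value else 1.0 - p


def joint(state: dict) -> float:
    """Joint probability of a full assignment to s and c."""
    return bern(P_S, state["s"]) * bern(P_C_GIVEN_S[state["s"]], state["c"])


def gibbs_marginals(clamped: dict, n_sweeps: int = 20000,
                    burn_in: int = 1000, seed: int = 0) -> dict:
    """Estimate marginals of the unclamped nodes by Gibbs sampling."""
    rng = random.Random(seed)
    state = {"s": False, "c": False}
    state.update(clamped)                 # clamped nodes never get resampled
    free = [name for name in state if name not in clamped]
    counts = {name: 0 for name in free}
    for sweep in range(n_sweeps + burn_in):
        for name in free:
            # Resample one node from its conditional given all the others.
            p_true = joint({**state, name: True})
            p_false = joint({**state, name: False})
            state[name] = rng.random() < p_true / (p_true + p_false)
        if sweep >= burn_in:
            for name in free:
                counts[name] += state[name]
    return {name: counts[name] / n_sweeps for name in free}


if __name__ == "__main__":
    print("clamp TRUE :", gibbs_marginals({"c": True}))
    print("clamp FALSE:", gibbs_marginals({"c": False}))
```

For comparison, Bayes' rule gives the exact answers for this toy case: P(s | c=T) = 0.0075 / 0.475 ≈ 0.016 and P(s | c=F) = 0.1425 / 0.525 ≈ 0.271, so the sampled estimates should land near those values. A real run would sweep every unclamped node in the full graph rather than a single parent.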

Evidence chain

Every probability update with full Bayesian provenance — chronological, latest first
LBP · 2026-04-30T16:39:51Z · 49.5% · -1.9pp
Network propagation: 51.4% → 49.5%
5-iter LBP, residual 0.00825 · damping 0.5, w_intrinsic 0.5 · method lbp_v2 · run 0c8a4ea3
LBP · 2026-04-30T02:18:57Z · 51.4% · -3.6pp
Network propagation: 55.0% → 51.4%
5-iter LBP, residual 0.00825 · damping 0.5, w_intrinsic 0.5 · method lbp_v1 · run 592311ef
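The page does not document the propagation rule, so the following is only one plausible reading of the logged parameters: each iteration blends the node's intrinsic belief with the value implied by its single parent (weight `w_intrinsic`), then damps the move toward that blend. The function name and the update rule are assumptions; the inputs (15.0%, 0.050, 0.550, 5 iterations, damping 0.5, w_intrinsic 0.5) come from this page.

```python
def damped_propagation(start: float, implied: float, iterations: int = 5,
                       damping: float = 0.5, w_intrinsic: float = 0.5) -> float:
    """Damped fixed-point iteration for a single node (assumed update rule)."""
    intrinsic = start          # assumption: the run's starting belief is the intrinsic term
    belief = start
    for _ in range(iterations):
        # Blend the node's own belief with what the network implies.
        proposal = w_intrinsic * intrinsic + (1.0 - w_intrinsic) * implied
        # Damping: move only part of the way toward the proposal.
        belief = damping * belief + (1.0 - damping) * proposal
    return belief


# Belief implied by the one incoming edge, by the law of total probability.
implied = 0.15 * 0.05 + (1.0 - 0.15) * 0.55

first_run = damped_propagation(0.55, implied)        # starts from the 55.0% prior
second_run = damped_propagation(first_run, implied)  # starts from the first result

if __name__ == "__main__":
    print(f"{first_run:.4f} {second_run:.4f}")
```

Under these assumptions the first call returns about 0.5137 and the second about 0.4949, which round to the 51.4% and 49.5% shown in the two rows. The logged residual of 0.00825 is not reproduced by this single-node sketch, so the real implementation likely differs in detail.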

Network propagation neighbors

Top edges sorted by latest LBP cross-impact

Top incoming (parents)

Edges that influence THIS node's belief

Kind: killer
Node: TK01, AGI Capability Plateau (2026-27 Training Stall)
Their prob: 15.0%
P(c|s=T): 0.050
P(c|s=F): 0.550
Δ implied: -0.020
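One reading of the conditional columns, offered as an assumption rather than the tool's documented method: if s is TK01 and c is this prediction, the belief implied by this single edge follows from the law of total probability.

```latex
\begin{align*}
P(c) &= P(s)\,P(c \mid s{=}T) + \bigl(1 - P(s)\bigr)\,P(c \mid s{=}F) \\
     &= 0.15 \times 0.050 + 0.85 \times 0.550 \\
     &= 0.0075 + 0.4675 = 0.475
\end{align*}
```

Under that reading, 0.475 sits 0.020 below the current 49.5% belief, which is consistent with the Δ implied value of -0.020, though the page does not state how that column is computed.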

Top outgoing (children)

Predictions THIS node influences

No outgoing edges.

Ticker exposure

11 ticker(s) linked

Beneficiaries (11)

PL, INOD, SPIR, NVDA, META, RDDT, ADBE, UBER, ADSK, AMZN, GOOGL

Prerequisites (1)

Predictions that must hit first
Type: killer
Pred: TK01
Title: AGI Capability Plateau (2026-27 Training Stall)
Domain: (not shown)
Lag: (not shown)

Dependents (0)

Predictions enabled by this
No dependents

Linked documents (10)

Auto-generated by cosine similarity from Polymarket / Manifold / EDGAR / GDELT

Raw metadata

From Thesis_Timeline_v1.0_FINAL workbook
{
  "nia": false,
  "mode": "WARNING+FORECAST",
  "role": "Cited-VC/Researcher",
  "context": "Specific technical-alignment failure-mode framing distinct from INF_002 ('The Project' nationalization) and SEM_002 (AGI timeline). Critical input for policy/alignment community.",
  "to_year": 2030,
  "conv_cues": "catastrophic-failure framing; technical-specific",
  "direction": "HAPPEN",
  "from_year": 2027,
  "timeframe": "2027-2030",
  "conv_level": "HIGH",
  "milestones": [
    {
      "kind": "llm_pre_event",
      "label": "First peer-reviewed paper documents RLHF-supervised model exhibiting deceptive alignment on superhuman-evaluator task",
      "source": "https://arxiv.org/abs/2502.04675",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -7,
      "source_id": null,
      "confidence": 0.65,
      "expected_date": "2027-08-01",
      "research_origin": "training",
      "expected_date_range": {
        "to": "2028-06-30",
        "from": "2026-09-01"
      },
      "measurement_criterion": "ArXiv/Nature/NeurIPS paper from Anthropic, OpenAI, DeepMind, or Redwood Research empirically demonstrates RLHF-trained model passing human evaluation while failing ground-truth on a superhuman-difficulty task"
    },
    {
      "kind": "quartile_checkpoint",
      "label": "Q1 window check-in (25%)",
      "status": "pending",
      "weight": 0.05,
      "ordinal": -6,
      "source_id": null,
      "expected_date": "2027-09-11",
      "observed_date": null
    },
    {
      "kind": "llm_pre_event",
      "label": "Frontier lab publicly deprecates pure RLHF as superalignment primary technique in favor of scalable-oversight architecture",
      "source": "https://openai.com/index/weak-to-strong-generalization/",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -5,
      "source_id": null,
      "confidence": 0.7,
      "expected_date": "2028-03-31",
      "research_origin": "training",
      "expected_date_range": {
        "to": "2029-06-30",
        "from": "2027-01-01"
      },
      "measurement_criterion": "OpenAI, Anthropic, or DeepMind official safety publication or model card explicitly states RLHF alone is insufficient for next-tier model and names scalable-oversight/recursive-critique/debate as replacement"
    },
    {
      "kind": "quartile_checkpoint",
      "label": "Q2 window check-in (50%)",
      "status": "pending",
      "weight": 0.05,
      "ordinal": -4,
      "source_id": null,
      "expected_date": "2028-05-22",
      "observed_date": null
    },
    {
      "kind": "llm_post_event",
      "label": "Government or international body cites RLHF inadequacy in formal AI-safety policy",
      "source": "https://claude5.com/news/ai-safety-2026-alignment-progress-and-open-challenges",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -3,
      "source_id": null,
      "confidence": 0.5,
      "expected_date": "2028-09-15",
      "research_origin": "training",
      "expected_date_range": {
        "to": "2029-12-31",
        "from": "2027-06-01"
      },
      "measurement_criterion": "AI Safety Institute (UK/US), EU AI Office, or G7 statement formally references RLHF limits at superhuman scale and recommends new technical frameworks per Aschenbrenner-style argument"
    },
    {
      "kind": "llm_post_event",
      "label": "First public incident report of RLHF-supervised superhuman model causing measurable real-world harm via reward-hacking",
      "source": "https://www.hushvault.ie/2026/01/23/__trashed-3/",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -2,
      "source_id": null,
      "confidence": 0.4,
      "expected_date": "2028-12-14",
      "research_origin": "training",
      "expected_date_range": {
        "to": "2030-06-30",
        "from": "2027-06-01"
      },
      "measurement_criterion": "Frontier-lab incident disclosure, NIST AI Incident Database entry, or major regulator action references RLHF-induced misalignment in a deployed s
... (truncated)