AI_036 · prediction · AI · RLHF-fails-superintelligence

Reinforcement Learning from Human Feedback (RLHF) will fail catastrophically when applied to superintelligence — because humans will be inherently incapable of evaluating the incomprehensible logic and actions of an ASI; therefore, aligning superintell...

Predictor: Leopold Aschenbrenner

Prior probability

55.0%

Current probability

49.5%

evolves via intake + LBP

Conviction

5/5

Signal quality

Resolution

pending

Window

2027-01-01 – 2030-09-30

Edges in / out

1 / 0

Tickers exposed

Prediction text

Reinforcement Learning from Human Feedback (RLHF) will fail catastrophically when applied to superintelligence — because humans will be inherently incapable of evaluating the incomprehensible logic and actions of an ASI; therefore, aligning superintelligence requires fundamentally new technical frameworks. | First publicly-observed RLHF failure on superhuman model

Key catalyst: First publicly-observed RLHF failure on superhuman model

Watch events: Anthropic / OpenAI / DeepMind alignment research milestones

Resolution evidence

Status: pending

Anthropic scalable-oversight research, constitutional AI, weak-to-strong generalization research all acknowledge RLHF limitations at superhuman levels.

Predictor: Leopold Aschenbrenner

κ + Brier as of 2026-05-22


κ (discount)

0.688

Brier

0.0417

excellent

Hits / Misses

2 / 0

of 3 resolved

Hit rate

66.7%

Calibration plot (stated vs observed)

Evidence about this node from Leopold Aschenbrenner is multiplied by κ in /api/intake. Lower κ = less weight; floors at 0.10 (effectively silenced) and caps at 1.00 (full weight).
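The κ weighting described above can be sketched as follows. The floor (0.10) and cap (1.00) come from the text; how /api/intake actually applies the multiplication is not stated on this page, so scaling the log-likelihood ratio is an assumption, and the likelihood ratio of 2.0 is a hypothetical input chosen only for illustration.

```python
import math

KAPPA_FLOOR = 0.10  # from the text: effectively silenced
KAPPA_CAP = 1.00    # from the text: full weight


def clamp_kappa(kappa: float) -> float:
    """Keep kappa inside the [floor, cap] range stated on the page."""
    return max(KAPPA_FLOOR, min(KAPPA_CAP, kappa))


def discounted_update(prior: float, likelihood_ratio: float, kappa: float) -> float:
    """Shift the prior in log-odds space by kappa * log(LR).

    ASSUMPTION: the real intake endpoint may apply kappa differently.
    """
    k = clamp_kappa(kappa)
    logit = math.log(prior / (1.0 - prior)) + k * math.log(likelihood_ratio)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical evidence with LR = 2.0 against the 55.0% prior.
full_weight = discounted_update(0.55, 2.0, 1.0)
discounted = discounted_update(0.55, 2.0, 0.688)  # kappa shown for this predictor
print(f"full weight: {full_weight:.3f}, kappa-discounted: {discounted:.3f}")
```

Under this reading, the same piece of evidence moves the node less when κ is below 1, and a κ below the floor is treated as 0.10 rather than zero.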

Reference class

Not linked

This node isn't linked to a reference class. The Bayesian update applies without outside-view blending.

Probability over time

2 prob_history rows

Chart legend: intake v2 · milestone miss sweep · lbp propagation · reference class assigned · legacy v1 · prior_prob (analyst seed) · current = 49.5%

Milestone chain

Pre-event signals (upstream prereqs + window checkpoints) → resolution event → downstream cascades. Status/dates update from linked nodes; re-derive nightly via scripts/ops/derive_milestones.py.

Leading chain: 7 pending

2026-09-01 → 2028-06-30 · pending · First peer-reviewed paper documents RLHF-supervised model exhibiting deceptive alignment on superhuman-evaluator task
How: ArXiv/Nature/NeurIPS paper from Anthropic, OpenAI, DeepMind, or Redwood Research empirically demonstrates RLHF-trained model passing human evaluation while failing ground-truth on a superhuman-difficulty task
Source: https://arxiv.org/abs/2502.04675 · conf 65%

2027-09-11 · pending · Q1 window check-in (25%)

2027-01-01 → 2029-06-30 · pending · Frontier lab publicly deprecates pure RLHF as superalignment primary technique in favor of scalable-oversight architecture
How: OpenAI, Anthropic, or DeepMind official safety publication or model card explicitly states RLHF alone is insufficient for next-tier model and names scalable-oversight/recursive-critique/debate as replacement
Source: https://openai.com/index/weak-to-strong-generalization/ · conf 70%

2028-05-22 · pending · Q2 window check-in (50%)

2027-06-01 → 2029-12-31 · pending · Government or international body cites RLHF inadequacy in formal AI-safety policy
How: AI Safety Institute (UK/US), EU AI Office, or G7 statement formally references RLHF limits at superhuman scale and recommends new technical frameworks per Aschenbrenner-style argument
Source: https://claude5.com/news/ai-safety-2026-alignment-progress-and-open-challenges · conf 50%

2027-06-01 → 2030-06-30 · pending · First public incident report of RLHF-supervised superhuman model causing measurable real-world harm via reward-hacking
How: Frontier-lab incident disclosure, NIST AI Incident Database entry, or major regulator action references RLHF-induced misalignment in a deployed superhuman-class model
Source: https://www.hushvault.ie/2026/01/23/__trashed-3/ · conf 40%

2029-01-31 · pending · Q3 window check-in (75%)

2029-10-12 · pending · Reinforcement Learning from Human Feedback (RLHF) will fail catastrophically when applied to superintelligence — because humans will be inhe

No downstream cascades — this prediction is a leaf in the dependency graph.
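The three window check-in dates in the chain above appear consistent with evenly spaced checkpoints between the window start (2027-01-01) and the date on the final resolution row (2029-10-12). The sketch below reproduces them under that assumption; it is not the logic of scripts/ops/derive_milestones.py, which is not shown on this page, and the function name and the floor-to-whole-days rounding are hypothetical.

```python
from datetime import date, timedelta

window_start = date(2027, 1, 1)            # start of the Window field
expected_resolution = date(2029, 10, 12)   # date on the final chain row


def quartile_checkpoints(start: date, end: date,
                         fractions=(0.25, 0.50, 0.75)) -> list[date]:
    """Place check-ins at the given fractions of the span, floored to whole days.

    ASSUMPTION: flooring is a guess that happens to match the listed dates.
    """
    total_days = (end - start).days
    return [start + timedelta(days=int(total_days * f)) for f in fractions]


checkins = quartile_checkpoints(window_start, expected_resolution)
for label, d in zip(("Q1 (25%)", "Q2 (50%)", "Q3 (75%)"), checkins):
    print(label, d.isoformat())
```

Note that the endpoint used here is the expected resolution date, not the window close of 2030-09-30; using the window close would give different check-in dates than those listed.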

What if this resolves?

Clamp this prediction TRUE or FALSE and run a counterfactual Gibbs sample (about 30 s per run). The run returns the predictions whose marginals shift most under that assumption, which makes it suited to stress-testing "if X resolves, what else moves?"

(live posterior: 49%)
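As a minimal sketch of what clamping means, the snippet below runs a Gibbs sampler on a two-node network built from the single incoming edge listed on this page (TK01 at 15.0%, with P(c|s=T) = 0.050 and P(c|s=F) = 0.550). This node is held fixed at the clamped value while the unclamped node is resampled from its full conditional. The real tool presumably samples the whole prediction graph; the function names, sweep counts, and seed here are illustrative assumptions.

```python
import random

P_S = 0.15                                # TK01 probability from the parents table
P_C_GIVEN_S = {True: 0.05, False: 0.55}   # P(this node | TK01) from the same row


def joint(s: bool, c: bool) -> float:
    """Joint probability of (TK01 = s, this node = c)."""
    ps = P_S if s else 1.0 - P_S
    pc = P_C_GIVEN_S[s] if c else 1.0 - P_C_GIVEN_S[s]
    return ps * pc


def gibbs_parent_marginal(clamp_c: bool, sweeps: int = 20000,
                          burn_in: int = 1000, seed: int = 0) -> float:
    """Estimate P(TK01 = True | this node clamped to clamp_c)."""
    rng = random.Random(seed)
    s = rng.random() < P_S
    hits = 0
    for i in range(sweeps + burn_in):
        # Resample the only free variable from its full conditional.
        w_true = joint(True, clamp_c)
        w_false = joint(False, clamp_c)
        s = rng.random() < w_true / (w_true + w_false)
        if i >= burn_in and s:
            hits += 1
    return hits / sweeps


if_true = gibbs_parent_marginal(clamp_c=True)
if_false = gibbs_parent_marginal(clamp_c=False)
print(f"P(TK01 | clamped TRUE)  ~ {if_true:.3f}")
print(f"P(TK01 | clamped FALSE) ~ {if_false:.3f}")
```

For this toy network the exact answers are 0.0075 / 0.475 ≈ 1.6% when clamped TRUE and 0.1425 / 0.525 ≈ 27.1% when clamped FALSE, against an unclamped 15.0%, so the sampled estimates should land close to those values.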

Evidence chain

Every probability update with full Bayesian provenance — chronological, latest first

LBP · 2026-04-30T16:39:51Z · 49.5% · -1.9pp

Network propagation: 51.4% → 49.5%

5-iter LBP, residual 0.00825 · damping 0.5, w_intrinsic 0.5 · method lbp_v2 · run 0c8a4ea3

LBP · 2026-04-30T02:18:57Z · 51.4% · -3.6pp

Network propagation: 55.0% → 51.4%

5-iter LBP, residual 0.00825 · damping 0.5, w_intrinsic 0.5 · method lbp_v1 · run 592311ef
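One simplified reading of the two updates above is a damped fixed-point blend rather than full loopy belief propagation: the node's starting belief is mixed with the value implied by its single listed parent (TK01 at 15.0%, P(c|s=T) = 0.050, P(c|s=F) = 0.550) using w_intrinsic, and each of the 5 iterations moves halfway toward that target. This is an assumption about how the displayed parameters combine, not the site's lbp_v1 or lbp_v2 code, and all function names are hypothetical. Under these assumptions it happens to reproduce both displayed results to one decimal place.

```python
def implied_from_parent(p_parent: float, p_c_true: float, p_c_false: float) -> float:
    """Marginal of the child implied by one parent edge."""
    return p_parent * p_c_true + (1.0 - p_parent) * p_c_false


def damped_propagation(p_start: float, implied: float, iters: int = 5,
                       damping: float = 0.5, w_intrinsic: float = 0.5) -> float:
    """ASSUMED update rule: blend intrinsic belief with the implied value,
    then step toward that blend with damping on every iteration."""
    target = w_intrinsic * p_start + (1.0 - w_intrinsic) * implied
    p = p_start
    for _ in range(iters):
        p = (1.0 - damping) * p + damping * target
    return p


implied = implied_from_parent(0.15, 0.050, 0.550)
first_run = damped_propagation(0.55, implied)                   # starts at the 55.0% prior
second_run = damped_propagation(round(first_run, 3), implied)   # starts at the displayed 51.4%
print(f"run 1: {first_run * 100:.1f}%  run 2: {second_run * 100:.1f}%")
```

The residual of 0.00825 reported for both runs is not reproduced by this sketch and may be a network-wide quantity.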

Network propagation neighbors

Top edges sorted by latest LBP cross-impact


Top incoming (parents)

Edges that influence THIS node's belief

Kind	Node	Their prob	P(c|s=T)	P(c|s=F)	Δ implied
killer	TK01 AGI Capability Plateau (2026-27 Training Stall)	15.0%	0.050	0.550	-0.020
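The Δ implied figure in this row can be checked by hand. Assuming it is the child marginal implied by the parent edge minus this node's current probability (a definition the page does not state), the arithmetic works out to the displayed -0.020:

```python
def implied_child_prob(p_parent: float, p_c_true: float, p_c_false: float) -> float:
    """Law of total probability over the parent's two states."""
    return p_parent * p_c_true + (1.0 - p_parent) * p_c_false


current_prob = 0.495                                  # current probability of this node
implied = implied_child_prob(0.15, 0.050, 0.550)      # 0.15*0.05 + 0.85*0.55
delta_implied = implied - current_prob                # ASSUMED definition of the column
print(f"implied = {implied:.3f}, delta = {delta_implied:+.3f}")
```

In words: TK01 alone would put this node at 47.5%, two points below the current 49.5%, which is why a "killer" edge with a low parent probability still exerts a small downward pull.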

Top outgoing (children)

Predictions THIS node influences

No outgoing edges.

Ticker exposure

11 ticker(s) linked

Beneficiaries (11)

PL INOD SPIR NVDA META RDDT ADBE UBER ADSK AMZN GOOGL

Prerequisites (1)

Predictions that must hit first

Type	Pred	Title	Domain	Lag
killer	TK01	AGI Capability Plateau (2026-27 Training Stall)	—	—

Dependents (0)

Predictions enabled by this

Type	Pred	Title	Domain	Lag
No dependents

Linked documents (10)

Auto-generated by cosine similarity from Polymarket / Manifold / EDGAR / GDELT

Sim	Source	Title	Market prob	Polarity	Reviewed	Published
0.736	arxiv	Automated alignment is harder than you think	—	mentions	pending	2026-05-07
0.708	arxiv	Towards Healthy Evolution: Exploring the Role and Mechanisms of Human-Agent Interaction in Self-Evolving Systems	—	mentions	pending	2026-06-04
0.701	arxiv	Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning	—	mentions	pending	2026-05-07
0.698	arxiv	Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges	—	mentions	pending	2026-06-03
0.697	arxiv	The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure	—	mentions	pending	2026-05-04
0.696	arxiv	Human-in-the-Loop Uncertainty Analysis in Self-Adaptive Robots Using LLMs	—	mentions	pending	2026-05-04
0.692	arxiv	Reinforcement Learning from Rich Feedback with Distributional DAgger	—	mentions	pending	2026-06-03
0.689	arxiv	Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes	—	mentions	pending	2026-05-07
0.687	arxiv	Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy	—	mentions	pending	2026-05-13
0.686	arxiv	CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback	—	mentions	pending	2026-06-01

Raw metadata

From Thesis_Timeline_v1.0_FINAL workbook

{
  "nia": false,
  "mode": "WARNING+FORECAST",
  "role": "Cited-VC/Researcher",
  "context": "Specific technical-alignment failure-mode framing distinct from INF_002 ('The Project' nationalization) and SEM_002 (AGI timeline). Critical input for policy/alignment community.",
  "to_year": 2030,
  "conv_cues": "catastrophic-failure framing; technical-specific",
  "direction": "HAPPEN",
  "from_year": 2027,
  "timeframe": "2027-2030",
  "conv_level": "HIGH",
  "milestones": [
    {
      "kind": "llm_pre_event",
      "label": "First peer-reviewed paper documents RLHF-supervised model exhibiting deceptive alignment on superhuman-evaluator task",
      "source": "https://arxiv.org/abs/2502.04675",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -7,
      "source_id": null,
      "confidence": 0.65,
      "expected_date": "2027-08-01",
      "research_origin": "training",
      "expected_date_range": {
        "to": "2028-06-30",
        "from": "2026-09-01"
      },
      "measurement_criterion": "ArXiv/Nature/NeurIPS paper from Anthropic, OpenAI, DeepMind, or Redwood Research empirically demonstrates RLHF-trained model passing human evaluation while failing ground-truth on a superhuman-difficulty task"
    },
    {
      "kind": "quartile_checkpoint",
      "label": "Q1 window check-in (25%)",
      "status": "pending",
      "weight": 0.05,
      "ordinal": -6,
      "source_id": null,
      "expected_date": "2027-09-11",
      "observed_date": null
    },
    {
      "kind": "llm_pre_event",
      "label": "Frontier lab publicly deprecates pure RLHF as superalignment primary technique in favor of scalable-oversight architecture",
      "source": "https://openai.com/index/weak-to-strong-generalization/",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -5,
      "source_id": null,
      "confidence": 0.7,
      "expected_date": "2028-03-31",
      "research_origin": "training",
      "expected_date_range": {
        "to": "2029-06-30",
        "from": "2027-01-01"
      },
      "measurement_criterion": "OpenAI, Anthropic, or DeepMind official safety publication or model card explicitly states RLHF alone is insufficient for next-tier model and names scalable-oversight/recursive-critique/debate as replacement"
    },
    {
      "kind": "quartile_checkpoint",
      "label": "Q2 window check-in (50%)",
      "status": "pending",
      "weight": 0.05,
      "ordinal": -4,
      "source_id": null,
      "expected_date": "2028-05-22",
      "observed_date": null
    },
    {
      "kind": "llm_post_event",
      "label": "Government or international body cites RLHF inadequacy in formal AI-safety policy",
      "source": "https://claude5.com/news/ai-safety-2026-alignment-progress-and-open-challenges",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -3,
      "source_id": null,
      "confidence": 0.5,
      "expected_date": "2028-09-15",
      "research_origin": "training",
      "expected_date_range": {
        "to": "2029-12-31",
        "from": "2027-06-01"
      },
      "measurement_criterion": "AI Safety Institute (UK/US), EU AI Office, or G7 statement formally references RLHF limits at superhuman scale and recommends new technical frameworks per Aschenbrenner-style argument"
    },
    {
      "kind": "llm_post_event",
      "label": "First public incident report of RLHF-supervised superhuman model causing measurable real-world harm via reward-hacking",
      "source": "https://www.hushvault.ie/2026/01/23/__trashed-3/",
      "status": "pending",
      "weight": 0.4,
      "ordinal": -2,
      "source_id": null,
      "confidence": 0.4,
      "expected_date": "2028-12-14",
      "research_origin": "training",
      "expected_date_range": {
        "to": "2030-06-30",
        "from": "2027-06-01"
      },
      "measurement_criterion": "Frontier-lab incident disclosure, NIST AI Incident Database entry, or major regulator action references RLHF-induced misalignment in a deployed s
... (truncated)