Reinforcement Learning from Human Feedback (RLHF) will fail catastrophically when applied to superintelligence — because humans will be inherently incapable of evaluating the incomprehensible logic and actions of an ASI; therefore, aligning superintell...
Predictor: Leopold Aschenbrenner
Prediction text
Reinforcement Learning from Human Feedback (RLHF) will fail catastrophically when applied to superintelligence, because humans will be inherently incapable of evaluating the incomprehensible logic and actions of an ASI; therefore, aligning superintelligence requires fundamentally new technical frameworks.
Key catalyst: First publicly-observed RLHF failure on superhuman model
Watch events: Anthropic / OpenAI / DeepMind alignment research milestones
Resolution evidence
Anthropic's scalable-oversight research, constitutional AI, and weak-to-strong generalization research all acknowledge RLHF limitations at superhuman levels.
Calibration plot (stated vs observed)
Evidence about this node from Leopold Aschenbrenner is multiplied by κ in /api/intake. Lower κ = less weight; floors at 0.10 (effectively silenced) and caps at 1.00 (full weight).
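The κ weighting described above can be sketched roughly as follows. The floor (0.10) and cap (1.00) come from the text; everything else is an assumption for illustration: the page does not say what form "evidence" takes or how `/api/intake` applies the multiplication, so this sketch treats evidence as a log-likelihood ratio scaled by κ before a log-odds update. The function names `clamp_kappa` and `weighted_update` are hypothetical.

```python
import math

KAPPA_FLOOR = 0.10  # from the page: effectively silenced
KAPPA_CAP = 1.00    # from the page: full weight


def clamp_kappa(kappa: float) -> float:
    """Keep kappa inside the [floor, cap] range stated on the page."""
    return min(max(kappa, KAPPA_FLOOR), KAPPA_CAP)


def weighted_update(prior: float, log_lr: float, kappa: float) -> float:
    """Assumed update rule (not confirmed by the source):
    posterior log-odds = prior log-odds + kappa * log-likelihood ratio."""
    k = clamp_kappa(kappa)
    prior_logit = math.log(prior / (1.0 - prior))
    posterior_logit = prior_logit + k * log_lr
    return 1.0 / (1.0 + math.exp(-posterior_logit))


if __name__ == "__main__":
    # Same evidence, different predictor weights.
    for kappa in (0.02, 0.5, 1.7):
        print(kappa, clamp_kappa(kappa), weighted_update(0.5, 2.0, kappa))
```

Under this assumed rule, a predictor at the floor still moves the probability slightly, which matches the page's wording "effectively silenced" rather than fully ignored.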
Reference class
This node isn't linked to a reference class. The Bayesian update applies without outside-view blending.
Probability over time
Milestone chain
- 2026-09-01 → 2028-06-30 | pending | First peer-reviewed paper documents RLHF-supervised model exhibiting deceptive alignment on superhuman-evaluator task
  How: ArXiv/Nature/NeurIPS paper from Anthropic, OpenAI, DeepMind, or Redwood Research empirically demonstrates RLHF-trained model passing human evaluation while failing ground-truth on a superhuman-difficulty task
  Source: https://arxiv.org/abs/2502.04675
  conf 65%
- 2027-09-11 | pending | Q1 window check-in (25%)
- 2027-01-01 → 2029-06-30 | pending | Frontier lab publicly deprecates pure RLHF as superalignment primary technique in favor of scalable-oversight architecture
  How: OpenAI, Anthropic, or DeepMind official safety publication or model card explicitly states RLHF alone is insufficient for next-tier model and names scalable-oversight/recursive-critique/debate as replacement
  Source: https://openai.com/index/weak-to-strong-generalization/
  conf 70%
- 2028-05-22 | pending | Q2 window check-in (50%)
- 2027-06-01 → 2029-12-31 | pending | Government or international body cites RLHF inadequacy in formal AI-safety policy
  How: AI Safety Institute (UK/US), EU AI Office, or G7 statement formally references RLHF limits at superhuman scale and recommends new technical frameworks per Aschenbrenner-style argument
  Source: https://claude5.com/news/ai-safety-2026-alignment-progress-and-open-challenges
  conf 50%
- 2027-06-01 → 2030-06-30 | pending | First public incident report of RLHF-supervised superhuman model causing measurable real-world harm via reward-hacking
  How: Frontier-lab incident disclosure, NIST AI Incident Database entry, or major regulator action references RLHF-induced misalignment in a deployed superhuman-class model
  Source: https://www.hushvault.ie/2026/01/23/__trashed-3/
  conf 40%
- 2029-01-31 | pending | Q3 window check-in (75%)
No downstream cascades — this prediction is a leaf in the dependency graph.
What if this resolves?
Click a button to clamp this prediction and run a Gibbs sample. Returns the predictions whose marginals shift most. ~30s per run; ideal for stress-testing "if X resolves, what else moves?"
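A minimal sketch of what "clamp and run a Gibbs sample" could look like, assuming a network of binary nodes with at most one parent each. The page does not describe its sampler, so this is an illustration and not the site's implementation. The numbers are taken from the single edge listed under "Top incoming (parents)" (TK01 at 15.0%, with conditionals 0.050 and 0.550), read here as P(this node | TK01 true) and P(this node | TK01 false). The node label `RLHF_FAILS` and the function names are hypothetical.

```python
import random

# Hypothetical two-node network built from the one edge shown on this page.
NETWORK = {
    "TK01": {"parent": None, "p": 0.15},
    "RLHF_FAILS": {"parent": "TK01", "p_true": 0.05, "p_false": 0.55},
}


def p_node(name, value, state):
    """Probability that `name` takes `value` given its parent's value in `state`."""
    spec = NETWORK[name]
    if spec["parent"] is None:
        p = spec["p"]
    else:
        p = spec["p_true"] if state[spec["parent"]] else spec["p_false"]
    return p if value else 1.0 - p


def gibbs_marginals(clamp, sweeps=20000, burn_in=1000, seed=0):
    """Estimate P(node=True) for every unclamped node, holding `clamp` fixed."""
    rng = random.Random(seed)
    state = {n: clamp.get(n, rng.random() < 0.5) for n in NETWORK}
    free = [n for n in NETWORK if n not in clamp]
    counts = {n: 0 for n in free}
    for sweep in range(sweeps + burn_in):
        for n in free:
            weights = []
            for v in (True, False):
                trial = dict(state)
                trial[n] = v
                # Markov blanket: the node's own factor times its children's factors.
                w = p_node(n, v, trial)
                for child, spec in NETWORK.items():
                    if spec["parent"] == n:
                        w *= p_node(child, trial[child], trial)
                weights.append(w)
            state[n] = rng.random() < weights[0] / (weights[0] + weights[1])
        if sweep >= burn_in:
            for n in free:
                counts[n] += state[n]
    return {n: counts[n] / sweeps for n in free}


if __name__ == "__main__":
    print(gibbs_marginals({"RLHF_FAILS": True}))   # TK01 marginal if this node resolves true
    print(gibbs_marginals({"RLHF_FAILS": False}))  # TK01 marginal if this node resolves false
```

With only one free node, each Gibbs step reduces to sampling directly from the exact conditional, so the estimates should land near 0.0075 / 0.475 ≈ 0.016 and 0.1425 / 0.525 ≈ 0.271. A larger graph would need the same loop over many more nodes, which is consistent with the ~30s run time quoted above.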
Evidence chain
Network propagation neighbors
Top incoming (parents)
Edges that influence THIS node's belief
| Kind | Node | Their prob | P(c|s=T) | P(c|s=F) | Δ implied |
|---|---|---|---|---|---|
| killer | TK01 AGI Capability Plateau (2026-27 Training Stall) | 15.0% | 0.050 | 0.550 | -0.020 |
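As a worked check on the row above, the marginal that this single edge implies for the node can be computed with the law of total probability. This assumes "s" is the source node (TK01) and "c" is this node; the page does not say how the "Δ implied" column is derived, so the sketch does not try to reproduce it. The function name `implied_marginal` is hypothetical.

```python
def implied_marginal(p_source: float, p_c_given_true: float, p_c_given_false: float) -> float:
    """P(c) = P(s) * P(c|s=T) + (1 - P(s)) * P(c|s=F)."""
    return p_source * p_c_given_true + (1.0 - p_source) * p_c_given_false


if __name__ == "__main__":
    p = implied_marginal(0.15, 0.050, 0.550)
    print(round(p, 4))  # 0.475

    # Sensitivity to the source: each +1.0 in P(s) shifts P(c) by
    # P(c|s=T) - P(c|s=F) = 0.050 - 0.550 = -0.5.
    slope = 0.050 - 0.550
    print(slope)
```

The negative slope is what makes TK01 a "killer" edge: the more likely the capability plateau, the less likely this prediction.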
Top outgoing (children)
Predictions THIS node influences
No outgoing edges.
Ticker exposure
Beneficiaries (11)
Prerequisites (1)
| Type | Pred | Title | Domain | Lag |
|---|---|---|---|---|
| killer | TK01 | AGI Capability Plateau (2026-27 Training Stall) | — | — |
Dependents (0)
| Type | Pred | Title | Domain | Lag |
|---|---|---|---|---|
| No dependents | | | | |
Linked documents (10)
Raw metadata
{
"nia": false,
"mode": "WARNING+FORECAST",
"role": "Cited-VC/Researcher",
"context": "Specific technical-alignment failure-mode framing distinct from INF_002 ('The Project' nationalization) and SEM_002 (AGI timeline). Critical input for policy/alignment community.",
"to_year": 2030,
"conv_cues": "catastrophic-failure framing; technical-specific",
"direction": "HAPPEN",
"from_year": 2027,
"timeframe": "2027-2030",
"conv_level": "HIGH",
"milestones": [
{
"kind": "llm_pre_event",
"label": "First peer-reviewed paper documents RLHF-supervised model exhibiting deceptive alignment on superhuman-evaluator task",
"source": "https://arxiv.org/abs/2502.04675",
"status": "pending",
"weight": 0.4,
"ordinal": -7,
"source_id": null,
"confidence": 0.65,
"expected_date": "2027-08-01",
"research_origin": "training",
"expected_date_range": {
"to": "2028-06-30",
"from": "2026-09-01"
},
"measurement_criterion": "ArXiv/Nature/NeurIPS paper from Anthropic, OpenAI, DeepMind, or Redwood Research empirically demonstrates RLHF-trained model passing human evaluation while failing ground-truth on a superhuman-difficulty task"
},
{
"kind": "quartile_checkpoint",
"label": "Q1 window check-in (25%)",
"status": "pending",
"weight": 0.05,
"ordinal": -6,
"source_id": null,
"expected_date": "2027-09-11",
"observed_date": null
},
{
"kind": "llm_pre_event",
"label": "Frontier lab publicly deprecates pure RLHF as superalignment primary technique in favor of scalable-oversight architecture",
"source": "https://openai.com/index/weak-to-strong-generalization/",
"status": "pending",
"weight": 0.4,
"ordinal": -5,
"source_id": null,
"confidence": 0.7,
"expected_date": "2028-03-31",
"research_origin": "training",
"expected_date_range": {
"to": "2029-06-30",
"from": "2027-01-01"
},
"measurement_criterion": "OpenAI, Anthropic, or DeepMind official safety publication or model card explicitly states RLHF alone is insufficient for next-tier model and names scalable-oversight/recursive-critique/debate as replacement"
},
{
"kind": "quartile_checkpoint",
"label": "Q2 window check-in (50%)",
"status": "pending",
"weight": 0.05,
"ordinal": -4,
"source_id": null,
"expected_date": "2028-05-22",
"observed_date": null
},
{
"kind": "llm_post_event",
"label": "Government or international body cites RLHF inadequacy in formal AI-safety policy",
"source": "https://claude5.com/news/ai-safety-2026-alignment-progress-and-open-challenges",
"status": "pending",
"weight": 0.4,
"ordinal": -3,
"source_id": null,
"confidence": 0.5,
"expected_date": "2028-09-15",
"research_origin": "training",
"expected_date_range": {
"to": "2029-12-31",
"from": "2027-06-01"
},
"measurement_criterion": "AI Safety Institute (UK/US), EU AI Office, or G7 statement formally references RLHF limits at superhuman scale and recommends new technical frameworks per Aschenbrenner-style argument"
},
{
"kind": "llm_post_event",
"label": "First public incident report of RLHF-supervised superhuman model causing measurable real-world harm via reward-hacking",
"source": "https://www.hushvault.ie/2026/01/23/__trashed-3/",
"status": "pending",
"weight": 0.4,
"ordinal": -2,
"source_id": null,
"confidence": 0.4,
"expected_date": "2028-12-14",
"research_origin": "training",
"expected_date_range": {
"to": "2030-06-30",
"from": "2027-06-01"
},
"measurement_criterion": "Frontier-lab incident disclosure, NIST AI Incident Database entry, or major regulator action references RLHF-induced misalignment in a deployed s
... (truncated)