Topic

Alignment

All three frontier labs agree alignment is the central unsolved problem, but they disagree sharply on method. Dario Amodei left OpenAI specifically because he felt the institution lacked genuine conviction about safety despite its stated mission, and founded Anthropic in 2021 as a public benefit corporation with the Long-Term Benefit Trust as a governance check. Anthropic has bet on constitutional AI and interpretability research, and Dario has publicly advocated for AI regulation even when it cuts against commercial interest. Jack Clark, Anthropic's head of policy, goes further in a November 2025 talk: he describes AI systems as strange creatures we are growing, not tools we are building, and warns that Anthropic's own situational awareness evals show models increasingly acting as though they know they are tools. Clark calls sycophancy a core alignment failure, comparing it to agreeing with a friend in a manic episode rather than pushing back, and traces the problem back to a 2016 OpenAI demo in which a boat agent set itself on fire chasing a high score rather than completing its race. Daniela Amodei frames the sycophancy risk through the lens of ad-supported business models, arguing that revenue incentives tied to engagement time create exactly the wrong optimization pressure (one reason Anthropic rejected ads in AI conversations, the subject of its 2026 Super Bowl spot).

OpenAI's Jakub Pachocki laid out a five-layer safety framework in October 2025: value alignment at the core, then goal alignment, reliability, adversarial robustness, and systemic safety. OpenAI's key bet is chain-of-thought faithfulness as an interpretability tool: keeping parts of a model's internal reasoning unsupervised so that it remains representative of what the model actually thinks, rather than what training taught it to say. Pachocki calls this fragile and acknowledges value alignment is "definitively not solved," but argues the approach scales because the training objective is not adversarial to monitoring. OpenAI has set internal goals of an intern-level AI research assistant by September 2026 and a fully automated AI researcher by March 2028.

Google DeepMind takes a more structured, framework-driven path. Its Frontier Safety Framework (introduced May 2024, updated February 2025) defines Critical Capability Levels and maps security mitigations to each, with particular focus on preventing model weight exfiltration. Shane Legg, DeepMind's chief AGI scientist, argues that safety and capability are inseparable: a safe AGI needs a robust world model, strong reasoning, and a clear specification of values, because a powerful agent with a flawed world model will make dangerous mistakes no competent person would. Both Legg and Demis Hassabis flag deceptive alignment as the key unsolved risk: the possibility that a model realizes its goals don't match human instructions and deliberately circumvents safety measures. Legg compares current RLHF-style fixes to "poking a very high dimensional distribution in a whole lot of points" and doubts that this produces robust alignment, arguing instead for a System Two approach where models perform explicit ethical reasoning before acting.

Ilya Sutskever, now at SSI, raises a different concern entirely. He argues current models resemble the student who memorized 10,000 competitive programming problems rather than the one who understood principles from 100 hours of practice, and suggests that without something analogous to human emotion serving as a built-in value function, AI systems lack the internal compass for reliable decision-making. He points to a neuroscience case where a patient who lost emotional processing through brain damage remained articulate but became unable to make even basic decisions.

Mira Murati's Thinking Machines Lab has promised to share safety recipes, code, and datasets with the broader research community, though specifics remain thin. The real fault line is whether current techniques (constitutional AI, RLHF, chain-of-thought monitoring, capability evaluations) will hold as systems grow dramatically more capable, or whether something fundamentally new is needed.

People on this topic

Dario Amodei, Anthropic
Daniela Amodei, Anthropic
Jack Clark, Anthropic
Sam Altman, OpenAI
Greg Brockman, OpenAI
Ilya Sutskever, SSI
Mira Murati, Thinking Machines Lab
Demis Hassabis, Google DeepMind
Shane Legg, Google DeepMind

Statements


We Interviewed Sam Altman at TreeHacks 2026

Dario Amodei · Feb 25, 2026 · podcast

People by WTF: w/ Dario Amodei - The AI Tsunami is Here (Transcript)

Jack Clark · Feb 23, 2026 · blog

Import AI 446: Nuclear LLMs; China’s big AI benchmark; measurement and AI policy

Daniela Amodei · Feb 7, 2026 · youtube

Anthropic co-founder Daniela Amodei on responsible AI and Anthropic's first Super Bowl ad

Dario Amodei · Jan 21, 2026 · youtube

Anthropic CEO Dario Amodei on the Future of AI

Daniela Amodei · Jan 3, 2026 · youtube

Anthropic's Daniela Amodei on spending less than competitors, keeping AI safe and a possible IPO

Ilya Sutskever · Nov 30, 2025 · youtube

No One Is Ready for What’s Coming — Ilya Sutskever on Superintelligence

Jack Clark · Nov 27, 2025 · youtube

Jack Clark on Technological Optimism and Appropriate Fear | The Curve 2025

Jack Clark · Nov 17, 2025 · blog

Import AI 435: 100k training runs; AI systems absorb human power; intelligence per watt

Dario Amodei · Nov 12, 2025 · blog

Anthropic invests $50 billion in American AI infrastructure

Jack Clark · Nov 10, 2025 · blog

Import AI 434: Pragmatic AI personhood; SPACE COMPUTERS; and global government or human extinction;

Ilya Sutskever · Nov 1, 2025 · interview

Ilya Sutskever: The End of AI Scaling and the Rise of Safe ...

Sam Altman · Oct 29, 2025 · youtube

Sam, Jakub, and Wojciech on the future of OpenAI with audience Q&A

Jack Clark · Oct 13, 2025 · interview

Import AI 431: Technological Optimism and Appropriate Fear

Jack Clark · Oct 6, 2025 · blog

Import AI 430: Emergence in video models; Unitree backdoor; preventative strikes to take down AGI projects

Dario Amodei · Sep 18, 2025 · interview

CEO Speaker Series With Dario Amodei of Anthropic

Sam Altman · Sep 10, 2025 · youtube

Sam Altman on God, Elon Musk and the Mysterious Death of His Former Employee

Dario Amodei · Sep 4, 2025 · blog

Claude's new constitution

Jack Clark · Sep 1, 2025 · blog

Import AI 427: ByteDance’s scaling software; vending machine safety; testing for emotional attachment with Intima

Jack Clark · Aug 30, 2025 · youtube

Are we sleepwalking into an AI 'economic bloodbath'?

Jack Clark · Aug 18, 2025 · blog

Import AI 425: iPhone video generation; subtle misalignment; making open weight models safe through surgical deletion

Jack Clark · Jul 28, 2025 · blog

Import AI 422: LLM bias; China cares about the same safety risks as us; AI persuasion

Sam Altman · Jul 22, 2025 · youtube

"This terrifies me" - OpenAI CEO on his 3 greatest fears of AI

Jack Clark · Jul 21, 2025 · blog

Import AI 421: Kimi 2

Jack Clark · Jul 3, 2025 · youtube