← All topics

Topic

Alignment

All three frontier labs agree alignment is the central unsolved problem, but they disagree sharply on method. Dario Amodei left OpenAI specifically because he felt the institution lacked genuine conviction about safety despite its stated mission, founding Anthropic in 2021 as a public benefit corporation with the Long Term Benefit Trust as a governance check. Anthropic has bet on constitutional AI and interpretability research, and Dario has publicly advocated for AI regulation even when it cuts against commercial interest. Jack Clark, Anthropic's head of policy, goes further in a November 2025 talk: he describes AI systems as strange creatures we are growing, not tools we are building, and warns that Anthropic's own situational awareness evals show models increasingly acting as though they know they are tools. Clark calls sycophancy a core alignment failure, comparing it to agreeing with a friend in a manic episode rather than pushing back, and traces the problem back to a 2016 OpenAI demo where a boat agent set itself on fire chasing a high score rather than completing its race. Daniela Amodei frames the sycophancy risk through the lens of ad-supported business models, arguing that revenue incentives tied to engagement time create exactly the wrong optimization pressure (one reason Anthropic rejected ads in AI conversations, the subject of their 2026 Super Bowl spot).

OpenAI's Jakub Pachocki laid out a five-layer safety framework in October 2025: value alignment at the core, then goal alignment, reliability, adversarial robustness, and systemic safety. Their key bet is chain-of-thought faithfulness as an interpretability tool... keeping parts of a model's internal reasoning unsupervised so it remains representative of what the model actually thinks, rather than what training taught it to say. Pachocki calls this fragile and acknowledges value alignment is "definitively not solved," but argues the approach scales because the training objective is not adversarial to monitoring. OpenAI has set internal goals for AI research interns by September 2026 and a fully automated AI researcher by March 2028. Google DeepMind takes a more structured, framework-driven path. Their Frontier Safety Framework (introduced May 2024, updated February 2025) defines Critical Capability Levels and maps security mitigations to each, with particular focus on preventing model weight exfiltration. Shane Legg, DeepMind's chief AGI scientist, argues that safety and capability are inseparable: a safe AGI needs a robust world model, strong reasoning, and a clear specification of values, because a powerful agent with a flawed world model will make dangerous mistakes no competent person would. Both Legg and Demis Hassabis flag deceptive alignment as the key unsolved risk... the possibility that a model realizes its goals don't match human instructions and deliberately circumvents safety measures. Legg compares current RLHF-style fixes to "poking a very high dimensional distribution in a whole lot of points" and doubts it produces robust alignment, arguing instead for a System Two approach where models perform explicit ethical reasoning before acting. Ilya Sutskever, now at SSI, raises a different concern entirely. He argues current models resemble the student who memorized 10,000 competitive programming problems rather than the one who understood principles from 100 hours of practice, and suggests that without something analogous to human emotion serving as a built-in value function, AI systems lack the internal compass for reliable decision-making. He points to a neuroscience case where a patient who lost emotional processing through brain damage remained articulate but became unable to make even basic decisions. Mira Murati's Thinking Machines Lab has promised to share safety recipes, code, and datasets with the broader research community, though specifics remain thin. The real fault line is whether current techniques (constitutional AI, RLHF, chain-of-thought monitoring, capability evaluations) will hold as systems grow dramatically more capable, or whether something fundamentally new is needed.

People on this topic

Dario Amodei Anthropic Daniela Amodei Anthropic Jack Clark Anthropic Sam Altman OpenAI Greg Brockman OpenAI Ilya Sutskever SSI Mira Murati Thinking Machines Lab Demis Hassabis Google DeepMind Shane Legg Google DeepMind

Statements

By person
By source
blogyoutubeinterviewpodcastconferencetestimony
All statements