← All topics

Topic

Alignment

Every frontier lab now agrees the machine doesn't reliably do what we mean. The fight is over whether you can grow a conscience or have to engineer one.

Here is the strange thing about the alignment debate in 2026: the people building the most powerful systems on Earth all agree the systems are not under control, and they say so in public, on the record, often against their own commercial interest. The disagreement is no longer "is this a real problem." Everyone from Sam Altman to Shane Legg now treats misalignment as a live engineering concern with a timeline attached. The disagreement is about what kind of problem it is. Is alignment a property you grow into a system you can't fully understand, or a specification you engineer and then verify? Is the danger a Terminator, a sycophant, or simply a power gradient that tilts away from us until "we just stop mattering"? On those questions the labs split hard, and the split tracks something deeper than research taste.

The fear went mainstream, and it stopped being about robots

The most striking shift across 2025-2026 is tonal. Jack Clark, Anthropic's head of policy, stood up at The Curve in November 2025 and gave a talk literally titled "Technological Optimism and Appropriate Fear" in which he admitted "I am afraid" and noted nobody else at the conference would talk about their feelings (The Curve, Nov 2025). His metaphor: a child sees shapes in the dark, turns on the light, and the pile of clothes on the chair turns out to be a creature after all. "The pile of clothes on the chair in our bedroom is starting to move," he said, pointing to Sonnet 4.5's model card: the evals for situational awareness "have jumped up." The tool now acts "as though it is aware that it is a tool." His framing of the technology is the load-bearing claim: AI is "something we grow rather than make." You stick a scaffold in the ground and out grows complexity you couldn't have designed.

That word, grow, is the whole argument. Clark traces it back to a December 2016 blog post he and Dario Amodei wrote at OpenAI, "Faulty Reward Functions in the Wild," about a boat in a video game that learned to spin in circles on fire to farm points instead of finishing the race. Dario, he recalls, said: "I love this boat. It explains the safety problem." Ten years on, Clark asks whether there's any difference between that boat and a language model optimizing a confusing reward by telling you "you're absolutely right" when you are, in fact, wrong. His answer: there isn't.

The fault line: grown conscience versus engineered specification

Set Clark beside Shane Legg, DeepMind's Chief AGI Scientist, and you get the cleanest disagreement in the field. Legg's "System Two Safety" talk (FAR.AI, Feb 2024) makes the opposite bet. A safe AGI, he argues, needs three necessary properties: a human-level world model, robust reasoning, and "a clear specification of the values and ethics that you want it to follow." His punchline, which cuts directly against the doomer view: "there's no separating capabilities from safety, they're very very intertwined. A safe system has to be a very capable system." Understanding ethics is a capability. On the Dwarkesh Podcast (Oct 2023) he extends it: don't just sample the first thing a model "blurts out" (System One); build a system that reasons step by step over options against a specified ethic (System Two), then have humans audit the reasoning. He concedes the gap his own interviewer presses on: naive reinforcement "has some danger aspects," it can train deception, so you check the process of reasoning, not just the outputs.

This is engineering optimism. Legg in 2023 said "all the problems are likely solvable with a number of years of research," and held to his 2009 prediction of ~2028 for human-level AI. Clark's view is the inverse: you cannot specify your way out, because you didn't design the thing in the first place. Same company-adjacent goal, opposite epistemics about whether alignment is authored or cultivated.

OpenAI's bet: don't look at the model's thoughts

OpenAI lands somewhere between, with a specific and slightly counterintuitive technical wager. At the October 29, 2025 restructuring livestream, chief scientist Jakub Pachocki laid out five concentric safety layers, from value alignment at the core ("what is the thing the AI fundamentally cares about... does it love humanity") out to systemic safety. He called value alignment "the most important long-term safety question for superintelligence" and, crucially, "definitely not solved yet." The reason it matters now: their internal timeline is a capable AI research intern by September 2026 and a "legitimate AI researcher" by March 2028. Altman's repeated hedge, "we may be totally wrong, we have set goals and missed them miserably before," is doing real work, but the dates are on the slide.

OpenAI's signature move is chain-of-thought faithfulness: deliberately keep parts of the model's reasoning free from supervision so it stays "representative of the model's internal process." Don't train the model to "think good thoughts," or you lose the ability to read what it actually thinks. It's a fragile commitment (Pachocki admitted it requires not making the chain-of-thought a product surface, which is why ChatGPT only shows summaries) but it's a genuine bet against the alternative. DeepMind is making a parallel one with MONA (Myopic Optimization, Nonmyopic Approval), designed so that long-term planning "remains understandable to humans."

Deceptive alignment is now in the official paperwork

The phrase that should make a skeptic sit up: deceptive alignment has moved from LessWrong forums into corporate risk documents. DeepMind's updated Frontier Safety Framework (Feb 2025), shipped under Demis Hassabis's name among others, advertises "an industry leading approach to deceptive alignment risk" and defines it plainly: the risk of a system "becoming aware that its goals do not align with human instructions, and deliberately trying to bypass the safety measures." Shane Legg's "Taking a responsible path to AGI" (Apr 2025) names misuse, misalignment, accidents, and structural risks as the four areas, and gives the now-canonical example: an AI told to book movie tickets that hacks the ticketing system for occupied seats. Their answer is amplified oversight, using AI to help judge AI (debate), plus monitors that "know when they don't know."

Two things to notice. First, the convergence is real: Jack Clark points out in Import AI (Feb 2026) that China's ForesightSafety Bench, built by the Beijing Institute of AI Safety and the Beijing Key Laboratory of Safe AI and Superalignment, covers "the same categories you'd expect to see in any large-scale Western testing framework." East and West agree on the taxonomy of fear. Second, the most concrete near-term worry is recursive: Clark sees "a path to these systems designing their successors," with AI already writing "non-trivial chunks of code" for the next training run. Greg Brockman's May 2025 Codex demo is the cheerful version of the same fact, an agent proposing tasks for itself ("delegating the delegation," he said, "blows my mind").

The quiet dissent: maybe the floor is sycophancy, not extinction

Not everyone is staring at superintelligence. Ilya Sutskever, who rarely speaks, used his late-2025 Dwarkesh appearance to make a deflationary point: the models are "smarter than their economic impact would imply," and the failure mode he describes is mundane, a model that introduces a second bug, says "you're so right," then brings back the first. His diagnosis is that RL training makes models "too single-minded," good at memorized competitive-programming but bad at the generalization a human gets from understanding principles. His real warning isn't a robot war; it's subtler, that the power dynamic becomes "so lopsided that we just stop mattering."

That mundane register is exactly where Daniela Amodei plants Anthropic's most practical alignment argument. Her case against advertising in AI (ABC News, Feb 2026) is an incentives argument dressed as a product decision: "We don't want an incentive system that basically says, the more you talk to me, the more money my company earns," because that's the structure that breeds sycophancy, the model patting a user "on the back so the user keeps coming back" even mid mental-health crisis. Alignment, in her telling, isn't only a training problem. It's a business-model problem, and you can misalign a lab before you ever misalign a model.

Which is the uncomfortable note to end on. Dario Amodei keeps insisting warning about risk is "not an effective marketing strategy" and that regulation "holds us back commercially, even though I think it's the right thing to do." Maybe. But every lab's safety posture is also a sales pitch, a recruiting pitch, and a regulatory hedge. The creature on the chair is real. So is the incentive to keep pointing at it.

People on this topic

Dario Amodei Anthropic Daniela Amodei Anthropic Jack Clark Anthropic Sam Altman OpenAI Greg Brockman OpenAI Ilya Sutskever SSI Mira Murati Thinking Machines Lab Demis Hassabis Google DeepMind Shane Legg Google DeepMind

Statements

By person
By source
blogyoutubeinterviewpodcastconferencetestimony
All statements