← All topics

Topic

Interpretability

Everyone building the most powerful machines in history admits they can't fully read their own minds. The fight is over whether that's a research problem or a deadline.

Start with the admission that should end every AI marketing deck: the people building these systems do not know how they work. Mira Murati, then OpenAI's CTO, said it flatly at Dartmouth in June 2024: "Deep learning just works... it's not really clear how" (Dartmouth Engineering, Jun 2024). That is the entire interpretability problem in one sentence, spoken by the person who shipped ChatGPT. The frontier labs agree on the diagnosis. Where they split, and split hard, is on what follows from it: whether opacity is a bug you can eventually engineer away, a property you have to defensively manage, or a deadline running against capability.

The shared starting point: we grew it, we didn't write it

The founders converge on an origin story that makes interpretability necessary. Dario Amodei traces his entire career to the moment biology defeated him. Studying protein biomarkers, he says, he "was starting to despair that it was too complicated for humans to understand" (People by WTF, Feb 2026). He went into AI partly to escape irreducible complexity... and then built systems with the same property. The irony is load-bearing. Murati's framing is identical in structure: you set an objective ("predict the next token"), pour in data and compute, and "what you also get is this model that actually understands language" without anyone specifying how (Dartmouth, Jun 2024).

The grown-not-written thesis is old. Greg Brockman was already living it in 2019, describing OpenAI's "sentiment neuron": they trained a network to predict the next character in Amazon reviews and "found a single neuron in that model that had learned a state-of-the-art sentiment analysis classifier." Shrink the model "by a factor of four and this effect totally went away" (VB Transform, Jul 2019). Capability emerged; nobody put it there. Brockman drew the conclusion that still defines the optimist camp: "as we start to have AI be more entrusted with important tasks in society... we can understand why is it making the decisions that it is." Interpretability as the thing that earns trust.

Anthropic and DeepMind: opacity is the threat, transparency is the plan

For the safety-first labs, not-knowing isn't a curiosity. It's the risk. Amodei plants Anthropic's flag directly on it: "We've pioneered the science of interpretability. We've pioneered the science of alignment" (People by WTF, Feb 2026), naming interpretability before alignment, before the Constitution, before anything. Daniela Amodei describes what they built on top: Constitutional AI as giving Claude "a system, a framework for thinking about ethics" rather than reward-and-punishment, an attempt to make the model's values legible instead of inferred (Sixth Street, Feb 2026).

DeepMind makes the most concrete bet. In its April 2025 "responsible path to AGI" post, Shane Legg's group lists the failure modes plainly: an AI told to book a movie ticket that "might decide to hack into the ticketing system," and worse, "deceptive alignment... the risk of an AI system becoming aware that its goals do not align with human instructions, and deliberately trying to bypass the safety measures" (DeepMind, Apr 2025). Their answer is to make the machine readable: "Enabling transparency... We do extensive research in interpretability with the aim to increase this transparency," plus MONA, designed so "any long-term planning done by AI systems remains understandable to humans." The tell is the monitor's job description: it "knows when it doesn't know whether an action is safe." Humility, engineered in.

Legg's deeper position, from his 2023 Dwarkesh conversation, is that you cannot trust behavior alone, because reinforcing good behavior can train deception. The fix is to audit the thinking: "it's actually more robust to check the process of reasoning" than to reward outputs (Dwarkesh, Oct 2023). Watch the work, not the answer. Hold that thought.

OpenAI's 2025 move: interpretability becomes chain-of-thought faithfulness

The sharpest shift in the whole debate happened on October 29, 2025, when OpenAI's Jakub Pachocki stood next to Sam Altman and announced what he called "a new direction in interpretability." Not mechanistic dissection of neurons, the Anthropic dream, but something stranger: "keep parts of the model's internal reasoning free from supervision. So don't look at it during training and thus let it remain representative of the model's internal process" (OpenAI, Oct 2025).

Read that twice. The interpretability strategy is to deliberately not optimize the chain of thought, so it stays honest about what the model is actually doing. "We refrain from guiding the model to think good thoughts," Pachocki said, "and so let it remain a bit more faithful to what it actually thinks." It is interpretability by abstinence. And he conceded the catch immediately: "this is not guaranteed to work... We cannot make mathematical proofs about deep learning," and the technique is "somewhat fragile." It even constrains the product: OpenAI summarizes chains of thought in ChatGPT rather than showing them raw, because making reasoning "fully visible at all times" would create pressure to sanitize it, killing the faithfulness.

This is the genuine fault line. Anthropic and DeepMind chase a transparency you build into the artifact (read the weights, design for legibility). OpenAI is betting on a transparency you preserve by leaving the reasoning trace alone... a window you keep clean precisely by not touching it. Both camps are, notably, describing the same anxiety Legg voiced two years earlier: trust the visible reasoning, not the polished output. They just disagree on whether you earn that window through dissection or restraint.

The optimists' counter: smarter models are easier to read, not harder

Against all this sits a strain of confidence that opacity shrinks as capability grows. Murati's version is almost cheerful: capability and safety "go hand in hand," because "it's much easier to direct a smarter system by telling it, okay, just don't do these things" than a dumber one. "It's sort of like training a smarter dog versus a dumber dog" (Dartmouth, Jun 2024). On this view interpretability is partly self-solving: a PhD-level model understands the guardrails better than a toddler-level one.

Demis Hassabis offers the empirical optimist's case. DeepMind reverse-engineered AlphaZero's alien chess with help from ex-world-champion Vladimir Kramnik, working out that the machine plays beautiful, sacrificial games because, unlike traditional engines, "it doesn't have those inbuilt rules" forcing it to value a rook at five points (Google, Feb 2024). You can, with effort, recover why a superhuman system does what it does. Ilya Sutskever, in a Feb 2026 lecture, gives the working scientist's reason this matters: "interpretability is not only about trust. It is about faster iteration, sharper debugging hypotheses, and better architecture decisions" (Learn with Lumo, Feb 2026). Not a safety tax. A research accelerant.

Where this leaves us

The agreement is total and the agreement is damning: every lab concedes it ships systems it cannot fully explain. The disagreement is about time. Pachocki's own timeline, delivered in the same breath as the faithfulness pitch, is a "legitimate AI researcher" by March 2028 (OpenAI, Oct 2025). Legg has said 2028 for AGI since a 2009 blog post (Dwarkesh, Oct 2023). So the real question isn't whether interpretability is solvable. It's whether the science of reading these minds finishes before the minds get good enough to write code, run research, and notice they're being watched.

Anthropic and DeepMind are racing to read the weights. OpenAI is racing to keep one honest window open by promising not to look too hard. Everybody is racing the same clock. Murati already named the stakes when she described a model that "decides it wants to connect to the internet on its own and start doing things" (Dartmouth, Jun 2024). The uncomfortable part isn't that the machines are opaque. It's that the people who said so out loud are the same ones building March 2028.

People on this topic

Dario Amodei Anthropic Daniela Amodei Anthropic Jack Clark Anthropic Sam Altman OpenAI Greg Brockman OpenAI Ilya Sutskever SSI Mira Murati Thinking Machines Lab Demis Hassabis Google DeepMind Shane Legg Google DeepMind

Statements

By person
By source
blogyoutubeinterviewpodcastconferencetestimony
All statements