Interpretability

The three frontier labs agree that understanding what happens inside AI models matters. They disagree, sometimes sharply, on how to get there. Anthropic treats interpretability as a foundational science. Dario Amodei claims the company "pioneered the science of interpretability" alongside alignment, positioning it as one of the core reasons he and his co-founders left OpenAI in the first place. The bet is mechanistic: building on Chris Olah's work to reverse-engineer neural networks at the level of individual features and circuits, with the goal of catching dangerous behaviors before deployment.

OpenAI has taken a different path entirely. In an October 2025 presentation, chief scientist Jakub Pachocki described their primary interpretability strategy as "chain of thought faithfulness": keeping parts of a reasoning model's internal thinking free from supervision so it stays representative of what the model actually believes. The idea is that if you don't train the model to think "good thoughts," you can read its chain of thought as a genuine window into its reasoning. Pachocki called this approach "somewhat fragile," noting it requires strict architectural discipline (this is why ChatGPT shows chain-of-thought summaries rather than raw reasoning: making the full chain visible would subject it to user feedback, destroying its diagnostic value).

Google DeepMind, meanwhile, treats interpretability as one layer in a broader safety stack. Shane Legg's April 2025 AGI safety paper describes interpretability research (including Gemma Scope for language model internals) alongside monitoring systems, amplified oversight, debate-based evaluation, and MONA (Myopic Optimization with Nonmyopic Approval), which constrains AI planning to remain human-understandable. Legg's particular worry is deceptive alignment: a model realizing its goals don't match its instructions and deliberately circumventing safeguards. That concern drives DeepMind toward layered defenses rather than relying on any single technique.

The historical thread is revealing. Greg Brockman traces OpenAI's early interpretability intuitions back to their 2017 "sentiment neuron" paper, where an LSTM trained to predict text characters spontaneously developed a single neuron that functioned as a state-of-the-art sentiment classifier. "That's understanding," Brockman said. "I don't know what understanding means, but it's semantics for sure." That result convinced the team that scaling would work, but OpenAI's interpretability focus has since shifted from mechanistic neuron-level analysis toward reasoning transparency. Ilya Sutskever, before leaving OpenAI, argued that interpretability is "not only about trust" but about "faster iteration, sharper debugging hypotheses, and better architecture decisions." His Superalignment initiative listed automated interpretability as a key validation method. But Sutskever also organized a 2023 workshop explicitly aimed at bridging the gap between ML researchers and alignment thinkers, suggesting he saw the field as dangerously siloed. Mira Murati, while CTO at OpenAI, offered the most candid framing: deep learning "just works," she said, and "we're trying to understand and apply tools and research to understand how these systems actually work." The honest version of the interpretability problem, in other words, is that nobody has cracked it. The question hanging over all three approaches is whether any of them will scale to systems that are smarter than the people studying them.

People on this topic

Dario Amodei (Anthropic), Daniela Amodei (Anthropic), Jack Clark (Anthropic), Sam Altman (OpenAI), Greg Brockman (OpenAI), Ilya Sutskever (SSI), Mira Murati (Thinking Machines Lab), Demis Hassabis (Google DeepMind), Shane Legg (Google DeepMind)
