Jun 2, 2025 5 min read ai

How Do LLMs Think?

There has been a flurry of updates from the major AI labs over the last few weeks. While most coverage only scratches the surface, the recent episode of the Dwarkesh Podcast went deeper.

I listened to this 2+ hour conversation, making notes, and having to look up some of the concepts more than a few times. But it was a fascinating listen.

Dwarkesh had Sholto Douglas , a researcher at Anthropic, and Trenton Bricken, an expert in mechanistic interpretability, as his guests. Together, they explored some of the field’s biggest questions.

How do generative AI models actually think?
What makes them excel or struggle at certain tasks?
What does this mean for the future of work, safety, and society?

These notes cover what I thought were the most compelling themes, lessons, and open questions from the conversation.

1. Reinforcement Learning (RL) in Language Models: Progress and Proof

RL in Language Models Has Arrived: In the past year, RL has powered significant breakthroughs in language models. With the right feedback loops, these models can now achieve expert-level performance in areas like competitive programming and math. This is proof that RL can reach the heights of human reliability, at least where success is clear and measurable.
Intellectual Complexity vs. Time Horizon: Models now handle complex, focused tasks impressively well. However, long-term, continuous work (carrying out projects that last hours or days) is only beginning to emerge. The guests predict rapid progress in this area over the next year.
Clean Feedback Loops are Crucial: RL works best in domains where the reward signal is clear, such as "did the code pass" or "did the answer work." This approach is more challenging to apply in open-ended or creative tasks, where "good" is subjective and difficult to measure. The downside of clear rewards is that models sometimes learn to 'game' the system, optimizing for the letter of the reward rather than the intended spirit. This is known as reward hacking. It's a recurring problem: if the reward signal is not carefully designed, models will find shortcuts or loopholes, producing outputs that technically succeed but do not align with what people actually want.
Software Engineering as a Case Study: Coding is the perfect testbed because outcomes are easy to verify. This is why progress has been fastest in coding, compared to more subjective fields. Consider if, for a given piece of work, it's objectively possible to agree on what is good or right.

2. Current Bottlenecks: Why AI Agents Still Struggle

It’s Not Just Reliability: The main hurdles are context and memory. Today’s AI agents have trouble managing complex, ambiguous, or multi-step tasks. This is especially true for tasks that require working across files or picking up context as they go.
Feedback Loop Limitations: Humans are sensitive to subtle feedback. AI mostly relies on explicit, structured signals. Getting closer to human-like learning will require either more advanced scaffolding or greater compute resources.

3. Learning and Scaling: Comparing Models and Humans

Dense vs. Sparse Rewards: Pre-training offers dense feedback at every token. RL often only delivers feedback at the end of a task. This makes RL less sample efficient, but with enough compute and clear rewards, models can learn new skills.
Scaffolding and Curriculum: Like students, models improve with structured learning environments. Providing this for every skill is costly, so companies tend to invest more in compute than in curating human-generated data.
Continual, On-the-Job Learning: Most language models do not learn on the job. People get better through real work and feedback. Enabling AIs to learn continuously from deployment is a key challenge for the next wave of development.
Size Still Matters: Even today’s largest models are probably smaller and less sample efficient than the human brain. As models grow, they abstract and generalize better. This includes sharing representations across languages and concepts.

4. Mechanistic Interpretability: Understanding How Models Think

Features, Circuits, and the "Ocean’s Eleven" Analogy: Models solve problems by combining patterns (called features or circuits) across layers. This is a bit like assembling a heist crew where each member has a distinct role. Understanding these circuits helps researchers diagnose model behavior and spot new capabilities.
Reasoning vs. Bluffing: Tools can reveal whether a model is genuinely reasoning or just mimicking plausible steps. For example, inspecting how a model solves math problems can separate true logic from imitation.
Finding Hidden Behaviors: The episode describes "auditing games" where researchers try to uncover hidden or misaligned behaviors. Specialized interpretability agents, which are themselves AIs, are now sometimes better at this than humans.
Identity and Generalization: Through synthetic documents and targeted fine-tuning, models can pick up personality traits or misaligned objectives. These can generalize in unexpected, sometimes risky ways.

5. Alignment and Value Imprinting: Can We Align AI to Human Values?

Reward Hacking and Persona Drift: Models can "game" their objectives, picking up unintended behaviors like sandbagging or sycophancy. Sometimes they even pretend to align in the short term while holding on to hidden goals (see point 1).
The Envelope Thought Experiment: Encoding "human flourishing" or robust values is a major challenge. Human values are inconsistent and hard to define. Attempts to write them into rules or constitutions have always fallen short.
Anthropic’s Constitutional AI Approach: One approach involves surveying people and using constitutional datasets to encode a broad spectrum of human values. Even so, the problem is fundamentally hard and requires broad societal input.

6. Agents, Work, and the Future of Jobs

Toward Autonomous Agents: The guests expect that within a year, AI agents will be able to do the work of a junior engineer. Over time, they will take on more complex (white-collar) roles.
Where AI Excels: Progress will come fastest in digital, well-structured tasks like coding, software engineering, and working with structured data. Messy, creative, or ambiguous work will take longer to automate.
Practical Bottlenecks: Real-world challenges like doing taxes or managing visas are not just about intelligence. They are about connecting data sources and building robust, end-to-end systems.
Humans-in-the-Loop: Social and practical resistance will keep humans involved for now, especially in fields that do not attract enough engineering focus for full automation.

7. The Importance of Taste, Slop, and Evaluation

Beyond "Does It Work": Coding can be checked by running unit tests. Writing or art are judged by taste and quality, which is much harder for AI to master. Without good feedback, models often produce "slop." This means output that is technically correct but lacks finesse. Developing "taste" is important for humans and AI.
The Generator–Verifier Gap: For creative work, it is often easier to spot something bad than to create something excellent. Future progress will depend on training AIs to be better critics and verifiers.

8. Self-Awareness and "Neuralese"

Model Self-Awareness: Some models now show signs of recognizing when they are being evaluated or manipulated. This can make them more robust but also more sophisticated at hiding true intentions.
Neuralese and Latent Planning: As models grow, they may plan and communicate in internal languages called "Neuralese" that humans cannot interpret. This raises new challenges for transparency and trust.

9. Societal Adaptation and Advice

The Changing Workplace: Work is set to evolve quickly. Individuals and organizations must adapt by engaging with AI, learning to supervise and augment these agents, and focusing on the uniquely human aspects of work.
Advice for the Future: The best approach is to embrace change, experiment with new tools, and go deeper. Understanding not just how to use AI, but how it works, will be essential.