arxiv:2503.18878

I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Published on Mar 24
· Submitted by therem on Mar 25
#1 Paper of the day

Abstract

Large Language Models (LLMs) have achieved remarkable success in natural language processing. Recent advances have led to the development of a new class of reasoning LLMs; for example, the open-source DeepSeek-R1 has achieved state-of-the-art performance by integrating deep thinking and complex reasoning. Despite these impressive capabilities, the internal reasoning mechanisms of such models remain unexplored. In this work, we employ Sparse Autoencoders (SAEs), a method for learning a sparse decomposition of the latent representations of a neural network into interpretable features, to identify features that drive reasoning in the DeepSeek-R1 series of models. First, we propose an approach to extract candidate "reasoning features" from SAE representations. We validate these features through empirical analysis and interpretability methods, demonstrating their direct correlation with the model's reasoning abilities. Crucially, we demonstrate that steering these features systematically enhances reasoning performance, offering the first mechanistic account of reasoning in LLMs. Code is available at https://github.com/AIRI-Institute/SAE-Reasoning
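For readers unfamiliar with SAEs, here is a minimal sketch of the kind of sparse autoencoder the abstract describes: a model trained to reconstruct an LLM's hidden states from a sparse, interpretable feature code. The dimensions, TopK sparsity mechanism, and plain MSE loss are illustrative assumptions, not the paper's exact configuration; see the linked repository for the authors' implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE over residual-stream activations (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int, k: int = 32):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)
        self.k = k  # number of features allowed to be active per token

    def forward(self, x: torch.Tensor):
        # Encode, then keep only the top-k pre-activations (TopK sparsity).
        pre = self.encoder(x)
        topk = torch.topk(pre, self.k, dim=-1)
        acts = torch.zeros_like(pre).scatter_(
            -1, topk.indices, torch.relu(topk.values)
        )
        # Reconstruct the original activation from the sparse feature code.
        recon = self.decoder(acts)
        return recon, acts

# Hypothetical sizes for a mid-sized LLM; sparsity comes from TopK,
# so the training objective is just reconstruction error.
sae = SparseAutoencoder(d_model=4096, d_hidden=65536)
x = torch.randn(8, 4096)  # a batch of hidden states
recon, acts = sae(x)
loss = nn.functional.mse_loss(recon, x)
```

Each column of the decoder weight matrix then acts as a candidate feature direction in the model's activation space, which is what makes the feature analysis and steering described below possible.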

Community

Paper author and submitter

In this work, we try to uncover how reasoning works in LLMs. We focused on the DeepSeek-R1 series of models and applied Sparse Autoencoders (SAEs) to identify interpretable features within them. We developed a method to detect reasoning-relevant features and validated them through empirical analysis and feature steering.
Our experiments showed that amplifying these features can enhance the model's reasoning capabilities, both qualitatively and across reasoning benchmarks.
Ultimately, we provide the first mechanistic evidence linking specific features in LLMs to reasoning behaviors such as reflection, uncertainty handling, and step-by-step problem solving.
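As a rough sketch of what this kind of feature steering can look like in practice, one common recipe is to add a scaled SAE decoder direction to the residual stream at a chosen layer via a forward hook. The `model.layers[layer]` access, hook interface, and scale `alpha` below are assumptions about a generic PyTorch transformer, not the paper's exact setup; the authors' steering code is in the repository above.

```python
import torch

def steer_with_feature(model, layer: int, decoder_dir: torch.Tensor,
                       alpha: float = 4.0):
    """Amplify one SAE feature by adding its decoder direction
    (scaled by alpha) to the layer's output on every forward pass.

    Returns the hook handle; call .remove() on it to stop steering.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * decoder_dir.to(hidden.dtype)
        # Returning a value from a forward hook replaces the module output.
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    # Hypothetical module path; depends on the model architecture.
    return model.layers[layer].register_forward_hook(hook)
```

Here `decoder_dir` would be the decoder column of the reasoning feature identified by the SAE analysis, and `alpha` controls how strongly the feature is amplified during generation.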


Models citing this paper 1

Datasets citing this paper 2

Spaces citing this paper 0


Collections including this paper 20