How Programming Concepts and Neurons Are Shared in Code Language Models
Abstract
In LLMs that represent multiple programming languages, the concept space clusters closer to English, and individual languages elicit distinct neuron activations, particularly in the upper layers, while highly aligned languages share similar representations.
Several studies have explored the mechanisms of large language models (LLMs) in coding tasks, but most have focused on programming languages (PLs) in a monolingual setting. In this paper, we investigate the relationship between multiple PLs and English in the concept space of LLMs. We perform a few-shot translation task on 21 PL pairs using two Llama-based models. By decoding the embeddings of intermediate layers during this task, we observe that the concept space lies closer to English (including PL keywords) than to the input or output PL, assigning high probabilities to English tokens in the second half of the intermediate layers. We analyze neuron activations for 11 PLs and English, finding that while language-specific neurons are primarily concentrated in the bottom layers, those exclusive to each PL tend to appear in the top layers. For PLs that are highly aligned with multiple other PLs, identifying language-specific neurons is not feasible. These PLs also tend to have a larger keyword set than other PLs and lie closer to the model's concept space regardless of the input/output PL in the translation task. Our findings provide insights into how LLMs internally represent PLs, revealing structural patterns in the model's concept space. Code is available at https://github.com/cisnlp/code-specific-neurons.
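A minimal sketch of the kind of intermediate-layer decoding described above (the logit lens), assuming a Llama-style checkpoint from Hugging Face transformers; the model name, few-shot prompt, and translation pair below are illustrative placeholders, not the paper's exact setup:

```python
# Logit-lens sketch: decode each intermediate layer's hidden state through
# the unembedding matrix to see which token the layer currently "prefers".
# Assumption: any Llama-style causal LM; prompt and PL pair are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumption, not the paper's checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Few-shot translation prompt (illustrative), Python -> Java:
prompt = (
    'Python: print("hi")\nJava: System.out.println("hi");\n'
    "Python: n = len(xs)\nJava:"
)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding layer; one entry per transformer layer after.
for layer_idx, h in enumerate(out.hidden_states):
    h_last = model.model.norm(h[0, -1])   # final RMSNorm before the LM head
    logits = model.lm_head(h_last)        # project into vocabulary space
    token = tokenizer.decode(logits.argmax().item())
    print(f"layer {layer_idx:2d}: {token!r}")
```

If English tokens dominate the decoded distributions in the second half of the layers, that is the pivot-like behavior the abstract reports and RQ1 below asks about.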
Community
How does pre-training on multiple programming languages and English affect the LLM's concept space in coding tasks?
More specifically:
RQ1. Do LLMs use English or a programming language as a kind of pivot language?
RQ2. Can we identify language-specific neurons for programming languages and English? Do programming languages and English influence one another, and are neurons shared across them?
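One common way to operationalize "language-specific neurons" (RQ2) is to compare how often each FFN neuron fires on per-language corpora. The sketch below is an assumption-laden illustration rather than the paper's exact recipe: the model name, the tiny corpora, and the 0.9/0.1 thresholds are all placeholders.

```python
# Sketch: flag FFN neurons that fire often on one PL and rarely on another.
# Assumption: Llama-style MLP (down_proj(act_fn(gate_proj(x)) * up_proj(x))),
# so hooking act_fn captures per-neuron activations of the gate projection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumption, not the paper's checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

corpora = {  # tiny illustrative corpora; real analysis needs far more text
    "python": ["def add(a, b):\n    return a + b"],
    "java":   ["int add(int a, int b) { return a + b; }"],
}

def activation_probs(texts):
    """Fraction of tokens on which each FFN neuron is positive, per layer."""
    layers = model.model.layers
    probs = torch.zeros(len(layers), model.config.intermediate_size)
    n_tokens, cache = 0, {}
    hooks = [
        layer.mlp.act_fn.register_forward_hook(
            lambda mod, inp, out, idx=idx: cache.__setitem__(idx, out)
        )
        for idx, layer in enumerate(layers)
    ]
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt")
            model(**ids)
            for idx, act in cache.items():          # act: (1, seq, intermediate)
                probs[idx] += (act[0] > 0).float().sum(dim=0)
            n_tokens += ids.input_ids.shape[1]
    for h in hooks:
        h.remove()
    return probs / n_tokens

probs = {lang: activation_probs(texts) for lang, texts in corpora.items()}

# A neuron is "Python-specific" here if it fires often on Python and rarely
# on Java; the 0.9 / 0.1 thresholds are illustrative assumptions.
specific = (probs["python"] > 0.9) & (probs["java"] < 0.1)
print("Python-specific neurons per layer:", specific.sum(dim=1).tolist())
```

Under this kind of criterion, a PL that is highly aligned with many other PLs would yield few or no surviving neurons, consistent with the abstract's observation that language-specific neurons cannot be identified for such languages.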
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Semantic Pivots Enable Cross-Lingual Transfer in Large Language Models (2025)
- Language-Specific Latent Process Hinders Cross-Lingual Performance (2025)
- Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders (2025)
- Can you map it to English? The Role of Cross-Lingual Alignment in Multilingual Performance of LLMs (2025)
- Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models (2025)
- Improving Multilingual Language Models by Aligning Representations through Steering (2025)
- Semantic Aware Linear Transfer by Recycling Pre-trained Language Models for Cross-lingual Transfer (2025)