Today's pick in Interpretability & Analysis of LMs: Universal Neurons in GPT2 Language Models by
@wesg
@thorsley
@luminolblue
T.R. Kheirkhah, Q. Sun,
@willhath
@NeelNanda
D. Bertsimas
This work investigates the universality of individual neurons across GPT2 models trained from different initial random seeds, starting from the assumption that such neurons are likely to exhibit interpretable patterns. The authors find that 1-5% of neurons show high activation correlation across five model seeds, i.e., they consistently activate on the same inputs. These universal neurons can be grouped into families with similar functional roles, e.g. modulating the entropy of the next-token prediction, controlling the output norm of an attention head, and promoting or suppressing vocabulary elements in the prediction. Finally, universal neurons are often observed to form antipodal pairs, conjectured to improve the robustness and calibration of model predictions via ensembling.
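A minimal sketch of the kind of cross-seed correlation check described above, assuming MLP activations have already been collected over a shared token dataset for two seeds; the function name, array shapes, and the 0.5 threshold are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def max_cross_model_correlation(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """For each neuron in model A, return its highest |Pearson correlation|
    with any neuron in model B, computed over the same token dataset.

    acts_a: (n_tokens, n_neurons_a) activations from seed A
    acts_b: (n_tokens, n_neurons_b) activations from seed B
    """
    # Standardize each neuron's activations (zero mean, unit variance).
    za = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    zb = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    # Pearson correlation matrix between all cross-model neuron pairs.
    corr = za.T @ zb / acts_a.shape[0]   # (n_neurons_a, n_neurons_b)
    return np.abs(corr).max(axis=1)      # best match per neuron in A

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in for activations of two seeds on 10k shared tokens.
    acts_seed1 = rng.normal(size=(10_000, 512))
    acts_seed2 = rng.normal(size=(10_000, 512))
    best_corr = max_cross_model_correlation(acts_seed1, acts_seed2)
    universal = best_corr > 0.5          # illustrative threshold only
    print(f"{universal.mean():.1%} of neurons exceed the correlation threshold")
```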
Paper: Universal Neurons in GPT2 Language Models (2401.12181)