---
language:
- en
license: mit
pipeline_tag: feature-extraction
library_name: dictionary-learning
---

# Model Card for MAIRA-2-SAE

This is a collection of sparse autoencoders (SAEs) trained on the residual stream of layer 15 of [MAIRA-2](https://huggingface.co/microsoft/maira-2), and described in the preprint ['Insights into a radiology-specialised multimodal large language model with sparse autoencoders'](https://arxiv.org/abs/2507.12950), presented at the [Actionable Interpretability Workshop @ ICML 2025](https://actionable-interpretability.github.io/).

## Model Details

A sparse autoencoder is a model which serves two functions:

- Encoding some input (in this case, model activations) into a "latent space" (in this case, one which is higher dimensional than its input)
- Decoding from the "latent space" back into the input space

SAEs encode such that only a small number of latent dimensions (which we call features) are active for any given input. Specifically, these are Matryoshka BatchTopK SAEs, which are described in [Learning Multi-Level Features with Matryoshka Sparse Autoencoders](https://arxiv.org/abs/2503.17547). Importantly, the decoder is linear, so the SAE reconstructs model activations as a linear combination of (putatively) interpretable feature directions.

We release SAEs with expansion factors 2, 4, and 8. For the SAEs with expansion factors 2 and 4, we also provide LLM-generated interpretations of each feature and their corresponding interpretability scores.

### Model Description

- **Developed by:** Microsoft Research Health Futures
- **Model type:** Autoencoder
- **License:** MIT

## Uses

These SAEs are shared for research purposes only. Their intended use is interpretability analysis of MAIRA-2. Given MAIRA-2 and a data example (e.g. from MIMIC-CXR), one can retrieve the activation strength of all SAE features. This can be used to ascribe interpretations to SAE features, or, given such interpretations, to analyse the workings of MAIRA-2.

### Direct Use

Use of these SAEs requires access to MAIRA-2; see the [MAIRA-2 model card](https://huggingface.co/microsoft/maira-2) for details. Assuming one has extracted the residual stream from layer 15 of MAIRA-2, and processed the activations as described in [the preprint](https://arxiv.org/abs/2507.12950), the SAE can be used to encode this representation into a higher-dimensional space more suitable for interpretation. We provide a usage example below, as well as a sketch of one possible way to extract the activations (shown before the setup instructions).

Analyses specifically of the SAEs are also possible, for example by inspecting the learned dictionary elements (the decoder layer). In this case, the provided feature interpretations may be useful; however, we stress that only a subset of features have meaningful interpretations.

### Out-of-Scope Use

These SAEs were trained on MAIRA-2 activations collected from the MIMIC-CXR findings generation subset of the original MAIRA-2 training dataset. Hence, they may not perform well (in the sense of reconstruction) on other datasets or tasks, whether within MAIRA-2's training distribution (e.g. PadChest, [PadChest-GR](https://ai.nejm.org/doi/full/10.1056/AIdbp2401120)) or on datasets MAIRA-2 was not trained on. Any non-research use of these SAEs is out of scope.

## Bias, Risks, and Limitations

As above, the SAEs were trained and interpreted using the MIMIC-CXR subset of the MAIRA-2 training data. MIMIC-CXR represents a cohort of patients from a single hospital in the USA. Inferences made about MAIRA-2 using these SAEs will necessarily be limited to concepts which could plausibly be discovered using MIMIC-CXR.
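### Extracting activations from MAIRA-2 (sketch)

The Direct Use section above assumes access to the layer-15 residual stream of MAIRA-2. The following is a minimal, hedged sketch of one possible way to capture it with a standard PyTorch forward hook. Model loading follows the MAIRA-2 model card, but the module path to the decoder layers (`model.language_model.model.layers[15]`), the choice of hook point (here, the output of the 15th decoder block), and the `model_inputs` placeholder are assumptions that should be checked against your `transformers`/MAIRA-2 version and [the preprint](https://arxiv.org/abs/2507.12950).

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Load MAIRA-2 as described on its model card.
model = AutoModelForCausalLM.from_pretrained("microsoft/maira-2", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("microsoft/maira-2", trust_remote_code=True)
model.eval()

captured = []

def capture_residual(module, args, output):
    # Decoder layers typically return a tuple whose first element is the hidden
    # state, i.e. the residual stream after this layer.
    hidden = output[0] if isinstance(output, tuple) else output
    captured.append(hidden.detach().float().cpu())

# Assumed module path to decoder layer 15; inspect `model` to confirm it for your
# transformers / MAIRA-2 version.
handle = model.language_model.model.layers[15].register_forward_hook(capture_residual)

# Build the model inputs with the MAIRA-2 processor (chest X-ray image plus the
# findings-generation prompt), following the MAIRA-2 model card.
model_inputs = ...  # placeholder; see the MAIRA-2 model card

with torch.no_grad():
    model(**model_inputs)

handle.remove()
activations = captured[0]  # shape: [batch, seq_len, 4096]
```

Note that the SAEs were trained on activations from which image tokens and boilerplate/templated subsequences were excluded, and which were scaled by a fixed normalization factor (see Training Details below); presumably the same filtering and scaling should be applied before encoding with the SAE.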
## How to Get Started with the Model

### Setup

Install [dictionary_learning](https://github.com/saprmarks/dictionary_learning): `pip install dictionary-learning` or `uv add dictionary-learning`. We used `dictionary_learning` as a submodule at commit `07975f7`, which corresponds to version `0.1.0`.

#### Download weights from the hub

Option 1: Download a single SAE with a specified expansion factor

```python
from huggingface_hub import hf_hub_download

expansion_factor = 2
model_name = f"layer15_res_matryoshka_k256_ef{expansion_factor}.pt"
# Each expansion factor has its own subfolder
ef_subfolder = f"ef{expansion_factor}"
# Specify your own local download directory here if you want
local_dir = "./"
local_path = hf_hub_download(
    repo_id="microsoft/maira-2-sae",
    subfolder=ef_subfolder,
    filename=model_name,
    local_dir=local_dir,
)
```

Option 2: Download all SAEs

```python
from huggingface_hub import snapshot_download

# Specify your own local download directory here if you want
local_dir = "./"
snapshot_download(repo_id="microsoft/maira-2-sae", local_dir=local_dir)
```

### Use SAE to get activations

```python
import torch
from dictionary_learning.trainers.matryoshka_batch_top_k import MatryoshkaBatchTopKSAE

# local_path is the path to the dictionary weights (.pt file), however you downloaded them
ae = MatryoshkaBatchTopKSAE.from_pretrained(local_path)

# get NN activations using your preferred method: hooks, transformer_lens, nnsight, etc.
...
# for now we'll just use random activations
activation_dim = 4096
activations = torch.randn(64, activation_dim)

features = ae.encode(activations)  # get features from activations
reconstructed_activations = ae.decode(features)

# you can also just get the reconstruction ...
reconstructed_activations = ae(activations)
# ... or get the features and reconstruction at the same time
reconstructed_activations, features = ae(activations, output_features=True)
```

## Training Details

### Training Data

We collected activations from the residual stream of layer 15 of MAIRA-2 using the MIMIC-CXR subset of the [MAIRA-2 training/validation set](https://arxiv.org/abs/2406.04449). As detailed in [our preprint](https://arxiv.org/abs/2507.12950), we collected activations from all tokens in the sequence, excluding image tokens and boilerplate/templated subsequences. This resulted in 34.7M tokens for training and 1.7M for validation (respecting the splits used to train MAIRA-2). Following [Gao et al.](https://arxiv.org/abs/2406.04093), we scaled all token activations by a normalization factor of 22.34, the mean L2 norm of the training samples.

### Training Procedure

We trained the SAEs with the open-source [dictionary_learning](https://github.com/saprmarks/dictionary_learning) library, using the `MatryoshkaBatchTopKTrainer`.

#### Training Hyperparameters

- Matryoshka group fractions: [1/2, 1/4, 1/8, 1/16, 1/16]
- k (mean L0 per batch): 256
- Batch size: 8192
- Epochs: 1
- Expansion factors: 2, 4, 8 (multiple models)

Further hyperparameters are listed in [the preprint](https://arxiv.org/abs/2507.12950).
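As a concrete illustration of the activation scaling described under Training Data, the sketch below shows the preprocessing one would apply before encoding new activations with the SAE. It uses plain PyTorch; the reading of "scaled with a normalization factor" as division by that factor, and the assumption that the fixed training-time factor (22.34) is reused for new activations rather than recomputed per batch, are ours, so consult [the preprint](https://arxiv.org/abs/2507.12950) for the exact preprocessing.

```python
import torch

# Residual-stream activations from layer 15 of MAIRA-2 (dummy values here), with
# image-token and boilerplate positions already removed as described above.
activation_dim = 4096
activations = torch.randn(64, activation_dim)

# Normalization factor used during SAE training: the mean L2 norm of the
# training-set activations (22.34, as reported above). For reference, this is
# how such a factor would be computed on a given set of activations:
estimated_factor = activations.norm(dim=-1).mean()

# Reuse the fixed training-time factor for new activations (assumption; see the
# preprint for the exact preprocessing) rather than recomputing it per batch.
norm_factor = 22.34
normalized_activations = activations / norm_factor

# `normalized_activations` is what would be passed to `ae.encode(...)` in the
# usage example above.
```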
## Automated Interpretation

For the SAEs with expansion factors 2 and 4, we also provide automatically-generated interpretations of each feature, again as described in [our preprint](https://arxiv.org/abs/2507.12950). These are the files `autointerp_layer15_res_matryoshka_k256_ef{2,4}.csv`.

These interpretations were generated by showing GPT-4o data samples selected based on their activation strength for that feature. Note that we did not show GPT-4o the images, so these interpretations are necessarily limited. We did not run full automated interpretation on the expansion factor 8 SAE due to its large number of features (32,768).

We scored the quality of the interpretations using the detection scoring approach from [Automatically Interpreting Millions of Features in Large Language Models](https://arxiv.org/abs/2410.13928), wherein the interpretation is provided to an LLM judge (again, GPT-4o), which predicts whether a new sample will activate the feature. As a measure of interpretability, we provide binary classification metrics (accuracy, precision, recall, and F1) for each feature, for both the 'train' samples (those used to generate the interpretation) and the validation samples (held out). We also provide statistics on how often each feature was observed to activate in a random subset of the training set (n), to facilitate further analyses.

## Citation

**BibTeX:**

```
@article{maira2sae,
  title={Insights into a radiology-specialised multimodal large language model with sparse autoencoders},
  author={Kenza Bouzid and Shruthi Bannur and Felix Meissen and Daniel Coelho de Castro and Anton Schwaighofer and Javier Alvarez-Valle and Stephanie L. Hyland},
  journal={Actionable Interpretability Workshop @ ICML 2025},
  year={2025},
  url={https://arxiv.org/abs/2507.12950}
}
```

**APA:**

> Bouzid, K., Bannur, S., Meissen, F., Coelho de Castro, D., Schwaighofer, A., Alvarez-Valle, J., & Hyland, S. L. (2025). Insights into a radiology-specialised multimodal large language model with sparse autoencoders. *Actionable Interpretability Workshop @ ICML 2025*. [arXiv](https://arxiv.org/abs/2507.12950).

## Model Card Contact

- Stephanie Hyland ([`stephanie.hyland@microsoft.com`](mailto:stephanie.hyland@microsoft.com))
- Kenza Bouzid ([`kenza.bouzid@microsoft.com`](mailto:kenza.bouzid@microsoft.com))