Llama 3.2 1B MLA - Multi-head Latent Attention Model (Experimental)

This repository contains a version of Llama 3.2 1B converted to use Multi-head Latent Attention (MLA) instead of Group Query Attention (GQA).

Model Details

  • Base Model: Meta-Llama-3.2-1B
  • Attention Mechanism: Multi-head Latent Attention (MLA)
  • Performance Improvement: Up to approximately 70% faster inference than GQA in our tests, with the same KV cache size (see Performance Benchmarks below)

What is MLA?

Multi-head Latent Attention (MLA) is an attention mechanism introduced in the DeepSeek-V2 paper and further explored in the TransMLA paper. MLA uses low-rank factorization to compress Key (K) and Value (V) representations during attention, significantly reducing the KV cache size while maintaining or even improving model expressivity.

Unlike Group Query Attention (GQA), which simply reduces the number of KV heads, MLA maintains the expressivity of having unique K and V representations for each query head by using factorized projection matrices.
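As a rough illustration (not this repository's actual module), the factorized KV path can be sketched as follows: a shared down-projection compresses the hidden state into a small latent, which is the only tensor cached, and learned up-projections expand that latent back into per-query-head K and V. The dimensions below follow Llama 3.2 1B (hidden size 2048, 32 query heads, head dim 64) with a 1024-wide latent, matching GQA's cached K+V width.

import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Toy MLA-style KV path: cache a small latent, expand it to per-head K and V."""
    def __init__(self, d_model=2048, n_heads=32, d_head=64, d_latent=1024):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)        # compression; its output is cached
        self.k_up = nn.Linear(d_latent, n_heads * d_head, bias=False)  # decompression to per-head K
        self.v_up = nn.Linear(d_latent, n_heads * d_head, bias=False)  # decompression to per-head V

    def forward(self, hidden):                        # hidden: (batch, seq, d_model)
        latent = self.kv_down(hidden)                 # only this (batch, seq, d_latent) tensor is cached
        b, s, _ = hidden.shape
        k = self.k_up(latent).view(b, s, self.n_heads, self.d_head)
        v = self.v_up(latent).view(b, s, self.n_heads, self.d_head)
        return k, v

kv = LatentKV()
k, v = kv(torch.randn(1, 8, 2048))
print(k.shape, v.shape)  # torch.Size([1, 8, 32, 64]) for both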

Advantages over GQA

  • Same KV Cache Size: MLA maintains the same KV cache size as GQA (a quick arithmetic check follows this list)
  • Greater Expressivity: Each Q head can have its own K and V representation (unlike GQA)
  • Better Performance: Significantly faster generation due to better memory utilization
  • No Retraining Required: Conversion can be performed post-training using SVD
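The "same KV cache size" point can be checked with simple arithmetic. The sketch below assumes Llama 3.2 1B's attention configuration (16 layers, 8 KV heads, head dimension 64) and an MLA latent sized to match GQA's combined K+V width, which is how TransMLA keeps the cache unchanged; the numbers are illustrative, not measured.

# Back-of-the-envelope KV cache comparison (per token, bf16 = 2 bytes per value)
n_layers, n_kv_heads, d_head = 16, 8, 64           # Llama 3.2 1B attention config
bytes_per_value = 2

gqa_per_token = n_layers * 2 * n_kv_heads * d_head * bytes_per_value  # K and V per layer
d_latent = 2 * n_kv_heads * d_head                  # latent sized to GQA's K+V width
mla_per_token = n_layers * d_latent * bytes_per_value

print(f"GQA cache per token: {gqa_per_token} bytes")  # 32768
print(f"MLA cache per token: {mla_per_token} bytes")  # 32768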

Implementation Details

The model was converted by factorizing the attention weight matrices with singular value decomposition (SVD); a minimal code sketch follows the list below. The process:

  1. Decomposes the original K and V matrices into low-rank approximations
  2. Creates compression and decompression layers that maintain the same KV cache size as GQA
  3. Preserves the original model's knowledge while improving inference efficiency
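As a rough sketch of step 1, a low-rank factorization can be obtained with a truncated SVD; the shapes and rank below are illustrative and not the exact recipe used to produce this checkpoint.

import torch

def low_rank_factor(weight: torch.Tensor, rank: int):
    """Split a weight matrix W (out x in) into B @ A, with A (rank x in) as the
    compression factor and B (out x rank) as the decompression factor."""
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    B = U[:, :rank] * S[:rank]   # absorb singular values into the up-projection
    A = Vh[:rank, :]             # down-projection ("compression") factor
    return A, B

W = torch.randn(2048, 2048)      # stand-in for an attention projection matrix
A, B = low_rank_factor(W, rank=1024)
print((torch.norm(W - B @ A) / torch.norm(W)).item())  # relative approximation error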

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model (device_map="auto" requires the accelerate package)
model = AutoModelForCausalLM.from_pretrained(
    "BarraHome/llama3.2-1b-mla",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("BarraHome/llama3.2-1b-mla")

# Example chat-style prompt (plain text; the role markers below are not Llama 3 special tokens)
prompt = """<|begin_of_text|><|system|>
You are a helpful, respectful, and honest assistant.
<|user|>
What is Multi-head Latent Attention (MLA)?
<|assistant|>"""

# Generate response
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)

# Print response
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Performance Benchmarks

When compared to the original Llama 3.2 1B model with GQA, our performance tests show:

[Benchmark chart comparing generation speed of the original GQA model and the MLA conversion]

The variation in results likely depends on factors such as GPU utilization, batch size, and system load. In general, the MLA version provides at least comparable performance to the GQA version, with significant speed improvements possible under certain conditions.

Both models maintain the same KV cache memory footprint while the MLA version provides greater expressivity by allowing each query head to have its own unique key and value representations.

Conversion Method

The conversion from GQA to MLA was performed using the approach described in the TransMLA: Multi-Head Latent Attention Is All You Need paper. The key insight is that GQA can always be represented by MLA with the same KV cache overhead, but MLA offers greater expressivity.
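The equivalence can be seen concretely: in GQA, each cached K/V head is shared by a group of query heads, and that sharing is just a fixed linear replication applied after the KV projection. Writing the replication as an up-projection turns GQA into an MLA whose cached latent is exactly GQA's K (or V) output. The numeric check below is a sketch with Llama-3.2-1B-like dimensions, not code from the conversion itself.

import torch

d_model, n_q_heads, n_kv_heads, d_head = 2048, 32, 8, 64
group = n_q_heads // n_kv_heads                      # query heads per shared KV head

W_k = torch.randn(n_kv_heads * d_head, d_model)      # GQA key projection; its output is the cached "latent"
x = torch.randn(4, d_model)                          # a few token hidden states

# GQA: project once, then repeat each KV head across its group of query heads.
k_gqa = (x @ W_k.T).view(-1, n_kv_heads, d_head).repeat_interleave(group, dim=1)

# MLA view: the head replication is itself a fixed up-projection applied to the cached latent.
W_up = (torch.eye(n_kv_heads * d_head)
        .view(n_kv_heads, d_head, -1)
        .repeat_interleave(group, dim=0)
        .reshape(n_q_heads * d_head, -1))
k_mla = (x @ W_k.T @ W_up.T).view(-1, n_q_heads, d_head)

print(torch.allclose(k_gqa, k_mla))                  # True: same outputs, same cached tensor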

Citation

If you use this model in your research or projects, please cite:

@misc{ferrer2025llama32mla,
  title={Llama 3.2 1B MLA - Multi-head Latent Attention},
  author={Ferrer, Alberto},
  year={2025},
  howpublished={\url{https://huggingface.co/BarraHome/llama3.2-1b-mla}}
}

Also consider citing the underlying TransMLA methodology:

@article{meng2025transmla,
  title={TransMLA: Multi-Head Latent Attention Is All You Need},
  author={Meng, Fanxu and Yao, Zengwei and Zhang, Muhan},
  journal={arXiv preprint arXiv:2502.07864},
  year={2025}
}

License

This model is subject to the same license as the original Meta-Llama-3.2-1B model. Please refer to Meta's licensing terms for usage restrictions.

Acknowledgements

  • Developed by Alberto Ferrer (BarraHome)
  • Thanks to the authors of the TransMLA paper for their insights on converting GQA to MLA
  • Thanks to DeepSeek AI for the original introduction of MLA in their DeepSeek-V2 model
  • Thanks to Meta for releasing the Llama 3.2 models