# ReplaceMe

Pruning with a training-free approach
ReplaceMe is a novel method for transformer model compression that enables training-free block/layer pruning while maintaining model performance through linear transformations (LTs). A contiguous block of layers is pruned and replaced by a single LT, estimated from calibration activations and merged into the remaining weights, so no retraining is needed (a minimal sketch follows the results below). Results for Llama 3.1 with 8 layers pruned:
| Method | Pruned layers | Dataset | State | race (acc) | winogrande (acc) | piqa (acc_norm) | boolq (acc) | openbookqa (acc_norm) | sciq (acc_norm) | lambada_openai (acc) | ppl | Avg-acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 (baseline) | - | - | - | 0.450 | 0.779 | 0.810 | 0.842 | 0.430 | 0.961 | 0.732 | 3.404 | 0.712 |
| UIDL* | 8 | slim_orca | no training | 0.341 | 0.719 | 0.690 | 0.773 | 0.310 | 0.719 | 0.087 | 932.000 | 0.592 |
| **ReplaceMe (Ours)** | 8 | slim_orca | no training | **0.406** | **0.742** | **0.706** | **0.830** | **0.338** | **0.901** | **0.471** | **16.760** | **0.654** |
Metrics: acc = accuracy, acc_norm = length-normalized accuracy, ppl = perplexity (lower is better), Avg-acc = average accuracy across the seven tasks. Bold marks the best result among the pruned models.
Our training-free method achieves 92.5% of baseline performance, while other approaches require expensive retraining!
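At its core, the LSTSQ variant solves an ordinary least-squares problem that maps the activations entering the pruned block to the activations leaving it. The sketch below is our own illustration of that idea, not the ReplaceMe API; the activation tensors, sizes, and variable names are assumptions.

```python
# Minimal sketch (illustration only, not the ReplaceMe API): estimate a linear
# transform T that maps activations entering a pruned block to the activations
# leaving it, via ordinary least squares.
import torch

num_tokens, hidden = 4096, 4096          # assumed calibration size / model width
X = torch.randn(num_tokens, hidden)      # stand-in: hidden states entering the block
Y = torch.randn(num_tokens, hidden)      # stand-in: hidden states leaving the block

# Solve min_T ||X @ T - Y||_F^2 in closed form.
T = torch.linalg.lstsq(X, Y).solution    # shape: (hidden, hidden)

# The pruned block is then replaced by the single linear map h -> h @ T.
rel_err = torch.norm(X @ T - Y) / torch.norm(Y)
print(f"relative reconstruction error: {rel_err:.3f}")
```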
## Installation

```bash
pip install replaceme
# or install from source
git clone https://github.com/mts-ai/ReplaceMe
cd ReplaceMe
pip install -e .
```
## Usage

```bash
# LSTSQ method (recommended)
run_replaceme --config ./reproduce/Replace_Me_pipeline_lstsq.yaml

# Cosine similarity method
run_replaceme --config ./reproduce/Replace_Me_pipeline_cosine.yaml
```
There are many parameters you can play with; visit our repo to discover them.
As noted above, the estimated LTs are merged into the original transformer architecture, so you use the pruned model exactly as you would any other model.
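For intuition on the merge step, folding an LT into an adjacent weight matrix leaves the architecture unchanged. The sketch below is an illustrative assumption (Llama-style shapes and names, not the library's code):

```python
# Illustrative sketch (our assumption, not the ReplaceMe code): fold the
# estimated transform T into the output projection preceding the pruned block.
import torch

hidden, intermediate = 4096, 14336
W_down = torch.randn(hidden, intermediate)  # stand-in for mlp.down_proj.weight
T = torch.randn(hidden, hidden)             # stand-in for the estimated LT

# Applying T after the projection: (x @ W_down.T) @ T == x @ (T.T @ W_down).T,
# so the merged weight is simply T.T @ W_down.
W_merged = T.T @ W_down
```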
## EXAMPLE
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MTSAIR/Llama3.1-6B-ReplaceMe"

# Load the pruned model like any standard Hugging Face checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is the ReplaceMe pruning method?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(
    **model_inputs,
    max_new_tokens=512
)
response = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print(response)
```
## Citation

If you use ReplaceMe in your research, please cite our paper:
```bibtex
@article{shopkhoev2025replaceme0,
  title   = {ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations},
  author  = {Dmitriy Shopkhoev and Ammar Ali and Magauiya Zhussip and Valentin Malykh and Stamatios Lefkimmiatis and Nikos Komodakis and Sergey Zagoruyko},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.02819}
}
```