AirRep-Flan
This repository contains the AirRep model presented in the paper "Enhancing Training Data Attribution with Representational Optimization".
AirRep is an embedding model designed for computing training data influence on test examples.
Code: https://github.com/sunnweiwei/airrep
Model Description
This model is based on the gte-small configuration with an additional projection layer.
Sample Usage
You can use the FLAN-trained model to encode training and test data and compute similarity scores.
from airrep import AirRep
model = AirRep.from_pretrained("sunweiwei/AirRep-Flan-Small")
train_texts = [
    "Question: Classify the sentiment of 'The movie was wonderful and heartwarming.'\nAnswer: positive",
    "Question: Does the hypothesis entail the premise? Premise: 'A man is playing a guitar on stage.' Hypothesis: 'Someone is performing music.'\nAnswer: entailment",
]
query_texts = [
    "Question: Classify the sentiment of 'The service was awful and I won't return.'\nAnswer: negative"
]
# Embeddings and influence-like similarity score
train_emb = model.encode(train_texts, batch_size=128)
query_emb = model.encode(query_texts)
score = model.similarity(query_emb, train_emb, softmax=True)
print("Similarity score:", score)
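Once scores are available, a common next step is to rank the training examples for each query. The sketch below is not part of the airrep package: `top_k_influential` is a hypothetical helper, and it assumes the scores can be converted to a 2D array of shape (num_queries, num_train). The return type of `model.similarity` is not documented here, so a conversion step (for example to a NumPy array) may be needed first. Stand-in values are used so the snippet runs without the model.

```python
import numpy as np

def top_k_influential(scores, train_texts, k=1):
    """Hypothetical helper: for each query, return the k highest-scoring
    training texts as (text, score) pairs, best first."""
    scores = np.asarray(scores, dtype=float)
    if scores.ndim == 1:
        # Treat a flat vector as the scores for a single query.
        scores = scores[None, :]
    # Sort each row in descending order and keep the first k indices.
    order = np.argsort(-scores, axis=1)[:, :k]
    return [
        [(train_texts[j], float(row[j])) for j in idx]
        for row, idx in zip(scores, order)
    ]

# Stand-in values (assumed layout: one row per query, one column per training text).
demo_train = ["sentiment example", "entailment example"]
demo_scores = [[0.8, 0.2]]
ranked = top_k_influential(demo_scores, demo_train, k=2)
print(ranked)
```

With real outputs, the stand-in values would be replaced by `score` and `train_texts` from the example above, provided the shape assumption holds.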
Training Data
This model was trained on the FLAN dataset with data influence optimization.
Citation
If you use this model, please cite:
@inproceedings{Sun2025AirRep,
  title = {Enhancing Training Data Attribution with Representational Optimization},
  author = {Weiwei Sun and Haokun Liu and Nikhil Kandpal and Colin Raffel and Yiming Yang},
  booktitle = {NeurIPS},
  year = {2025},
  url = {https://arxiv.org/abs/2505.18513}
}