AirRep-Flan

This repository contains the AirRep model presented in the paper "Enhancing Training Data Attribution with Representational Optimization".

AirRep is an embedding model designed for computing training data influence on test examples.

Code: https://github.com/sunnweiwei/airrep

Model Description

This model uses the gte-small configuration with an additional projection layer.
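
To make "encoder plus projection layer" concrete, here is a minimal pure-Python sketch of one common arrangement: token vectors are mean-pooled and the pooled vector is passed through a linear projection. The pooling strategy, the shapes, and every name and weight below are assumptions for illustration only; this card does not state how AirRep pools encoder outputs or what the projection's output size is.

```python
# Illustrative only: mean-pool token vectors, then apply a linear projection.
# All names, shapes, and weights are hypothetical, not AirRep's parameters.

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def project(vector, weight, bias):
    """Linear projection: one output per row of `weight` (out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]

# Toy input: 3 tokens with hidden size 4; the last token is padding.
tokens = [[1.0, 2.0, 3.0, 4.0],
          [3.0, 2.0, 1.0, 0.0],
          [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]

# Hypothetical projection from 4 dimensions down to 2.
weight = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.5]

pooled = mean_pool(tokens, mask)
embedding = project(pooled, weight, bias)
print(embedding)
```

In a real model the projection weights are learned during training; the toy values here only show the data flow.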

Sample Usage

You can use the FLAN-trained model to encode training and test data and compute similarity scores.

from airrep import AirRep

model = AirRep.from_pretrained("sunweiwei/AirRep-Flan-Small")

train_texts = [
    "Question: Classify the sentiment of 'The movie was wonderful and heartwarming.'\nAnswer: positive",
    "Question: Does the hypothesis entail the premise? Premise: 'A man is playing a guitar on stage.' Hypothesis: 'Someone is performing music.'\nAnswer: entailment",
]
query_texts = [
    "Question: Classify the sentiment of 'The service was awful and I won't return.'\nAnswer: negative"
]

# Embeddings and influence-like similarity score
train_emb = model.encode(train_texts, batch_size=128)
query_emb = model.encode(query_texts)
score = model.similarity(query_emb, train_emb, softmax=True)
print("Similarity score:", score)
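
For intuition about what a softmax-normalized similarity score can look like, the sketch below computes dot products between each query embedding and every training embedding, then applies a softmax across the training examples. This is only one plausible reading and an assumption on our part: the actual behavior of `model.similarity(..., softmax=True)` is defined in the linked repository. The helper names (`similarity_sketch`, `rank_training_examples`) and the toy embeddings are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_sketch(query_emb, train_emb, use_softmax=True):
    """Return one row of scores per query, one column per training example."""
    rows = []
    for q in query_emb:
        scores = [dot(q, t) for t in train_emb]
        rows.append(softmax(scores) if use_softmax else scores)
    return rows

def rank_training_examples(score_row):
    """Indices of training examples, highest score first."""
    return sorted(range(len(score_row)), key=lambda i: score_row[i], reverse=True)

# Toy 2-dimensional embeddings (made up for illustration).
train_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_emb = [[2.0, 1.0]]

scores = similarity_sketch(query_emb, train_emb)
ranking = rank_training_examples(scores[0])
print(scores, ranking)
```

Because softmax preserves the order of the raw scores, the ranking of training examples is the same with or without normalization; the softmax only turns each row into weights that sum to 1.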

Training Data

This model was trained on the FLAN dataset with data influence optimization.

Citation

If you use this model, please cite:

@inproceedings{Sun2025AirRep,
  title = {Enhancing Training Data Attribution with Representational Optimization},
  author = {Weiwei Sun and Haokun Liu and Nikhil Kandpal and Colin Raffel and Yiming Yang},
  booktitle = {NeurIPS},
  year = {2025},
  url = {https://arxiv.org/abs/2505.18513}
}

Model size: 33.4M params (Safetensors). Tensor types: F32, BF16.