---
base_model:
- google/flan-t5-small
- google/flan-t5-large
- google/flan-t5-xl
---
## 🧠 Flan-T5-{Small|Large|XL}-RPO
> πŸ”¬ Fine-tuned with **Reward Partitioning Optimization (RPO)** β€” a value-free, stable method for single-trajectory reinforcement learning with scalar feedback.
---
### πŸ“Œ Model Summary
This model is a fine-tuned variant of the [Flan-T5](https://huggingface.co/google/flan-t5) {Small|Large|XL} checkpoint, trained with **Reward Partitioning Optimization (RPO)**. RPO learns from single-trajectory scalar feedback (e.g., thumbs up/down) and removes the need to learn a value function or to collect preference pairs.
* ✅ Trained with only (prompt, response, reward) triplets (see the data sketch after this list).
* πŸ” No joint optimization, no auxiliary models.
* πŸš€ Efficient and stable training.
* πŸ€– Strong preference alignment (evaluated by LLM-as-a-judge).
* πŸ“Š Outperforms KTO and DRO in automatic metrics and LLM preference winrate.
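As a concrete illustration of the feedback format, the sketch below shows what training triplets could look like. The field names (`prompt`, `response`, `reward`) and the reward values are illustrative assumptions, not the exact schema used for training.

```python
# Illustrative only: RPO consumes plain (prompt, response, reward) triplets --
# no preference pairs, no value model. Several responses may share one prompt.
triplets = [
    {
        "prompt": "How can I improve my productivity working from home?",
        "response": "Set a fixed schedule, create a dedicated workspace, ...",
        "reward": 0.9,   # scalar feedback, e.g. derived from a thumbs up
    },
    {
        "prompt": "How can I improve my productivity working from home?",
        "response": "Just work more hours.",
        "reward": 0.2,   # lower scalar feedback, e.g. a thumbs down
    },
]
```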
---
### πŸ§ͺ Training Details
* **Base Model:** `flan-t5-{small|large|xl}`
* **Dataset:** [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) β€” high-quality (prompt, response, reward) triplets with multiple completions per prompt.
* **Feedback Format:** scalar reward (e.g., \[prompt, response, reward]).
* **GPU Used:** 1Γ— A100 (80GB)
* **Training Objective:** RPO supervised learning with partitioned reward normalization (an illustrative sketch follows this list).
* **Baselines Compared:** DRO and KTO.
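For intuition only, here is a minimal sketch of one way "partitioned reward normalization" could be realized: rewards for completions of the same prompt are normalized into weights within that prompt's partition (a softmax is used here as one possible choice), and the policy is trained with a reward-weighted maximum-likelihood loss. This is an assumption based on the description above, not the exact objective from the paper.

```python
# Hypothetical sketch, NOT the paper's exact RPO objective: it only illustrates
# partitioning rewards by prompt and normalizing within each partition.
from collections import defaultdict
import torch
import torch.nn.functional as F

def partition_weights(triplets, beta=1.0):
    """Normalize rewards within each prompt's partition (softmax is one choice)."""
    by_prompt = defaultdict(list)
    for i, t in enumerate(triplets):
        by_prompt[t["prompt"]].append(i)
    weights = torch.zeros(len(triplets))
    for idxs in by_prompt.values():
        rewards = torch.tensor([triplets[i]["reward"] for i in idxs])
        weights[idxs] = F.softmax(rewards / beta, dim=0)
    return weights

def rpo_like_loss(log_probs, weights):
    """Reward-weighted negative log-likelihood over the batch.

    log_probs[i] = log pi(response_i | prompt_i) under the current policy.
    """
    return -(weights * log_probs).mean()
```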
---
### πŸ€– Inference
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Replace {small|large|xl} with the size you want, e.g. "bilalfaye/flan-t5-xl-rpo".
model_name = "bilalfaye/flan-t5-{small|large|xl}-rpo"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
prompt = "How can I improve my productivity working from home?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    **inputs,  # passes input_ids and attention_mask
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```
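Alternatively, generation can go through the Transformers `text2text-generation` pipeline; the settings below simply mirror the ones above, with the same size placeholder in the model name.

```python
import torch
from transformers import pipeline

# Same placeholder as above: pick small, large, or xl.
generator = pipeline(
    "text2text-generation",
    model="bilalfaye/flan-t5-{small|large|xl}-rpo",
    device=0 if torch.cuda.is_available() else -1,
)

result = generator(
    "How can I improve my productivity working from home?",
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)
print(result[0]["generated_text"])
```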
---
### πŸ“ˆ Evaluation Summary
| Judge (LLM-as-a-judge) | RPO Win Rate vs DRO | RPO Win Rate vs KTO | RPO Win Rate vs SFT |
| ---------------------- | ------------------- | ------------------- | ------------------- |
| Mistral | βœ… **83–93%** | βœ… **82–93%** | βœ… **82–84%** |
| LLaMA | βœ… **67–74%** | βœ… **65–72%** | βœ… **63–73%** |
---
### βœ… Use Cases
* Aligned conversational agents
* Helpful, non-toxic instruction following
* Scalar feedback training pipelines
* Preference-optimized generation (without pairwise preference labels)
---
### πŸ“š Citation
If you use this model, please cite the following paper:
```bibtex
@article{faye2025rpo,
title = {Value-Free Policy Optimization via Reward Partitioning},
author = {Bilal Faye and Hanane Azzag and Mustapha Lebbah},
journal = {arXiv preprint arXiv:2406.XXXX},
year = {2025}
}
```
---
### πŸ”— Related Models
* `bilalfaye/flan-t5-small-rpo`
* `bilalfaye/flan-t5-large-rpo`
* `bilalfaye/flan-t5-xl-rpo`