---
base_model:
- google/flan-t5-small
- google/flan-t5-large
- google/flan-t5-xl
---

## 🧠 Flan-T5-{Small|Large|XL}-RPO

> 🔬 Fine-tuned with **Reward Partitioning Optimization (RPO)**, a value-free, stable method for single-trajectory reinforcement learning with scalar feedback.

---

### 📌 Model Summary

This model is a fine-tuned variant of the [Flan-T5](https://huggingface.co/google/flan-t5) {Small|Large|XL} checkpoint, trained with **Reward Partitioning Optimization (RPO)**. RPO is a method for learning from single-trajectory scalar feedback (e.g., thumbs up/down) that eliminates the need for learned value functions or preference pairs.

* ✅ Trained with only (prompt, response, reward) triplets.
* 🔁 No joint optimization, no auxiliary models.
* 🚀 Efficient and stable training.
* 🤖 Strong preference alignment (evaluated by LLM-as-a-judge).
* 📊 Outperforms KTO and DRO in automatic metrics and LLM preference win rate.

---

### 🧪 Training Details

* **Base Model:** `flan-t5-{small|large|xl}`
* **Dataset:** [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback), high-quality (prompt, response, reward) triplets with multiple completions per prompt (see the illustrative sketches at the end of this card).
* **Feedback Format:** scalar reward, i.e., (prompt, response, reward) triplets.
* **GPU Used:** 1× A100 (80 GB)
* **Training Objective:** RPO supervised learning using partitioned reward normalization.
* **Baselines Compared:** DRO and KTO.

---

### 🤖 Inference

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "bilalfaye/flan-t5-{small|large|xl}-rpo"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

prompt = "How can I improve my productivity working from home?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

outputs = model.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```

---

### 📈 Evaluation Summary

| Judge   | Win Rate vs DRO | Win Rate vs KTO | Win Rate vs SFT |
| ------- | --------------- | --------------- | --------------- |
| Mistral | ✅ **83–93%**    | ✅ **82–93%**    | ✅ **82–84%**    |
| LLaMA   | ✅ **67–74%**    | ✅ **65–72%**    | ✅ **63–73%**    |

---

### ✅ Use Cases

* Aligned conversational agents
* Helpful, non-toxic instruction following
* Scalar feedback training pipelines
* Preference-optimized generation (without pairwise preference labels)

---

### 📚 Citation

If you use this model, please cite the following paper:

```bibtex
@article{faye2025rpo,
  title   = {Value-Free Policy Optimization via Reward Partitioning},
  author  = {Bilal Faye and Hanane Azzag and Mustapha Lebbah},
  journal = {arXiv preprint arXiv:2406.XXXX},
  year    = {2025}
}
```

---

### 🔗 Related Models

* `bilalfaye/flan-t5-small-rpo`
* `bilalfaye/flan-t5-large-rpo`
* `bilalfaye/flan-t5-xl-rpo`
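
---

### 🛠️ Illustrative: Building (prompt, response, reward) Triplets

The Training Details above describe flattening UltraFeedback into (prompt, response, reward) triplets, with multiple completions per prompt. The sketch below shows one possible way to do that flattening. The split and field names (`instruction`, `completions`, `response`, `overall_score`) reflect our reading of the public UltraFeedback dataset card, and the helper `to_triplets` is hypothetical; this is not necessarily the exact preprocessing used to train this model.

```python
from datasets import load_dataset

# Load the raw UltraFeedback dataset (split and field names are assumptions
# based on the public dataset card, not this model's training pipeline).
dataset = load_dataset("openbmb/UltraFeedback", split="train")

def to_triplets(example):
    """Flatten one prompt and its multiple completions into scalar-feedback triplets."""
    prompt = example["instruction"]
    return [
        {
            "prompt": prompt,
            "response": completion["response"],
            "reward": float(completion["overall_score"]),  # scalar feedback signal
        }
        for completion in example["completions"]
    ]

# Build a flat list of (prompt, response, reward) records.
triplets = []
for example in dataset:
    triplets.extend(to_triplets(example))

print(len(triplets), triplets[0]["reward"])
```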
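
---

### 🧮 Illustrative: Reward-Weighted Training on Triplets

This card describes RPO as supervised learning on (prompt, response, reward) triplets with partitioned reward normalization and no value function; the exact objective is defined in the paper. The snippet below is only a minimal sketch of the general idea of weighting per-response negative log-likelihoods by rewards normalized within a prompt's group of completions (here via a softmax with a hypothetical temperature `beta`). The function name `reward_partitioned_step` and all numeric choices are assumptions for illustration, not the paper's RPO objective.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

def reward_partitioned_step(prompt, responses, rewards, beta=1.0):
    """Sketch: weight per-response negative log-likelihoods by rewards
    normalized across this prompt's completions (softmax partition).
    NOT the exact RPO objective from the paper."""
    rewards = torch.tensor(rewards, dtype=torch.float32)
    weights = F.softmax(rewards / beta, dim=0)  # normalize rewards within the prompt's partition

    enc = tokenizer([prompt] * len(responses), return_tensors="pt", padding=True)
    labels = tokenizer(responses, return_tensors="pt", padding=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss

    logits = model(
        input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels
    ).logits
    # Per-sequence negative log-likelihood (summed over tokens, padding ignored).
    nll = F.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=-100, reduction="none"
    ).sum(dim=1)
    return (weights * nll).sum()

loss = reward_partitioned_step(
    "How can I improve my productivity working from home?",
    ["Make a schedule, take regular breaks, and set clear boundaries.", "Just work more hours."],
    rewards=[8.0, 3.0],
)
loss.backward()  # an optimizer step would follow in a real training loop
```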