---
base_model:
- google/flan-t5-small
- google/flan-t5-large
- google/flan-t5-xl
---
## Flan-T5-{Small|Large|XL}-RPO
> Fine-tuned with **Reward Partitioning Optimization (RPO)**, a value-free, stable method for single-trajectory reinforcement learning with scalar feedback.
---
### Model Summary
This model is a fine-tuned variant of the [Flan-T5](https://huggingface.co/google/flan-t5) {Small|Large|XL} checkpoint, trained with **Reward Partitioning Optimization (RPO)**. RPO learns from single-trajectory scalar feedback (e.g., thumbs up/down) and removes the need for a learned value function or preference pairs; an illustrative sketch of the objective follows the list below.
* Trained with only (prompt, response, reward) triplets.
* No joint optimization, no auxiliary models.
* Efficient and stable training.
* Strong preference alignment (evaluated by LLM-as-a-judge).
* Outperforms KTO and DRO in automatic metrics and LLM preference win rate.
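The sketch below illustrates the kind of objective this implies. It is **not** the exact loss from the RPO paper: it simply assumes that the scalar rewards of the K completions of a prompt are normalized ("partitioned") with a softmax and used as per-sequence weights on the standard seq2seq cross-entropy. The temperature `beta`, the function name, and the batching scheme are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def rpo_style_loss(model, tokenizer, prompt, completions, rewards, beta=1.0, device="cpu"):
    """Weighted seq2seq NLL for one prompt and its K scored completions (illustrative only)."""
    rewards = torch.tensor(rewards, dtype=torch.float, device=device)
    # "Partition" the scalar rewards across the K completions of this prompt.
    weights = torch.softmax(rewards / beta, dim=0)                     # (K,)

    enc = tokenizer([prompt] * len(completions), return_tensors="pt",
                    padding=True, truncation=True).to(device)
    labels = tokenizer(completions, return_tensors="pt",
                       padding=True, truncation=True).input_ids.to(device)
    labels[labels == tokenizer.pad_token_id] = -100                    # mask padding in the loss

    logits = model(input_ids=enc.input_ids,
                   attention_mask=enc.attention_mask,
                   labels=labels).logits                               # (K, T, vocab)
    # Per-completion negative log-likelihood of the response tokens.
    nll = F.cross_entropy(logits.transpose(1, 2), labels,
                          reduction="none", ignore_index=-100).sum(dim=1)  # (K,)
    # Weight each completion's NLL by its partitioned reward and sum.
    return (weights * nll).sum()

# Usage (assumes `model` and `tokenizer` are a Flan-T5 checkpoint already on `device`):
# loss = rpo_style_loss(model, tokenizer, prompt, ["response A", "response B"], [7.5, 2.0], device=device)
# loss.backward()
```

Because the weights come from the rewards of the same prompt's completions rather than from a learned critic, no value network or auxiliary model is needed, which matches the value-free property described above.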
---
### Training Details
* **Base Model:** `flan-t5-{small|large|xl}`
* **Dataset:** [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback), high-quality (prompt, response, reward) triplets with multiple completions per prompt (see the data-loading sketch after this list).
* **Feedback Format:** scalar reward, i.e. (prompt, response, reward) triplets.
* **GPU Used:** 1× A100 (80 GB)
* **Training Objective:** RPO supervised learning using partitioned reward normalization.
* **Baselines Compared:** DRO and KTO.
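For reference, here is a minimal sketch of flattening UltraFeedback into (prompt, response, reward) triplets with the `datasets` library. The field names (`instruction`, `completions`, `response`, `overall_score`) are assumptions based on the public dataset card and should be verified before training.

```python
from datasets import load_dataset

# Field names below are assumptions; check the UltraFeedback dataset card before use.
ds = load_dataset("openbmb/UltraFeedback", split="train")

def to_triplets(batch):
    prompts, responses, rewards = [], [], []
    for instruction, completions in zip(batch["instruction"], batch["completions"]):
        for completion in completions:
            prompts.append(instruction)
            responses.append(completion["response"])
            rewards.append(float(completion["overall_score"]))
    return {"prompt": prompts, "response": responses, "reward": rewards}

# batched=True lets one source row expand into several (prompt, response, reward) rows.
triplets = ds.map(to_triplets, batched=True, remove_columns=ds.column_names)
```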
---
### Inference
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Replace {small|large|xl} with the checkpoint size you want to use.
model_name = "bilalfaye/flan-t5-{small|large|xl}-rpo"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

prompt = "How can I improve my productivity working from home?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sampling-based generation with repetition controls.
outputs = model.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```
---
### Evaluation Summary
| Judge   | Win Rate vs DRO | Win Rate vs KTO | Win Rate vs SFT |
| ------- | --------------- | --------------- | --------------- |
| Mistral | **83–93%**      | **82–93%**      | **82–84%**      |
| LLaMA   | **67–74%**      | **65–72%**      | **63–73%**      |
---
### Use Cases
* Aligned conversational agents
* Helpful, non-toxic instruction following
* Scalar feedback training pipelines
* Preference-optimized generation (without pairwise preference labels)
---
### Citation
If you use this model, please cite the following paper:
```bibtex
@article{faye2025rpo,
title = {Value-Free Policy Optimization via Reward Partitioning},
author = {Bilal Faye and Hanane Azzag and Mustapha Lebbah},
journal = {arXiv preprint arXiv:2406.XXXX},
year = {2025}
}
```
---
### Related Models
* `bilalfaye/flan-t5-small-rpo`
* `bilalfaye/flan-t5-large-rpo`
* `bilalfaye/flan-t5-xl-rpo`