---
base_model:
- google/flan-t5-small
- google/flan-t5-large
- google/flan-t5-xl
---

## Flan-T5-{Small|Large|XL}-RPO

> Fine-tuned with **Reward Partitioning Optimization (RPO)**: a value-free, stable method for single-trajectory reinforcement learning with scalar feedback.

---

### Model Summary

This model is a fine-tuned variant of the [Flan-T5](https://huggingface.co/google/flan-t5) {Small|Large|XL} checkpoint, trained with **Reward Partitioning Optimization (RPO)**. RPO is a method for learning from single-trajectory scalar feedback (e.g., thumbs up/down) that removes the need for learned value functions or pairwise preference data.

* Trained with only (prompt, response, reward) triplets (see the sketch after this list).
* No joint optimization, no auxiliary models.
* Efficient and stable training.
* Strong preference alignment (evaluated by LLM-as-a-judge).
* Outperforms KTO and DRO in automatic metrics and LLM preference win rate.

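
For concreteness, here is a minimal sketch of the scalar-feedback records RPO consumes. The field names and values are illustrative only, not the exact schema used during training.

```python
# Illustrative (prompt, response, reward) triplets; the reward is any scalar
# feedback signal, e.g. a normalized thumbs-up/down or a judge score.
triplets = [
    {
        "prompt": "Summarize photosynthesis in one sentence.",
        "response": "Photosynthesis is the process by which plants turn light into chemical energy.",
        "reward": 0.92,
    },
    {
        "prompt": "Summarize photosynthesis in one sentence.",
        "response": "Plants are green.",
        "reward": 0.15,  # weaker completion for the same prompt
    },
]
```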

---

### Training Details

* **Base Model:** `flan-t5-{small|large|xl}`
* **Dataset:** [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback), which provides multiple scored completions per prompt, flattened into high-quality (prompt, response, reward) triplets (a loading sketch follows this list).
* **Feedback Format:** scalar reward, i.e., (prompt, response, reward).
* **GPU Used:** 1× A100 (80 GB)
* **Training Objective:** RPO supervised learning using partitioned reward normalization.
* **Baselines Compared:** DRO and KTO.

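
As a rough illustration, the flattening step might look like the following. The field names (`instruction`, `completions`, `response`, `overall_score`) are assumptions based on the UltraFeedback dataset card and should be verified against the actual schema; this is not the training pipeline's own code.

```python
from datasets import load_dataset

# Assumed UltraFeedback schema: each row carries an "instruction" and a list of
# "completions", each with a "response" text and a scalar "overall_score".
ds = load_dataset("openbmb/UltraFeedback", split="train")

triplets = []
for row in ds:
    for completion in row["completions"]:
        triplets.append(
            {
                "prompt": row["instruction"],
                "response": completion["response"],
                "reward": float(completion["overall_score"]),  # scalar feedback
            }
        )
```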

---

### Inference

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Replace the placeholder with a concrete checkpoint, e.g. "bilalfaye/flan-t5-small-rpo".
model_name = "bilalfaye/flan-t5-{small|large|xl}-rpo"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

prompt = "How can I improve my productivity working from home?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sampling-based decoding with the card's default generation parameters.
outputs = model.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```
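
For quick experiments, the high-level `pipeline` API is an equivalent option. The snippet below is a sketch that assumes one of the concrete checkpoints listed under Related Models (here `bilalfaye/flan-t5-small-rpo`).

```python
from transformers import pipeline

# text2text-generation wraps tokenization, generation, and decoding in one call.
generator = pipeline(
    "text2text-generation",
    model="bilalfaye/flan-t5-small-rpo",
    device_map="auto",
)

result = generator(
    "How can I improve my productivity working from home?",
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
)
print(result[0]["generated_text"])
```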

---

### Evaluation Summary

| Judge   | Win Rate vs DRO | Win Rate vs KTO | Win Rate vs SFT |
| ------- | --------------- | --------------- | --------------- |
| Mistral | **83–93%**      | **82–93%**      | **82–84%**      |
| LLaMA   | **67–74%**      | **65–72%**      | **63–73%**      |


---

### Use Cases

* Aligned conversational agents
* Helpful, non-toxic instruction following
* Scalar feedback training pipelines
* Preference-optimized generation (without pairwise preference labels)


---

### Citation

If you use this model, please cite the following paper:

```bibtex
@article{faye2025rpo,
  title   = {Value-Free Policy Optimization via Reward Partitioning},
  author  = {Bilal Faye and Hanane Azzag and Mustapha Lebbah},
  journal = {arXiv preprint arXiv:2406.XXXX},
  year    = {2025}
}
```

---

### Related Models

* `bilalfaye/flan-t5-small-rpo`
* `bilalfaye/flan-t5-large-rpo`
* `bilalfaye/flan-t5-xl-rpo`