---
base_model:
- google/flan-t5-small
- google/flan-t5-large
- google/flan-t5-xl
---
## Flan-T5-{Small|Large|XL}-RPO
> Fine-tuned with **Reward Partitioning Optimization (RPO)**, a value-free, stable method for single-trajectory reinforcement learning with scalar feedback.
---
### Model Summary
This model is a fine-tuned variant of the [Flan-T5](https://huggingface.co/google/flan-t5) {Small|Large|XL} checkpoint, trained with **Reward Partitioning Optimization (RPO)**. RPO learns from single-trajectory scalar feedback (e.g., thumbs up/down) and removes the need for a learned value function or preference pairs; an illustrative sketch of the objective follows the list below.
* Trained with only (prompt, response, reward) triplets.
* No joint optimization, no auxiliary models.
* Efficient and stable training.
* Strong preference alignment (evaluated by LLM-as-a-judge).
* Outperforms KTO and DRO in automatic metrics and LLM preference win rate.
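The sketch below illustrates the kind of objective this implies. It is **not** the exact loss from the RPO paper: it simply assumes that the scalar rewards of the K completions of a prompt are normalized ("partitioned") with a softmax and used as per-sequence weights on the standard seq2seq cross-entropy. The temperature `beta`, the function name, and the batching scheme are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def rpo_style_loss(model, tokenizer, prompt, completions, rewards, beta=1.0, device="cpu"):
    """Weighted seq2seq NLL for one prompt and its K scored completions (illustrative only)."""
    rewards = torch.tensor(rewards, dtype=torch.float, device=device)
    # "Partition" the scalar rewards across the K completions of this prompt.
    weights = torch.softmax(rewards / beta, dim=0)                     # (K,)

    enc = tokenizer([prompt] * len(completions), return_tensors="pt",
                    padding=True, truncation=True).to(device)
    labels = tokenizer(completions, return_tensors="pt",
                       padding=True, truncation=True).input_ids.to(device)
    labels[labels == tokenizer.pad_token_id] = -100                    # mask padding in the loss

    logits = model(input_ids=enc.input_ids,
                   attention_mask=enc.attention_mask,
                   labels=labels).logits                               # (K, T, vocab)
    # Per-completion negative log-likelihood of the response tokens.
    nll = F.cross_entropy(logits.transpose(1, 2), labels,
                          reduction="none", ignore_index=-100).sum(dim=1)  # (K,)
    # Weight each completion's NLL by its partitioned reward and sum.
    return (weights * nll).sum()

# Usage (assumes `model` and `tokenizer` are a Flan-T5 checkpoint already on `device`):
# loss = rpo_style_loss(model, tokenizer, prompt, ["response A", "response B"], [7.5, 2.0], device=device)
# loss.backward()
```

Because the weights come from the rewards of the same prompt's completions rather than from a learned critic, no value network or auxiliary model is needed, which matches the value-free property described above.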
---
### Training Details
* **Base Model:** `flan-t5-{small|large|xl}`
* **Dataset:** [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback), high-quality (prompt, response, reward) triplets with multiple completions per prompt (see the data-loading sketch after this list).
* **Feedback Format:** scalar reward, i.e. (prompt, response, reward) triplets.
* **GPU Used:** 1× A100 (80 GB)
* **Training Objective:** RPO supervised learning using partitioned reward normalization.
* **Baselines Compared:** DRO and KTO.
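For reference, here is a minimal sketch of flattening UltraFeedback into (prompt, response, reward) triplets with the `datasets` library. The field names (`instruction`, `completions`, `response`, `overall_score`) are assumptions based on the public dataset card and should be verified before training.

```python
from datasets import load_dataset

# Field names below are assumptions; check the UltraFeedback dataset card before use.
ds = load_dataset("openbmb/UltraFeedback", split="train")

def to_triplets(batch):
    prompts, responses, rewards = [], [], []
    for instruction, completions in zip(batch["instruction"], batch["completions"]):
        for completion in completions:
            prompts.append(instruction)
            responses.append(completion["response"])
            rewards.append(float(completion["overall_score"]))
    return {"prompt": prompts, "response": responses, "reward": rewards}

# batched=True lets one source row expand into several (prompt, response, reward) rows.
triplets = ds.map(to_triplets, batched=True, remove_columns=ds.column_names)
```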
---
### Inference
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Replace {small|large|xl} with the checkpoint size you want to use.
model_name = "bilalfaye/flan-t5-{small|large|xl}-rpo"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

prompt = "How can I improve my productivity working from home?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sampling-based generation with repetition controls.
outputs = model.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```
---
### Evaluation Summary
| Judge   | Win Rate vs DRO | Win Rate vs KTO | Win Rate vs SFT |
| ------- | --------------- | --------------- | --------------- |
| Mistral | **83–93%**      | **82–93%**      | **82–84%**      |
| LLaMA   | **67–74%**      | **65–72%**      | **63–73%**      |
---
### Use Cases
* Aligned conversational agents
* Helpful, non-toxic instruction following
* Scalar feedback training pipelines
* Preference-optimized generation (without pairwise preference labels)
---
### Citation
If you use this model, please cite the following paper:
```bibtex
@article{faye2025rpo,
title = {Value-Free Policy Optimization via Reward Partitioning},
author = {Bilal Faye and Hanane Azzag and Mustapha Lebbah},
journal = {arXiv preprint arXiv:2406.XXXX},
year = {2025}
}
```
---
### Related Models
* `bilalfaye/flan-t5-small-rpo`
* `bilalfaye/flan-t5-large-rpo`
* `bilalfaye/flan-t5-xl-rpo`