---
base_model:
- google/flan-t5-small
- google/flan-t5-large
- google/flan-t5-xl
---

## Flan-T5-{Small|Large|XL}-RPO

> Fine-tuned with **Reward Partitioning Optimization (RPO)**: a value-free, stable method for single-trajectory reinforcement learning with scalar feedback.

---

### Model Summary

This model is a fine-tuned variant of the [Flan-T5](https://huggingface.co/google/flan-t5) {Small|Large|XL} checkpoint, trained with **Reward Partitioning Optimization (RPO)**. RPO is a method for learning from single-trajectory scalar feedback (e.g., thumbs up/down) that removes the need for learned value functions or pairwise preference data.

* Trained with only (prompt, response, reward) triplets (see the sketch after this list).
* No joint optimization, no auxiliary models.
* Efficient and stable training.
* Strong preference alignment (evaluated by LLM-as-a-judge).
* Outperforms KTO and DRO in automatic metrics and LLM preference win rate.

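
For concreteness, here is a minimal sketch of the scalar-feedback records RPO consumes. The field names and values are illustrative only, not the exact schema used during training.

```python
# Illustrative (prompt, response, reward) triplets; the reward is any scalar
# feedback signal, e.g. a normalized thumbs-up/down or a judge score.
triplets = [
    {
        "prompt": "Summarize photosynthesis in one sentence.",
        "response": "Photosynthesis is the process by which plants turn light into chemical energy.",
        "reward": 0.92,
    },
    {
        "prompt": "Summarize photosynthesis in one sentence.",
        "response": "Plants are green.",
        "reward": 0.15,  # weaker completion for the same prompt
    },
]
```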

---

### Training Details

* **Base Model:** `flan-t5-{small|large|xl}`
* **Dataset:** [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback), which provides multiple scored completions per prompt, flattened into high-quality (prompt, response, reward) triplets (a loading sketch follows this list).
* **Feedback Format:** scalar reward, i.e., (prompt, response, reward).
* **GPU Used:** 1× A100 (80 GB)
* **Training Objective:** RPO supervised learning using partitioned reward normalization.
* **Baselines Compared:** DRO and KTO.

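
As a rough illustration, the flattening step might look like the following. The field names (`instruction`, `completions`, `response`, `overall_score`) are assumptions based on the UltraFeedback dataset card and should be verified against the actual schema; this is not the training pipeline's own code.

```python
from datasets import load_dataset

# Assumed UltraFeedback schema: each row carries an "instruction" and a list of
# "completions", each with a "response" text and a scalar "overall_score".
ds = load_dataset("openbmb/UltraFeedback", split="train")

triplets = []
for row in ds:
    for completion in row["completions"]:
        triplets.append(
            {
                "prompt": row["instruction"],
                "response": completion["response"],
                "reward": float(completion["overall_score"]),  # scalar feedback
            }
        )
```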

---

### Inference

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Replace the placeholder with a concrete checkpoint, e.g. "bilalfaye/flan-t5-small-rpo".
model_name = "bilalfaye/flan-t5-{small|large|xl}-rpo"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

prompt = "How can I improve my productivity working from home?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sampling-based decoding with the card's default generation parameters.
outputs = model.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```
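
For quick experiments, the high-level `pipeline` API is an equivalent option. The snippet below is a sketch that assumes one of the concrete checkpoints listed under Related Models (here `bilalfaye/flan-t5-small-rpo`).

```python
from transformers import pipeline

# text2text-generation wraps tokenization, generation, and decoding in one call.
generator = pipeline(
    "text2text-generation",
    model="bilalfaye/flan-t5-small-rpo",
    device_map="auto",
)

result = generator(
    "How can I improve my productivity working from home?",
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
)
print(result[0]["generated_text"])
```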

---

### Evaluation Summary

| Judge   | Win Rate vs DRO | Win Rate vs KTO | Win Rate vs SFT |
| ------- | --------------- | --------------- | --------------- |
| Mistral | **83–93%**      | **82–93%**      | **82–84%**      |
| LLaMA   | **67–74%**      | **65–72%**      | **63–73%**      |


---

### Use Cases

* Aligned conversational agents
* Helpful, non-toxic instruction following
* Scalar feedback training pipelines
* Preference-optimized generation (without pairwise preference labels)


---

### Citation

If you use this model, please cite the following paper:

```bibtex
@article{faye2025rpo,
  title   = {Value-Free Policy Optimization via Reward Partitioning},
  author  = {Bilal Faye and Hanane Azzag and Mustapha Lebbah},
  journal = {arXiv preprint arXiv:2406.XXXX},
  year    = {2025}
}
```

---

### Related Models

* `bilalfaye/flan-t5-small-rpo`
* `bilalfaye/flan-t5-large-rpo`
* `bilalfaye/flan-t5-xl-rpo`