---
base_model:
- google/flan-t5-small
- google/flan-t5-large
- google/flan-t5-xl
---

## 🧠 Flan-T5-{Small|Large|XL}-RPO

> 🔬 Fine-tuned with **Reward Partitioning Optimization (RPO)**, a value-free, stable method for single-trajectory reinforcement learning with scalar feedback.

---

### 📌 Model Summary

This model is a fine-tuned variant of the [Flan-T5](https://huggingface.co/google/flan-t5) {Small|Large|XL} checkpoint, trained with **Reward Partitioning Optimization (RPO)**. RPO is a method for learning from single-trajectory scalar feedback (e.g., thumbs up/down) that eliminates the need for learned value functions or preference pairs.

* ✅ Trained with only (prompt, response, reward) triplets.
* 🔁 No joint optimization, no auxiliary models.
* 🚀 Efficient and stable training.
* 🤖 Strong preference alignment (evaluated by LLM-as-a-judge).
* 📊 Outperforms KTO and DRO in automatic metrics and LLM preference win rate.

---

### 🧪 Training Details

* **Base Model:** `flan-t5-{small|large|xl}`
* **Dataset:** [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback), high-quality (prompt, response, reward) triplets with multiple completions per prompt (see the illustrative sketches at the end of this card).
* **Feedback Format:** scalar reward, i.e., (prompt, response, reward) triplets.
* **GPU Used:** 1× A100 (80 GB)
* **Training Objective:** RPO supervised learning using partitioned reward normalization.
* **Baselines Compared:** DRO and KTO.

---

### 🤖 Inference

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "bilalfaye/flan-t5-{small|large|xl}-rpo"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

prompt = "How can I improve my productivity working from home?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

outputs = model.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```

---

### 📈 Evaluation Summary

| Judge   | Win Rate vs DRO | Win Rate vs KTO | Win Rate vs SFT |
| ------- | --------------- | --------------- | --------------- |
| Mistral | ✅ **83–93%**    | ✅ **82–93%**    | ✅ **82–84%**    |
| LLaMA   | ✅ **67–74%**    | ✅ **65–72%**    | ✅ **63–73%**    |

---

### ✅ Use Cases

* Aligned conversational agents
* Helpful, non-toxic instruction following
* Scalar feedback training pipelines
* Preference-optimized generation (without pairwise preference labels)

---

### 📚 Citation

If you use this model, please cite the following paper:

```bibtex
@article{faye2025rpo,
  title   = {Value-Free Policy Optimization via Reward Partitioning},
  author  = {Bilal Faye and Hanane Azzag and Mustapha Lebbah},
  journal = {arXiv preprint arXiv:2406.XXXX},
  year    = {2025}
}
```

---

### 🔗 Related Models

* `bilalfaye/flan-t5-small-rpo`
* `bilalfaye/flan-t5-large-rpo`
* `bilalfaye/flan-t5-xl-rpo`
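
---

### 🛠️ Illustrative: Building (prompt, response, reward) Triplets

The Training Details above describe flattening UltraFeedback into (prompt, response, reward) triplets, with multiple completions per prompt. The sketch below shows one possible way to do that flattening. The split and field names (`instruction`, `completions`, `response`, `overall_score`) reflect our reading of the public UltraFeedback dataset card, and the helper `to_triplets` is hypothetical; this is not necessarily the exact preprocessing used to train this model.

```python
from datasets import load_dataset

# Load the raw UltraFeedback dataset (split and field names are assumptions
# based on the public dataset card, not this model's training pipeline).
dataset = load_dataset("openbmb/UltraFeedback", split="train")

def to_triplets(example):
    """Flatten one prompt and its multiple completions into scalar-feedback triplets."""
    prompt = example["instruction"]
    return [
        {
            "prompt": prompt,
            "response": completion["response"],
            "reward": float(completion["overall_score"]),  # scalar feedback signal
        }
        for completion in example["completions"]
    ]

# Build a flat list of (prompt, response, reward) records.
triplets = []
for example in dataset:
    triplets.extend(to_triplets(example))

print(len(triplets), triplets[0]["reward"])
```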
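
---

### 🧮 Illustrative: Reward-Weighted Training on Triplets

This card describes RPO as supervised learning on (prompt, response, reward) triplets with partitioned reward normalization and no value function; the exact objective is defined in the paper. The snippet below is only a minimal sketch of the general idea of weighting per-response negative log-likelihoods by rewards normalized within a prompt's group of completions (here via a softmax with a hypothetical temperature `beta`). The function name `reward_partitioned_step` and all numeric choices are assumptions for illustration, not the paper's RPO objective.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

def reward_partitioned_step(prompt, responses, rewards, beta=1.0):
    """Sketch: weight per-response negative log-likelihoods by rewards
    normalized across this prompt's completions (softmax partition).
    NOT the exact RPO objective from the paper."""
    rewards = torch.tensor(rewards, dtype=torch.float32)
    weights = F.softmax(rewards / beta, dim=0)  # normalize rewards within the prompt's partition

    enc = tokenizer([prompt] * len(responses), return_tensors="pt", padding=True)
    labels = tokenizer(responses, return_tensors="pt", padding=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss

    logits = model(
        input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels
    ).logits
    # Per-sequence negative log-likelihood (summed over tokens, padding ignored).
    nll = F.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=-100, reduction="none"
    ).sum(dim=1)
    return (weights * nll).sum()

loss = reward_partitioned_step(
    "How can I improve my productivity working from home?",
    ["Make a schedule, take regular breaks, and set clear boundaries.", "Just work more hours."],
    rewards=[8.0, 3.0],
)
loss.backward()  # an optimizer step would follow in a real training loop
```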