|
--- |
|
library_name: transformers |
|
tags: [] |
|
--- |
|
|
|
|
|
|
|
|
|
## Model Description |
|
|
|
This is the DPO model in our Mixture of Agents Alignment (MoAA) pipeline, fine-tuned from Gemma-2-9b-it. MoAA is an approach that leverages the collective intelligence of open-source LLMs to advance model alignment.
|
|
|
Our MoAA method involves two main stages. In the first stage, we employ Mixture-of-Agents (MoA) to produce high-quality synthetic data for supervised fine-tuning. In the second stage, we combine multiple LLMs into a reward model that provides preference annotations.
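
As an illustration of the first stage, the sketch below follows the general MoA pattern: several open-source proposer models each draft an answer to an instruction, and an aggregator model synthesizes the drafts into a single response that serves as the SFT target. The specific model names, prompts, and the `chat` helper are placeholders for illustration, not the exact configuration from the paper.

```
from transformers import pipeline

# Illustrative proposer/aggregator choices; the paper uses its own model set.
PROPOSERS = ["google/gemma-2-9b-it", "meta-llama/Llama-3.1-8B-Instruct"]
AGGREGATOR = "google/gemma-2-9b-it"

def chat(model_name: str, prompt: str, max_new_tokens: int = 512) -> str:
    """Single-turn chat completion via a text-generation pipeline
    (assumes a transformers version with chat-format pipeline support).
    A real setup would cache pipelines instead of rebuilding one per call."""
    pipe = pipeline("text-generation", model=model_name, device_map="auto")
    out = pipe([{"role": "user", "content": prompt}], max_new_tokens=max_new_tokens)
    return out[0]["generated_text"][-1]["content"]

def moa_synthesize(instruction: str) -> str:
    """MoA-style stage-1 data generation: proposers draft, an aggregator merges."""
    drafts = [chat(m, instruction) for m in PROPOSERS]
    agg_prompt = (
        "Synthesize the candidate responses below into a single, higher-quality "
        "response to the instruction.\n\n"
        f"Instruction:\n{instruction}\n\n"
        + "\n\n".join(f"Candidate {i + 1}:\n{d}" for i, d in enumerate(drafts))
    )
    # The aggregator's output becomes the SFT target for this instruction.
    return chat(AGGREGATOR, agg_prompt)
```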
|
|
|
Some key takeaways from our work:
|
|
|
|
|
|
|
- 📈 **Alignment pipeline that actually works.** Our MoAA method raises Llama-3.1-8B-Instruct's Arena-Hard score **19 → 48** and Gemma-2-9B-it's **42 → 56**, handily beating GPT-4o-labeled training sets at the time.
|
|
|
- 🏆 **Ensembled rewards > single critics.** An MoA reward model with dynamic criteria filtering edges out the competitive ArmoRM on MT-Bench and Arena-Hard, all while staying 100% open source (a generic sketch of ensemble scoring follows this list).
|
|
|
- 🚀 **Self-improvement unlocked.** Fine-tune the strongest model in the ensemble on MoAA data and it *surpasses its own teachers*, evidence that open models can push past proprietary ceilings without external supervision.
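
Purely as an illustration of the ensemble-reward idea referenced above, the snippet below averages 1-10 ratings from several open-source LLM judges, reusing the hypothetical `chat` helper from the earlier sketch; the judge names and prompt are placeholders, and the paper's MoA reward model with dynamic criteria filtering is more sophisticated than this plain average.

```
import re
from statistics import mean

# Illustrative judge set; treat these names as placeholders.
JUDGES = ["google/gemma-2-9b-it", "meta-llama/Llama-3.1-8B-Instruct"]

JUDGE_PROMPT = (
    "Rate how well the response follows the instruction on a 1-10 scale. "
    "Reply with only the number.\n\nInstruction:\n{prompt}\n\nResponse:\n{response}"
)

def ensemble_score(prompt: str, response: str) -> float:
    """Average the ratings returned by several open-source LLM judges."""
    scores = []
    for judge in JUDGES:
        reply = chat(judge, JUDGE_PROMPT.format(prompt=prompt, response=response))
        match = re.search(r"\d+(\.\d+)?", reply)
        if match:
            scores.append(float(match.group()))
    return mean(scores) if scores else 0.0
```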
|
|
|
|
|
## Model Sources |
|
|
|
|
|
For more details, refer to:
|
|
|
- **[Paper](https://arxiv.org/abs/2505.03059)** |
|
|
|
|
|
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. First, load the tokenizer and model:
|
```
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/gemma-2-9b-it-MoAA-DPO")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/gemma-2-9b-it-MoAA-DPO")
```
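
Then you can run a single chat turn with Gemma-2's chat template. The prompt and generation settings below are illustrative placeholders, not values from the paper.

```
import torch

# Build a chat-formatted prompt (the message content is just an example).
messages = [{"role": "user", "content": "Explain DPO in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate a response; these generation settings are illustrative defaults.
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=256)

print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```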
|
|
|
|
|
## Training Data |
|
|
|
We sample 5 responses per prompt from the previously trained SFT model and use a reward model to select preferred and rejected responses for preference learning. Specifically, the reward model scores each sampled response, and we take the highest-scoring response as the "chosen" response and the lowest-scoring response as the "rejected" response. For this step we propose a novel technique that leverages MoA as the reward model.
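
A minimal sketch of this pair construction is shown below, assuming per-response reward scores are already available (for example, from an ensemble scorer like the one sketched earlier); the function name and data layout are illustrative, and the resulting `prompt`/`chosen`/`rejected` records follow the format typically consumed by DPO trainers such as TRL's `DPOTrainer`.

```
from typing import Callable, Sequence

def build_dpo_pair(
    prompt: str,
    responses: Sequence[str],               # e.g. 5 samples from the SFT model
    score_fn: Callable[[str, str], float],  # reward model: (prompt, response) -> score
) -> dict:
    """Keep the highest-scoring response as 'chosen' and the lowest as 'rejected'."""
    scores = [score_fn(prompt, r) for r in responses]
    chosen = responses[max(range(len(responses)), key=scores.__getitem__)]
    rejected = responses[min(range(len(responses)), key=scores.__getitem__)]
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Example with a dummy length-based scorer; a real setup would call the MoA reward model.
pair = build_dpo_pair("Say hi.", ["Hi!", "hello there", "Hey"], lambda p, r: len(r))
```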
|
|
|
## Evaluation & Performance |
|
|
|
Refer to the [paper](https://arxiv.org/abs/2505.03059) for evaluation metrics.
|
|
|
|
|
|
|
|
|
|
|
## Citation |
|
```
@article{wang2025improving,
  title   = {Improving Model Alignment Through Collective Intelligence of Open-Source LLMs},
  author  = {Junlin Wang and Roy Xie and Shang Zhu and Jue Wang and Ben Athiwaratkun and Bhuwan Dhingra and Shuaiwen Leon Song and Ce Zhang and James Zou},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.03059}
}
```