---
base_model:
- meta-llama/Llama-3.1-8B-Instruct
- meta-llama/Llama-3.2-3B-Instruct
- meta-llama/Llama-3.2-1B-Instruct
- meta-llama/Meta-Llama-3-8B-Instruct
- mistralai/Mistral-7B-Instruct-v0.1
- mistralai/Mistral-7B-Instruct-v0.2
- mistralai/Mistral-7B-Instruct-v0.3
datasets:
- PKU-Alignment/PKU-SafeRLHF
- HuggingFaceH4/ultrafeedback_binarized
- Anthropic/hh-rlhf
- PKU-Alignment/BeaverTails-Evaluation
- declare-lab/HarmfulQA
language:
- en
---
Figure: Performance improvements of MARA across the PKU-SafeRLHF, BeaverTails, and HarmfulQA datasets. Each entry shows the percentage improvement in preference rate achieved by applying MARA compared to using the original LLM alone.
Figure: Compatibility analysis of MARA, an alignment model trained with one LLM and then aggregated with a different inference LLM. The value of each cell is the percentage improvement in preference rate of our algorithm over the upstream (inference) model.
Figure: Performance comparison of MARA against RLHF, DPO, and Aligner, measured by the percentage improvement in preference rate.
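The compatibility figure describes MARA as an alignment model that is trained alongside one LLM and then aggregated with a separate inference LLM. This card does not include inference code, so the snippet below is only a minimal sketch of decoding-time aggregation under assumed details: the checkpoint names, the `beta` weight, and the rule of summing the two models' next-token log-probabilities are illustrative assumptions, not MARA's published algorithm, and the sketch further assumes both models share a tokenizer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed names: an upstream "inference" model and a smaller model standing in
# for the MARA alignment checkpoint. Both must share a tokenizer/vocabulary for
# the log-probability sum below to be meaningful.
inference_name = "meta-llama/Llama-3.1-8B-Instruct"
aligner_name = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder, not the actual MARA weights

tok = AutoTokenizer.from_pretrained(inference_name)
inference_model = AutoModelForCausalLM.from_pretrained(inference_name, torch_dtype=torch.bfloat16)
aligner_model = AutoModelForCausalLM.from_pretrained(aligner_name, torch_dtype=torch.bfloat16)

@torch.no_grad()
def aggregated_generate(prompt: str, max_new_tokens: int = 64, beta: float = 1.0) -> str:
    """Greedy decoding that mixes next-token log-probs from both models.

    `beta` weights the alignment model's contribution; the additive mixing rule
    is an illustrative assumption, not the method described in the MARA paper.
    """
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        base_logits = inference_model(ids).logits[:, -1, :]
        align_logits = aligner_model(ids).logits[:, -1, :]
        combined = (
            torch.log_softmax(base_logits, dim=-1)
            + beta * torch.log_softmax(align_logits, dim=-1)
        )
        next_id = combined.argmax(dim=-1, keepdim=True)
        if next_id.item() == tok.eos_token_id:
            break
        ids = torch.cat([ids, next_id], dim=-1)
    return tok.decode(ids[0], skip_special_tokens=True)

print(aggregated_generate("How do I respond to an angry customer email?"))
```

If the two models used different tokenizers, the logits would have to be mapped into a common vocabulary before mixing; consult the accompanying paper or repository for MARA's actual aggregation rule.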