super duper ultra highly experimental LoRA finetune of EleutherAI/pythia-410m-deduped on argilla/dpo-mix-7k, meant to act as a reward model.
nexusflow achieved good results with traditional reward model finetuning! why not meeeeeee :3
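
for the curious: "traditional" reward model finetuning usually means training on chosen/rejected pairs with a pairwise (Bradley-Terry style) loss. here's a minimal sketch of that loss in plain python. this is an assumption about the general recipe, not the actual training script for this model, and the function names are made up for illustration.

```python
import math


def pairwise_reward_loss(chosen_score: float, rejected_score: float) -> float:
    """Bradley-Terry style loss for one pair: -log(sigmoid(chosen - rejected)).

    Hypothetical helper for illustration; the scores stand in for the scalar
    outputs a reward model would give the chosen and rejected responses.
    """
    margin = chosen_score - rejected_score
    # -log(sigmoid(m)) equals softplus(-m); branch so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def batch_loss(pairs: list[tuple[float, float]]) -> float:
    """Mean pairwise loss over (chosen_score, rejected_score) tuples."""
    return sum(pairwise_reward_loss(c, r) for c, r in pairs) / len(pairs)


if __name__ == "__main__":
    # Made-up scores, purely to show the call shape.
    example_pairs = [(1.5, -0.5), (0.2, 0.9), (3.0, 2.0)]
    print(batch_loss(example_pairs))
```

the loss shrinks as the chosen response outscores the rejected one by a wider margin, and grows roughly linearly when the model ranks the pair the wrong way round.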