DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step120-reward 2B • Updated May 5 • 6
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step120-actor 2B • Updated May 5 • 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step110-reward 2B • Updated May 5 • 6
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step110-actor 2B • Updated May 5 • 9
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step100-reward 2B • Updated May 5 • 7
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step100-actor 2B • Updated May 5 • 6
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step90-reward 2B • Updated May 5 • 9
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step90-actor 2B • Updated May 5 • 7
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step80-reward 2B • Updated May 5 • 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step80-actor 2B • Updated May 5 • 5