---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

# EurusPRM-Stage1

## Links

## Introduction

EurusPRM-Stage1 is trained with Implicit PRM, which obtains process rewards for free: it only requires training an ORM on the cheaper response-level labels. During inference, implicit process rewards are obtained by running a forward pass and computing the log-likelihood ratio at each step. It serves as a strong foundation for the further training of EurusPRM-Stage2.
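As a minimal sketch of the inference-time computation described above: assuming the implicit process reward of a step is the scaled increase of the log-likelihood ratio between the trained model and a frozen reference model over that step's tokens, the per-step rewards can be computed from per-token log-probabilities alone. The function name, `beta` value, and all numbers below are illustrative placeholders, not the released implementation.

```python
def implicit_process_rewards(logp_model, logp_ref, step_ends, beta=0.001):
    """Return one implicit process reward per reasoning step.

    logp_model / logp_ref: per-token log-probabilities of the same response
    under the trained model and the frozen reference model.
    step_ends: exclusive end index of each step in the token sequence.
    """
    # Per-token log-likelihood ratio log pi(y_i) - log pi_ref(y_i).
    ratios = [m - r for m, r in zip(logp_model, logp_ref)]
    rewards, start = [], 0
    for end in step_ends:
        # Step reward = beta * sum of token log-ratios inside the step,
        # i.e. the increment of the cumulative log-likelihood ratio.
        rewards.append(beta * sum(ratios[start:end]))
        start = end
    return rewards

# Toy example: fabricated log-probs for a 5-token response split into 2 steps.
logp_model = [-1.0, -0.5, -2.0, -0.8, -0.3]
logp_ref = [-1.2, -0.9, -1.8, -1.0, -0.4]
print(implicit_process_rewards(logp_model, logp_ref, step_ends=[2, 5]))
```

In practice the two log-probability sequences would come from forward passes of the PRM and its reference model over the tokenized response; only the per-step aggregation is shown here.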


The key ingredient of Implicit PRM is the reward representation, as demonstrated below: