fix link
README.md
<!-- Provide a quick summary of what the model is/does. -->

Starling-RM-7B-alpha is a reward model trained from [Llama2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). Following the reward-model training method in [the InstructGPT paper](https://arxiv.org/abs/2203.02155), we remove the last layer of Llama2-7B-Chat and append a linear layer that outputs a scalar for any pair of input prompt and response. We train the reward model on the preference dataset [berkeley-nest/Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar) with the K-wise maximum likelihood estimator proposed in [this paper](https://arxiv.org/abs/2301.11270). The reward model outputs a scalar for any given prompt and response: a response that is more helpful and less harmful receives a higher reward score. Note that because the preference dataset is based on GPT-4 preferences, the reward model is likely to be biased towards GPT-4's own preferences, including longer responses and certain response formats.
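
As a rough illustration of the architecture described above, here is a minimal sketch of a scalar reward head on a causal-LM backbone. This is not the released training code: the `RewardModel` class name and the last-token pooling choice are our assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """Illustrative sketch: transformer backbone + scalar reward head."""

    def __init__(self, backbone_name: str):
        super().__init__()
        # Backbone without the LM head (the "last layer" is removed).
        self.backbone = AutoModel.from_pretrained(backbone_name)
        # Linear layer mapping the final hidden state to a scalar reward.
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                          # (batch, seq, hidden)
        # Score each sequence by the hidden state of its last non-pad token.
        last_idx = attention_mask.sum(dim=1) - 1     # (batch,)
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        return self.reward_head(hidden[batch_idx, last_idx]).squeeze(-1)
```

Scoring a prompt/response pair then amounts to tokenizing the concatenated conversation and reading off the scalar output.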
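
For intuition on the K-wise maximum likelihood estimator, the linked paper can be read as fitting a Plackett-Luce model over the K ranked responses to each prompt. The sketch below reflects that reading and is not the authors' released code; `k_wise_mle_loss` and its ranking convention are assumptions.

```python
import torch

def k_wise_mle_loss(rewards: torch.Tensor) -> torch.Tensor:
    """Plackett-Luce negative log-likelihood over K ranked responses.

    rewards: (batch, K) scalar rewards for K responses to the same prompt,
             sorted from most preferred (column 0) to least preferred.
    """
    _, K = rewards.shape
    loss = rewards.new_zeros(())
    for k in range(K - 1):
        # Log-probability that the k-th ranked response beats all
        # remaining lower-ranked responses.
        logits = rewards[:, k:]                      # (batch, K - k)
        loss = loss - torch.log_softmax(logits, dim=-1)[:, 0].mean()
    return loss / (K - 1)
```
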
For more detailed discussions, please check out our [blog post](https://starling.cs.berkeley.edu), and stay tuned for our upcoming code and paper!