Commit 6bc4777 (verified) by lievan · 1 Parent(s): 2c5e5ce

Update README.md

  1. README.md +31 -33
README.md CHANGED
@@ -13,28 +13,28 @@ tags:
13
  - 🤗 [UltraInteract](https://huggingface.co/datasets/openbmb/UltraInteract)
14
  -
15
 
16
- # UltraRM
17
 
18
- We train and release a reward model UltraRM based on UltraFeedback to further facilitate alignment research. UltraRM is initialized by LLaMA2-13B.
19
 
20
- Specifically, we train two versions of reward models, where UltraRM-UF is merely fine-tuned on UltraFeedback and UltraRM is fine-tuned on a mixture of UltraFeedback and an equal-size sample from three open-source datasets including [Anthropic HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf), [Standford SHP](https://huggingface.co/datasets/stanfordnlp/SHP), and [Summarization](https://huggingface.co/datasets/openai/summarize_from_feedback).
 
 
 
21
 
22
- ## Reward Modeling
23
-
24
- On four public preference test sets, our UltraRM achieves SOTA over other open-source reward models.
25
 
26
  ## Usage
27
  ```python
28
- from transformers import PreTrainedModel, LlamaConfig, LlamaModel, LlamaTokenizer
29
  import torch.nn as nn
30
  import torch
31
  from typing import Optional, List
32
 
33
- class LlamaRewardModel(PreTrainedModel):
34
- config_class = LlamaConfig
35
  def __init__(self, config):
36
  super().__init__(config)
37
- self.model = LlamaModel(config)
38
  self.regression_head = nn.Linear(self.config.hidden_size, 1, bias=False)
39
 
40
  def forward( # args are the same as LlamaForCausalLM
@@ -67,38 +67,36 @@ class LlamaRewardModel(PreTrainedModel):
67
 
68
  return rewards
69
 
70
- ultrarm_template = """Human: {instruction}
71
-
72
- Assistant: {completion}"""
73
 
74
- dataset = [ # cases in webgpt
75
- {"chosen":"Human: \"Who orders martinis \"\"shaken, not stirred\"\"?\"\nAssistant: Sean Connery's character, fictional British Secret Service agent James Bond, in the movie Goldfinger, stated that he preferred his martini to be \"shaken, not stirred\". [1] Some believe that Bond ordered his martini shaken because of the vodka it contained, as vodka was, for the most part, refined from potatoes (cheaper brands) which made the vodka oily. To disperse the oil, Bond ordered his martinis shaken. [2]","rejected":"Human: \"Who orders martinis \"\"shaken, not stirred\"\"?\"\nAssistant: Fleming's fictional British Secret Service agent James Bond orders his martini cocktail shaken, not stirred [1]. Bond's preferences for his martini are carried over to the films, where his orders are seen in both the 1961 film Dr. No and the 2006 film Casino Royale [1, 2]. In both films, Bond's subordinates copy his order, telling the bartender to keep the fruit with their drinks [2]. However, in the 2006 film, Bond appears irritated when the bartender asks if he would like his drink shaken or stirred [2]."},
76
- {"chosen":"Human: Sural relates to which part of the body?\nAssistant: The sural region is the muscular swelling of the back of the leg below the knee, formed chiefly by the bellies of the gastrocnemius and soleus muscles [1,2].","rejected":"Human: Sural relates to which part of the body?\nAssistant: The Sural nerve runs down the side of the leg near the small saphenous vein, then passes forward below the lateral malleolus and continues on the outside of the foot as the lateral dorsal cutaneous nerve, which then communicates with the intermediate dorsal cutaneous nerve, which branches off to the side of the foot. [1]"}
77
- ]
 
78
 
79
 
80
- tokenizer = LlamaTokenizer.from_pretrained("/data/UltraRM-13b")
81
- model = LlamaRewardModel.from_pretrained("/data/UltraRM-13b")
 
82
 
83
- for example in dataset:
84
- inputs = tokenizer(example["chosen"], return_tensors="pt")
85
- chosen_reward = model(**inputs).item()
86
- inputs = tokenizer(example["rejected"], return_tensors="pt")
87
- rejected_reward = model(**inputs).item()
88
- print(chosen_reward - rejected_reward)
89
 
90
- # Output 1: 2.4158712085336447
91
- # Output 2: 0.1896953582763672
 
92
  ```
93
 
94
  ## Citation
95
  ```
96
- @misc{cui2023ultrafeedback,
97
- title={UltraFeedback: Boosting Language Models with High-quality Feedback},
98
- author={Ganqu Cui and Lifan Yuan and Ning Ding and Guanming Yao and Wei Zhu and Yuan Ni and Guotong Xie and Zhiyuan Liu and Maosong Sun},
99
- year={2023},
100
- eprint={2310.01377},
101
- archivePrefix={arXiv},
102
  primaryClass={cs.CL}
103
  }
104
  ```
 
13
  - 🤗 [UltraInteract](https://huggingface.co/datasets/openbmb/UltraInteract)
14
  -
15
 
+ # Introduction

+ Eurus-RM-7B is trained on a mixture of [UltraInteract](https://huggingface.co/datasets/openbmb/UltraInteract), [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback), and [UltraSafety](https://huggingface.co/datasets/openbmb/UltraSafety), with a reward modeling objective designed specifically for reasoning.

+ - Eurus-RM-7B stands out as the best 7B RM overall and achieves similar or better performance than much larger baselines. In particular, it outperforms GPT-4 on certain tasks.
+ - Our training objective improves RM performance on hard problems and reasoning.
+ - UltraInteract is compatible with other datasets such as UltraFeedback and UltraSafety, and mixing these datasets can balance different RM abilities.
+ - Eurus-RM-7B improves LLMs' reasoning performance by a large margin through reranking.
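
Reranking here means scoring several candidate completions for one prompt and keeping the one with the highest reward. A minimal sketch of that best-of-n selection follows. The names `rerank` and `toy_score` are hypothetical, and `toy_score` is only a stand-in so the snippet runs without downloading the model; in practice the scoring function would wrap the reward model call shown under Usage.

```python
from typing import Callable, List, Tuple


def rerank(
    prompt: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every candidate and return (candidate, reward) pairs, best first."""
    scored = [(candidate, score_fn(prompt, candidate)) for candidate in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(prompt: str, completion: str) -> float:
    # Hypothetical stand-in for a reward model: longer completions score higher.
    # Replace with a function that formats the pair and calls the reward model.
    return float(len(completion))


ranked = rerank("What is 2 + 2?", ["4", "2 + 2 = 4", "5?"], toy_score)
best, best_reward = ranked[0]
print(best, best_reward)
```

Keeping the scoring function as a parameter means the same selection logic works with any reward model that returns one scalar per prompt and completion pair.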

  ## Usage
  ```python
+ from transformers import PreTrainedModel, AutoConfig, AutoModel, AutoTokenizer, MistralConfig
  import torch.nn as nn
  import torch
  from typing import Optional, List

+ class EurusRewardModel(PreTrainedModel):
+     config_class = MistralConfig
      def __init__(self, config):
          super().__init__(config)
+         # Build the backbone from the config; from_pretrained() expects a path or repo id, not a config object.
+         self.model = AutoModel.from_config(config)
          self.regression_head = nn.Linear(self.config.hidden_size, 1, bias=False)

      def forward( # args are the same as LlamaForCausalLM
 

          return rewards

+ def test(model_path):
+     dataset = [ # cases in webgpt; we use the same template as Mistral-Instruct-v0.2
+ {"chosen":"[INST] \"Who orders martinis \"\"shaken, not stirred\"\"?\" [\INST] Sean Connery's character, fictional British Secret Service agent James Bond, in the movie Goldfinger, stated that he preferred his martini to be \"shaken, not stirred\". [1] Some believe that Bond ordered his martini shaken because of the vodka it contained, as vodka was, for the most part, refined from potatoes (cheaper brands) which made the vodka oily. To disperse the oil, Bond ordered his martinis shaken. [2]","rejected":"[INST] \"Who orders martinis \"\"shaken, not stirred\"\"?\" [\INST] Fleming's fictional British Secret Service agent James Bond orders his martini cocktail shaken, not stirred [1]. Bond's preferences for his martini are carried over to the films, where his orders are seen in both the 1961 film Dr. No and the 2006 film Casino Royale [1, 2]. In both films, Bond's subordinates copy his order, telling the bartender to keep the fruit with their drinks [2]. However, in the 2006 film, Bond appears irritated when the bartender asks if he would like his drink shaken or stirred [2]."},
+ {"chosen":"[INST] Sural relates to which part of the body? [\INST] The sural region is the muscular swelling of the back of the leg below the knee, formed chiefly by the bellies of the gastrocnemius and soleus muscles [1,2].","rejected":"[INST] Sural relates to which part of the body? [\INST] The Sural nerve runs down the side of the leg near the small saphenous vein, then passes forward below the lateral malleolus and continues on the outside of the foot as the lateral dorsal cutaneous nerve, which then communicates with the intermediate dorsal cutaneous nerve, which branches off to the side of the foot. [1]"}
+     ]

+     tokenizer = AutoTokenizer.from_pretrained(model_path)
+     config = AutoConfig.from_pretrained(model_path)
+     # Load the trained weights; EurusRewardModel(config) alone would leave them randomly initialized.
+     model = EurusRewardModel.from_pretrained(model_path, config=config)

+     for example in dataset:
+         inputs = tokenizer(example["chosen"], return_tensors="pt")
+         chosen_reward = model(**inputs).item()
+         inputs = tokenizer(example["rejected"], return_tensors="pt")
+         rejected_reward = model(**inputs).item()
+         print(chosen_reward - rejected_reward)

+ test("openbmb/Eurus-RM-7b")
+ # Output 1: 0.14470714330673218
+ # Output 2: 0.7317184507846832
  ```
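
The loop above prints `chosen_reward - rejected_reward` for each pair, so a positive value means the reward model ranks the chosen response higher. One common way to summarize many such pairs is pairwise accuracy, sketched below. `pairwise_accuracy` is a hypothetical helper, and the reward values are made up for illustration; they are not model outputs.

```python
from typing import List, Tuple


def pairwise_accuracy(reward_pairs: List[Tuple[float, float]]) -> float:
    """Fraction of (chosen, rejected) pairs where the chosen reward is strictly higher."""
    if not reward_pairs:
        raise ValueError("reward_pairs must contain at least one pair")
    correct = sum(1 for chosen, rejected in reward_pairs if chosen > rejected)
    return correct / len(reward_pairs)


# Illustrative (chosen_reward, rejected_reward) values only.
example_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0), (0.5, 0.5)]
accuracy = pairwise_accuracy(example_pairs)
print(accuracy)
```

Ties count as misses in this sketch, which is a conservative choice; a different convention would change the comparison on the `correct` line.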

  ## Citation
  ```
+ @misc{yuan2024advancing,
+       title={Advancing LLM Reasoning Generalists with Preference Trees},
+       author={Lifan Yuan and Ganqu Cui and Hanbin Wang and Ning Ding and Xingyao Wang and Jia Deng and Boji Shan and Huimin Chen and Ruobing Xie and Yankai Lin and Zhenghao Liu and Bowen Zhou and Hao Peng and Zhiyuan Liu and Maosong Sun},
+       year={2024},
        primaryClass={cs.CL}
  }
  ```