Custom pseudo "fill in the middle" trained model, designed to handle varying "corruption rates" (randomized UTF-8 character substitution). Two custom GRPO reward functions were used to improve the pre-existing SFT-trained model so that it more reliably attends to the XML styling.
Designed to be used with the (jank, hacky, personalized) PyQt GUI tooling seen at: https://github.com/kalomaze/quest-tools
Wandb logs for this run can be found here, along with the attached RL code. Full hyperparameters are visible in the configuration .py as well.
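The corruption code itself is not reproduced in this card; the following is a minimal sketch of what randomized UTF-8 character substitution at a varying rate could look like. The `corrupt` function, the code-point range, and the rate distribution are illustrative assumptions, not the actual training code.

```python
import random

def corrupt(text: str, rate: float) -> str:
    """Replace roughly `rate` of the characters with random UTF-8 code points.

    Illustrative sketch only; the substitution scheme used in training may differ.
    """
    out = []
    for ch in text:
        if random.random() < rate:
            # Draw a random code point from the lower Basic Multilingual Plane.
            out.append(chr(random.randint(0x21, 0x2FFF)))
        else:
            out.append(ch)
    return "".join(out)

# Varying corruption rates: e.g. sample a fresh rate per example.
corrupted = corrupt("The quick brown fox jumps over the lazy dog.",
                    rate=random.uniform(0.05, 0.5))
```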
Prompt Formatting
Trained without ChatML templating. This model uses a pattern of:
- Raw "corrupted" text at the beginning, with UTF-8 substitution applied to parts of the input.
- The "objective" as a Claude-style XML tag with newline separators.
- The beginning of an "original" tag.
```python
from typing import Dict

def _format_prompt(self, example: Dict) -> str:
    return (
        f"{example['corrupted']}\n\n"
        "<objective>\n"
        "gently repair the <original> content\n"
        "</objective>\n\n"
        "<original>\n"
    )
```
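The two GRPO reward functions mentioned above live in the attached RL code and are not reproduced here. As a hedged illustration only, a simple format reward that checks whether a completion respects the XML styling (cleanly closing the `<original>` tag) might look like the sketch below; the function name and scoring values are assumptions, not the actual reward code.

```python
def xml_format_reward(completion: str) -> float:
    """Reward completions that close the <original> block cleanly.

    Illustrative sketch only; the real GRPO reward functions may differ.
    """
    closing = "</original>"
    if closing not in completion:
        return 0.0
    # Penalize trailing junk after the closing tag.
    trailing = completion.split(closing, 1)[1].strip()
    return 1.0 if not trailing else 0.5
```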
The primary utility of this model is to synthesize rejected / lower-quality preference data from pre-existing SFT data (i.e., the general pretraining corpus). This is useful for teaching a reward model generalized preferences from lower-quality, subtly incoherent, base-model-esque completions, which are trivial to produce compared to human annotations.
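As a hedged example of that downstream use, a preference record could be assembled by pairing the clean source text (chosen) against this model's repair of a corrupted copy (rejected). The `repair_model.generate` call and field names below are placeholders, not an API defined by this repo.

```python
def make_preference_pair(clean_text: str, corrupted_text: str, repair_model) -> dict:
    """Pair the clean source (chosen) with the model's repair of a corrupted copy (rejected).

    `repair_model.generate` is a placeholder for however this model is actually invoked.
    """
    prompt = (
        f"{corrupted_text}\n\n"
        "<objective>\n"
        "gently repair the <original> content\n"
        "</objective>\n\n"
        "<original>\n"
    )
    # The repair is intentionally imperfect: subtly incoherent, base-model-esque text.
    rejected = repair_model.generate(prompt)
    return {"chosen": clean_text, "rejected": rejected}
```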
Acknowledgements
Trained on 8xH200s provided free of charge by Deepshard for research & open source experimentation. Big McThankies.