I have a dataset where the chosen responses come from gpt-4-turbo and the rejected responses come from a lower-performing model. The objective should therefore be fairly easy, because the two are easy to tell apart. As a consequence, the model achieves very low losses (0.021 train; 0.013 validation) and high reward accuracies (0.995) on the DPO objective (reproduced below for reference). **However**, when I use the model in practice, it often deteriorates after the first one or two tokens and then continuously outputs sequences of `/*****/`.
So despite the good performance on the DPO objective and strong scores on the validation set (no overfitting), something seems to go wrong. Perhaps the chosen and rejected outputs are too different and the task is too easy, in which case DPO is not very useful here. But why, then, would the model start hallucinating and repeating the same token over and over again? Any thoughts? Any suggestions to get around this? All discussions are welcome!
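
For reference, here is the DPO objective as I understand it from the original paper, where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is the frozen reference model, $y_w$/$y_l$ are the chosen/rejected responses, and $\beta$ is the usual temperature hyperparameter:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

With pairs this easy to separate, the implicit reward margin inside the sigmoid grows large quickly, which as far as I can tell is enough on its own to push the loss toward zero and the reward accuracy toward 1.0, i.e. the numbers above.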
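
And in case decoding settings matter, this is roughly how I'm running inference. It's a simplified sketch: the model path, prompt, and generation parameters below are placeholders, not my exact setup.

```python
# Minimal inference sketch (placeholders for model path, prompt, and
# decoding settings; shown only so the failure mode is reproducible in spirit).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/dpo-finetuned-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain what DPO is in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding here purely as an example; the exact settings are an assumption.
output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```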