Uploaded model: Reward Modelling for GRPO - Reasoning

  • Developed by: LeroyDyer
  • License: apache-2.0
  • Finetuned from model: LeroyDyer/CheckPoint_D_r1

Sliding window set to NULL

First Attempt at reasoning:

The next models will be the reasoning series, which will end up in a final merge of R models: flattening the model back to base, ready to merge with the original checkpoint series. It also means that most models in the future will be trained on reasoning processes and ReAct processes with reward modelling.

The reward model enforces the output format.
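As a sketch of what "enforcing the output format" can look like in a GRPO setup: a reward function that scores a completion on whether it follows a thinking-then-answer layout. The tag names below are illustrative assumptions, not taken from this card; TRL's `GRPOTrainer` accepts reward functions of roughly this shape.

```python
import re

def format_reward(completion: str) -> float:
    """Return 1.0 when the completion follows a <think>...</think><answer>...</answer>
    layout, else 0.0. Tag names here are hypothetical examples of a reasoning format."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0
```

Because the reward is binary on the format itself, the policy quickly learns to always emit the expected structure, which is consistent with the fast convergence described below.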

These models have been intensely trained on chain-of-thought data, so convergence was no problem: the model trained within 10 steps, super fast! So I will revisit some of my previously trained datasets and do reward modelling for them.

We ask: why?

This is because, for an answer to become fully generalised, the model needs options, i.e. many completions for the same or a similar task via different routes. We can distill from larger models, but that does not train the model to reason; it only gives the model options that include the new responses. We need to reward-train the model to get it to actually think about how the answer was arrived at. Math training is the obvious case: we need the model to determine how 1 + 2 = 3, not repeat it because it is highly likely given the training so far.
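The "many completions, different routes" idea is what GRPO exploits: several completions are sampled for the same prompt, each is scored, and each completion's advantage is its reward relative to the group. A minimal sketch of that group-relative scoring, assuming simple scalar rewards:

```python
from statistics import mean, stdev

def group_advantages(rewards):
    """GRPO-style advantages: score each completion for the same prompt relative
    to the group mean (normalised by the group's standard deviation), so training
    favours the *routes* that beat the group rather than one memorised answer."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 1.0
    sigma = sigma or 1.0  # all rewards equal -> avoid division by zero
    return [(r - mu) / sigma for r in rewards]
```

Completions that reach the correct answer through a better route get positive advantage and are reinforced; the rest are pushed down, which is why one prompt needs multiple sampled completions rather than a single distilled response.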

This mistral model was trained 2x faster with Unsloth and Huggingface's TRL library.

Model size: 7.24B params · Tensor type: BF16 · Format: Safetensors