Uploaded model: Reward Modelling for GRPO - reasoning
- Developed by: LeroyDyer
- License: apache-2.0
- Finetuned from model : LeroyDyer/CheckPoint_D_r1
Sliding window set to NULL
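A minimal sketch of what that setting means in practice, assuming the standard Hugging Face Mistral configuration: setting `sliding_window` to None (NULL) disables sliding-window attention so the model attends over its full context. The loading code is illustrative, not the card's own recipe.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Base model named on this card; loading shown for illustration.
config = AutoConfig.from_pretrained("LeroyDyer/CheckPoint_D_r1")
config.sliding_window = None  # NULL: disable sliding-window attention entirely

model = AutoModelForCausalLM.from_pretrained(
    "LeroyDyer/CheckPoint_D_r1",
    config=config,
)
```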
First attempt at reasoning:
The next models will be the reasoning series, which will end up in a final merge of R models: flattening the model to base, ready to merge with the original checkpoint series. It also means that most future models will be trained on reasoning processes and ReAct processes with reward modelling.
The reward models force the output format:
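As an illustration of what a format-enforcing reward can look like, here is a minimal sketch in the style of a TRL GRPO reward function. The `<think>` tag convention and the plain-text (non-conversational) completions are assumptions for the example; the card does not spell out the exact format.

```python
import re

# Hypothetical target format: a <think>...</think> block followed by an answer.
FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*\S", re.DOTALL)

def format_reward(completions, **kwargs):
    """Return 1.0 per completion that matches the expected format, else 0.0.

    Follows the TRL GRPO reward-function convention: takes a list of
    completions and returns one float per completion.
    """
    return [1.0 if FORMAT_PATTERN.match(c) else 0.0 for c in completions]

# Example usage
print(format_reward(["<think>2 plus 1</think> 3", "no tags here"]))  # [1.0, 0.0]
```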
These models have been intensely trained on chain-of-thought data, so convergence was no problem: the model trained within 10 steps, in fact super fast! So I will revisit some of my previously trained datasets and do reward modelling for them.
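A hedged sketch of how such a run could be wired up with TRL's GRPOTrainer; the dataset, output directory, and batch settings are placeholders, and `max_steps=10` simply mirrors the convergence reported above.

```python
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder dataset; any dataset with a "prompt" column works.
dataset = load_dataset("trl-lib/tldr", split="train")

# Format reward from the sketch above, repeated so this block runs on its own.
def format_reward(completions, **kwargs):
    return [1.0 if re.match(r"^<think>.*?</think>", c, re.DOTALL) else 0.0
            for c in completions]

training_args = GRPOConfig(
    output_dir="grpo-reasoning",    # placeholder output path
    max_steps=10,                   # card reports convergence within ~10 steps
    per_device_train_batch_size=4,  # illustrative; must divide by num_generations
    num_generations=4,              # completions sampled per prompt
)

trainer = GRPOTrainer(
    model="LeroyDyer/CheckPoint_D_r1",  # base model named on this card
    reward_funcs=format_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```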
We ask why?
This is because for an answer to become fully generalised it needs options, i.e. many completions for the same or similar task, via different routes. We can distill from larger models, but that does not train the model to reason; it only gives the model options which include the new responses. We need to reward-train the model to get it to actually think about how that answer was arrived at. So math training seems an obvious choice, as we need the model to determine how 1 + 2 = 3, not repeat it because it is highly likely given the training so far!
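To make the math point concrete, here is a sketch of a correctness reward that scores the model on whether its final answer matches the ground truth, rather than on reproducing seen text. The `answer` column name and the bare-numeric-answer convention are assumptions for illustration; TRL passes extra dataset columns through to reward functions as keyword arguments.

```python
import re

def correctness_reward(completions, answer, **kwargs):
    """Reward 1.0 when the completion's final number matches the ground
    truth, so a chain that actually derives 1 + 2 = 3 is rewarded, not a
    memorised string.

    Assumes the dataset supplies an `answer` column and that completions
    end with a bare numeric answer (illustrative conventions).
    """
    rewards = []
    for completion, truth in zip(completions, answer):
        match = re.search(r"(-?\d+(?:\.\d+)?)\s*$", completion.strip())
        rewards.append(1.0 if match and match.group(1) == str(truth) else 0.0)
    return rewards

# Example: a derivation ending in the right number scores 1.0.
print(correctness_reward(["Adding 1 and 2 gives 3"], [3]))  # [1.0]
```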
This Mistral model was trained 2x faster with Unsloth and Hugging Face's TRL library.