Update README.md
README.md
Stop sequence == "\*\*Finished.\*\*"
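
For illustration, a minimal sketch of passing that stop sequence at inference time, assuming the model is served behind an OpenAI-compatible endpoint (the base_url, prompt, and choice of checkpoint below are illustrative, not part of this repo):

```python
# Minimal sketch (assumption: the model is served behind an OpenAI-compatible
# endpoint such as vLLM; the base_url and prompt are placeholders).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Heralax/llama-gRPo-emotions-nothoughts",  # or the factual demo model
    messages=[{"role": "user", "content": "How are you feeling today?"}],
    stop=["**Finished.**"],  # the model's stop sequence noted above
)
print(response.choices[0].message.content)
```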
Related Links:

- [Augmentoolkit](https://github.com/e-p-armstrong/augmentoolkit)
- [Augmentoolkit Factual Demo Model (the products of the quickstart)](https://huggingface.co/Heralax/llama-Augmentoolkit-Quickstart-Factual-Demo)
- [gRPo model (no thoughts)](https://huggingface.co/Heralax/llama-gRPo-emotions-nothoughts)

Notes:

This attempt at getting emotional responses mostly succeeded: when the model writes well, it writes *well*. However, an interesting quirk emerged: the model ended up putting most of its emotional exposition in the thought process rather than in the final, visible response. This may be because it was graded higher when it explained its emotions exhaustively, or simply because models trained to think with SFT first tend to pad their responses with lots of thoughts. While I think this overall came out well, it also serves as an example of what to look out for. I think the next version will, in addition to the LLM-as-a-reward-function approach, include a reward function that favors final answers above a certain length.
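
A minimal sketch of what such a length-floor reward could look like, sitting alongside the LLM judge; the `<think>` tag format, the character threshold, and the function name are illustrative assumptions rather than the actual training code:

```python
# Rough sketch of a length-floor reward to complement the LLM-as-a-reward-function judge.
# Assumptions: completions wrap their reasoning in <think>...</think> tags, and
# MIN_ANSWER_CHARS is an arbitrary placeholder threshold, not a tuned value.
import re

MIN_ANSWER_CHARS = 300

def answer_length_reward(completions: list[str]) -> list[float]:
    rewards = []
    for text in completions:
        # Strip the thought block so only the visible final answer is measured.
        visible = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
        rewards.append(1.0 if len(visible) >= MIN_ANSWER_CHARS else 0.0)
    return rewards
```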