anakin87
posted an update about 1 month ago
I trained a Language Model to schedule events with GRPO! 👑 🗓️

✍️ Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo

I have been experimenting with GRPO lately.

I am fascinated by models learning from prompts and rewards alone, without the example answers that Supervised Fine-Tuning requires.
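
To make "learning from prompts and rewards" a bit more concrete, here is a minimal sketch of the group-relative scoring idea behind GRPO: several completions are sampled for the same prompt, each gets a reward, and each reward is compared with the rest of its group. The function name, the `eps` value, and the use of a population standard deviation are assumptions for illustration; real implementations differ in the details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper for illustration: completions that score above the
    group mean get a positive advantage, those below get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, not a fixed rule
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for one prompt, scored by some reward function
rewards = [0.0, 1.0, 1.0, 2.0]
advantages = group_relative_advantages(rewards)
```

No reference answer appears anywhere: the only training signal is how each completion scored relative to its siblings.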

After the DeepSeek boom, everyone is trying GRPO with GSM8K or the Countdown Game...

I wanted a different challenge, like teaching a model to create a schedule from a list of events and priorities.

Choosing an original problem forced me to:
🤔 Think about the problem setting
🧬 Generate data
🤏 Choose the right base model
🏆 Design reward functions (and experience reward hacking)
🔄 Run multiple rounds of training, hoping that my model would learn something.
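
As an illustration of the reward design step, here is a rough sketch of what a scheduling reward could look like. This is not the reward from the blog post: the data format (times as minutes since midnight), the double weight for priority events, and the name `schedule_reward` are all assumptions made for this example.

```python
def schedule_reward(events, priorities, schedule):
    """Hypothetical reward: priority-weighted minutes scheduled, or 0.0 if invalid.

    events: dict mapping event name -> (start, end) in minutes since midnight
    priorities: set of event names that count double
    schedule: list of (name, start, end) tuples proposed by the model
    """
    seen = set()
    for name, start, end in schedule:
        # Reject invented events, altered times, and duplicated events
        if events.get(name) != (start, end) or name in seen:
            return 0.0
        seen.add(name)

    # Reject overlapping events
    ordered = sorted(schedule, key=lambda e: e[1])
    for (_, _, prev_end), (_, next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            return 0.0

    def weight(name):
        return 2 if name in priorities else 1

    achieved = sum((end - start) * weight(name) for name, start, end in schedule)
    # Upper bound: every event scheduled (not always reachable if events clash)
    total = sum((end - start) * weight(name) for name, (start, end) in events.items())
    return achieved / total if total else 0.0

events = {"standup": (540, 570), "review": (560, 620), "lunch": (720, 780)}
priorities = {"review"}
valid = [("review", 560, 620), ("lunch", 720, 780)]
overlapping = [("standup", 540, 570), ("review", 560, 620)]
invented = [("nap", 0, 1000)]
```

The validity checks matter for reward hacking: without them, a model could raise its score by inventing long events or by stacking overlapping ones, rather than by producing a better schedule.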

A fun and rewarding 😄 experience.


I learned a lot that I want to share with you. 👇
✍️ Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo
💻 Code: https://github.com/anakin87/qwen-scheduler-grpo
🤗 Hugging Face collection (dataset and model): anakin87/qwen-scheduler-grpo-680bcc583e817390525a8837

Love how you stepped off the beaten path and tackled a more practical, structured challenge. Training a model to build schedules is no small feat! GRPO really shines in these setups where reward design becomes the art. Your experience captures the real spirit of experimentation: data crafting, reward tuning, and watching models learn from scratch. Curious to hear how your reward functions evolved, especially how you handled reward hacking. Great work!

Did you do this because of the DeepSeek boom? You could also try GSM8K, since an original problem can take time to set up. GRPO with GSM8K seems more effective on large problems (like large LLMs). GRPO really shines in these setups where reward design becomes an art. Still, great work on what you did! Nice work.