Add full model card (README.md)
README.md
---
license: apache-2.0
language: en
tags:
- text-generation
- causal-lm
- reinforcement-learning
- GRPO
- instruction-tuning
- chain-of-thought
datasets:
- gsm8k
pipeline_tag: text-generation
widget:
- text: "What is 27 plus 16? Let's think step by step."
---

# GRPO: Finetuned Causal Language Model using Group Relative Policy Optimization

## 🧠 Model Description

**GRPO** is a causal language model fine-tuned using **Group Relative Policy Optimization (GRPO)**, a reinforcement learning algorithm built on PPO that optimizes language models via groupwise reward comparisons. This approach aligns model outputs with reward functions through relative ranking among multiple completions per prompt, making it well-suited for structured generation tasks such as **Chain-of-Thought (CoT)** reasoning.
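
Concretely, in the standard GRPO formulation (a sketch in our notation; the card itself does not spell this out), the reward $r_i$ of each of the $G$ completions sampled for a prompt is standardized within its group, and that group-relative score serves as the advantage in the PPO-style update, so no separate value model is needed:

```latex
% Group-relative advantage: completion i is scored against the other
% completions sampled for the same prompt; no learned critic is required.
A_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}
```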
- **Base model**: `HuggingFaceTB/SmolLM-135M-Instruct`
- **Language**: English