Add full model card (README.md)
README.md
---
license: apache-2.0
language: en
tags:
- text-generation
- causal-lm
- reinforcement-learning
- GRPO
- instruction-tuning
- chain-of-thought
datasets:
- gsm8k
pipeline_tag: text-generation
widget:
- text: "What is 27 plus 16? Let's think step by step."
---

# GRPO: Finetuned Causal Language Model using Group Relative Policy Optimization

## 🧠 Model Description

**GRPO** is a causal language model fine-tuned using **Group Relative Policy Optimization (GRPO)**, a reinforcement learning algorithm built on PPO that optimizes language models via groupwise reward comparisons. This approach aligns model outputs with reward functions through relative ranking among multiple completions per prompt, making it well-suited for structured generation tasks such as **Chain-of-Thought (CoT)** reasoning.
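
Concretely, in the standard GRPO formulation (a sketch in our notation; the card itself does not spell this out), the reward $r_i$ of each of the $G$ completions sampled for a prompt is standardized within its group, and that group-relative score serves as the advantage in the PPO-style update, so no separate value model is needed:

```latex
% Group-relative advantage: completion i is scored against the other
% completions sampled for the same prompt; no learned critic is required.
A_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}
```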
- **Base model**: `HuggingFaceTB/SmolLM-135M-Instruct`
- **Language**: English