HuangXinBa committed · commit 3bbfa9a · verified · 1 parent: bc15fbb

Add full model card (README.md)

Files changed (1): README.md (+13, −12)
README.md CHANGED
@@ -1,27 +1,28 @@
+
  ---
  license: apache-2.0
  language: en
  tags:
- - text-generation
- - causal-lm
- - reinforcement-learning
- - GRPO
- - instruction-tuning
- - chain-of-thought
- - trl
- - grpo
+ - text-generation
+ - causal-lm
+ - reinforcement-learning
+ - GRPO
+ - instruction-tuning
+ - chain-of-thought
  datasets:
- - gsm8k
+ - gsm8k
  pipeline_tag: text-generation
  widget:
- - text: What is 27 plus 16? Let's think step by step.
+ - text: "What is 27 plus 16? Let's think step by step."
  ---

- # GRPO: Finetuned Causal Language Model using Generalized Reinforcement Policy Optimization
+ # GRPO: Finetuned Causal Language Model using Group Relative Policy Optimization
+

  ## 🧠 Model Description

- **GRPO** is a causal language model fine-tuned using the **GRPO (Generalized Reinforcement Policy Optimization)** algorithm, a variant of PPO optimized for reward-guided instruction following. This model was aligned for structured outputs in **Chain-of-Thought (CoT)** reasoning tasks.
+ **GRPO** is a causal language model fine-tuned using **Group Relative Policy Optimization (GRPO)**, a reinforcement learning algorithm built on PPO that optimizes language models via groupwise reward comparisons. This approach aligns model outputs with reward functions through relative ranking among multiple completions per prompt, making it well-suited for structured generation tasks such as **Chain-of-Thought (CoT)** reasoning.
+

  - **Base model**: `HuggingFaceTB/SmolLM-135M-Instruct`
  - **Language**: English
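
The updated card describes groupwise, reward-guided GRPO training on GSM8K, but the commit contains no training script. Below is a minimal sketch of what such a run could look like with TRL's `GRPOTrainer`, starting from the base model named above. The prompt template, correctness reward, output path, and hyperparameters are assumptions for illustration, not the recipe behind this checkpoint.

```python
# Minimal GRPO training sketch with TRL's GRPOTrainer.
# Illustrative only: the reward function, prompt formatting, and hyperparameters
# are assumptions, not the configuration used to produce this checkpoint.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GSM8K ships "question"/"answer" columns; GRPOTrainer expects a "prompt" column.
dataset = load_dataset("gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"] + " Let's think step by step."})

def correctness_reward(completions, answer, **kwargs):
    # Toy reward: 1.0 if the gold final answer (the text after "####") appears
    # in the sampled completion, else 0.0. Extra dataset columns such as
    # "answer" are forwarded to reward functions as keyword arguments.
    rewards = []
    for completion, gold in zip(completions, answer):
        final = gold.split("####")[-1].strip()
        rewards.append(1.0 if final in completion else 0.0)
    return rewards

config = GRPOConfig(
    output_dir="smollm-135m-grpo",   # hypothetical output path
    num_generations=8,               # group size: completions sampled per prompt
    max_completion_length=256,
    per_device_train_batch_size=8,   # effective batch size must divide evenly by num_generations
    learning_rate=1e-5,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="HuggingFaceTB/SmolLM-135M-Instruct",
    reward_funcs=correctness_reward,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```

The "relative ranking among multiple completions per prompt" mentioned in the new description corresponds to `num_generations` here: rewards are compared within each group of completions sampled for the same prompt, so a completion is reinforced only relative to its siblings.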
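
The widget prompt declared in the frontmatter can also be reproduced locally with a standard `transformers` text-generation pipeline. The repository id below is a placeholder; substitute the actual id of this model repo.

```python
# Running the model card's widget prompt locally.
# "your-username/GRPO" is a placeholder repo id, not confirmed by the commit.
from transformers import pipeline

generator = pipeline("text-generation", model="your-username/GRPO")
prompt = "What is 27 plus 16? Let's think step by step."
output = generator(prompt, max_new_tokens=128, do_sample=False)
print(output[0]["generated_text"])
```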