Update README.md
README.md CHANGED
@@ -144,15 +144,13 @@ For training data details, please see the [Dolma](https://huggingface.co/dataset
 
 ### Hyperparameters
 
-The hyperparameters for
-Certainly! Here's the table with SFT and DPO as rows:
+The hyperparameters for SFT training are below:
 
 |         | Learning Rate | Beta | Epochs | Warmup                                                                      | Weight Decay | Gradient Clipping | Maximum Sequence Length |
 |---------|---------------|------|--------|-----------------------------------------------------------------------------|--------------|-------------------|-------------------------|
 | **SFT** | 2 × 10^-6     | N/A  | 3      | Linear warmup for the first 3% of total training time, then cooldown to 0    | 0            | 0                 | 2048                    |
-| **DPO** | 5 × 10^-7     | 0.1  | 3      | Linear warmup for the first 10% of total training time, then cooldown to 0   | 0            | 0                 | 2048                    |
 
-Compared to Tulu 2,
+Compared to Tulu 2, SFT uses a lower LR, 3 epochs instead of 2, and 2048 length instead of 8192.
 
 ## Bias, Risks, and Limitations
 
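The Warmup entry in the table describes a trapezoid-style schedule: the learning rate ramps linearly from 0 to the peak (2 × 10^-6) over the first 3% of training, then decays linearly back to 0. Below is a minimal sketch of that schedule, assuming step-based scheduling; the function name and structure are illustrative, not taken from the actual training code.

```python
# Hypothetical sketch of the LR schedule described in the table:
# linear warmup over the first 3% of total steps, then linear decay to 0.
# Names are illustrative, not from the model's training code.

def sft_lr(step: int, total_steps: int, peak_lr: float = 2e-6,
           warmup_frac: float = 0.03) -> float:
    """Return the learning rate at a given optimizer step."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear warmup: 0 -> peak_lr over the first 3% of training.
        return peak_lr * step / warmup_steps
    # Linear cooldown: peak_lr -> 0 over the remaining steps.
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / remaining)


if __name__ == "__main__":
    total = 10_000
    for s in (0, 150, 300, 5_000, 10_000):
        print(s, f"{sft_lr(s, total):.2e}")
```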