# H100 Lightweight Training Configuration Guide

This guide explains the new **H100 Lightweight (Rapid)** training configuration, optimized for rapid fine-tuning on H100 GPUs with a small, carefully selected dataset.

## 🎯 Overview

The H100 Lightweight configuration is designed for:
- **Rapid experimentation** on H100 GPUs
- **Efficient training** with 80K carefully selected samples
- **Quick iteration** for research and development
- **Cost-effective** training sessions

## 🚀 Key Features

### **Optimized for H100**
- **Batch Size**: 16 (larger than A100 configs)
- **Gradient Accumulation**: 4 (reduced for faster updates)
- **Learning Rate**: 8e-6 (slightly higher for rapid convergence)
- **Sequence Length**: 8192 (full context window)

### **Dataset Sampling**
- **Source**: OpenHermes-FR dataset
- **Sample Size**: 80,000 random samples
- **Validation**: 1,000 samples (if available)
- **Reproducibility**: Fixed random seed (42)
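
As a hedged sketch (not the project's actual data-preparation code), the sampling above can be reproduced with the Hugging Face `datasets` library roughly like this:

```python
from datasets import load_dataset

# Illustrative only: draw a reproducible subset of OpenHermes-FR.
raw = load_dataset("legmlai/openhermes-fr", split="train")

shuffled = raw.shuffle(seed=42)                    # fixed seed for reproducibility
train_ds = shuffled.select(range(80_000))          # 80,000 training samples
eval_ds = shuffled.select(range(80_000, 81_000))   # 1,000 held-out validation samples

print(len(train_ds), len(eval_ds))  # 80000 1000
```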

### **Training Optimizations**
- **Warmup Steps**: 50 (reduced for rapid training)
- **Evaluation**: Every 50 steps
- **Logging**: Every 5 steps
- **Saving**: Every 200 steps
- **Checkpoints**: Keep only 2 (save storage)
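
If the training loop is built on `transformers.Trainer`, the cadence above maps roughly onto the following `TrainingArguments` fields (a sketch under that assumption, not the project's actual configuration file):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./outputs-h100-lightweight",  # hypothetical output path
    warmup_steps=50,              # short warmup for rapid training
    evaluation_strategy="steps",  # renamed to `eval_strategy` in recent transformers releases
    eval_steps=50,                # evaluate every 50 steps
    logging_steps=5,              # log every 5 steps
    save_steps=200,               # checkpoint every 200 steps
    save_total_limit=2,           # keep only the 2 most recent checkpoints
)
```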

## 📊 Configuration Details

### **Model Configuration**
```python
model_name="HuggingFaceTB/SmolLM3-3B"
max_seq_length=8192
use_flash_attention=True
use_gradient_checkpointing=True
```
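
A hedged sketch of loading the model with these settings (the bfloat16 dtype is an assumption; flash attention requires the `flash-attn` package and a compatible GPU):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    torch_dtype=torch.bfloat16,               # assumption: bf16 weights on H100
    attn_implementation="flash_attention_2",  # needs flash-attn installed
)
model.gradient_checkpointing_enable()         # trade compute for memory
```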

### **Training Parameters**
```python
batch_size=16
gradient_accumulation_steps=4
learning_rate=8e-6
warmup_steps=50
max_epochs=1
```

### **H100-Specific Optimizations**
```python
dataloader_num_workers=4    # parallel workers feeding the GPU
dataloader_pin_memory=True  # pinned host memory for faster CPU-to-GPU transfers
gradient_clipping=1.0       # clip gradients to stabilize updates
group_by_length=True        # bucket similar-length samples to reduce padding
pad_to_multiple_of=8        # pad to multiples of 8 for tensor-core efficiency
```

### **Memory Optimizations**
```python
save_total_limit=2          # keep only the 2 most recent checkpoints
early_stopping_patience=3   # stop after 3 evaluations without improvement
max_grad_norm=1.0           # gradient clipping threshold
warmup_ratio=0.1            # fraction of total steps used for LR warmup
```

## 🔧 Usage

### **Interactive Selection**
```bash
./launch.sh
# Select "H100 Lightweight (Rapid)" when prompted
```

### **Expected Training Time**
- **H100**: ~2-4 hours (depending on exact settings)
- **A100**: ~4-6 hours
- **V100**: ~6-8 hours

### **Memory Requirements**
- **GPU Memory**: 40GB+ (H100 recommended)
- **System RAM**: 32GB+
- **Storage**: 50GB+ for dataset and checkpoints

## 📈 Performance Characteristics

### **Training Speed**
- **Steps per Second**: ~2-3 (on H100)
- **Samples per Second**: ~32-48
- **Effective Batch Size**: 64 (16 × 4)
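
These figures can be sanity-checked with simple arithmetic on the 80K-sample split; counting one step per forward/backward pass, the numbers line up with the "5000 total steps" shown in the example output later in this guide:

```python
# Back-of-the-envelope step math for the lightweight run.
per_device_batch = 16
grad_accum = 4
samples = 80_000

effective_batch = per_device_batch * grad_accum          # 64
micro_batches_per_epoch = samples // per_device_batch    # 5,000 forward/backward passes
optimizer_updates_per_epoch = samples // effective_batch # 1,250 parameter updates

print(effective_batch, micro_batches_per_epoch, optimizer_updates_per_epoch)
```

At ~2-3 of these micro-steps per second with a per-device batch of 16, throughput works out to the ~32-48 samples per second quoted above.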

### **Convergence**
- **Expected Loss**: 1.2-1.8 (after 1 epoch)
- **Evaluation Frequency**: Every 50 steps
- **Early Stopping**: After 3 evaluations without improvement
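
With a `transformers.Trainer`-based loop, that early-stopping behaviour is typically wired in through `EarlyStoppingCallback`; a hedged sketch under that assumption:

```python
from transformers import EarlyStoppingCallback

# Stop after 3 consecutive evaluations without improvement. Note that the
# callback requires `load_best_model_at_end=True` and a `metric_for_best_model`
# (e.g. "eval_loss") in TrainingArguments.
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
# ...then pass `callbacks=[early_stop]` when constructing the Trainer.
```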

### **Dataset Efficiency**
- **80K samples**: a small fraction of the full OpenHermes-FR dataset
- **Random sampling**: Ensures diversity
- **Fixed seed**: Reproducible results

## 🎯 Use Cases

### **Perfect For**
- **Rapid prototyping** of new ideas
- **Hyperparameter tuning** experiments
- **Model comparison** studies
- **Research validation** before full training
- **Educational purposes** and learning

### **Not Recommended For**
- **Production models** (use Multiple Passes instead)
- **Competition submissions** (use full dataset)
- **Research papers** (use complete training)

## 🔄 Comparison with Other Configurations

| Configuration | Dataset Size | Batch Size | Epochs | Training Time | Use Case |
|---------------|--------------|------------|--------|---------------|----------|
| **Basic Training** | Full SmolTalk | 2 | 3 | 6-8 hours | Learning |
| **H100 Lightweight** | 80K Hermes-FR | 16 | 1 | 2-4 hours | Rapid experiments |
| **A100 Large Scale** | Full Hermes-FR | 8 | 1.3 | 8-12 hours | Serious research |
| **Multiple Passes** | Full Hermes-FR | 6 | 4 | 24-36 hours | Production |

## 🛠️ Customization

### **Modifying Sample Size**
```bash
# In the launch script, you can modify:
DATASET_SAMPLE_SIZE=50000  # For 50K samples
DATASET_SAMPLE_SIZE=100000 # For 100K samples
```
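
How that variable reaches the training code depends on the launch script's wiring; a hypothetical way to honour it on the Python side (only `DATASET_SAMPLE_SIZE` comes from the block above, the rest is illustrative) might be:

```python
import os

# Hypothetical: read the sample size exported by launch.sh, defaulting to 80K.
sample_size = int(os.environ.get("DATASET_SAMPLE_SIZE", "80000"))
```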

### **Adjusting Training Parameters**
```python
# Modify in config/train_smollm3_h100_lightweight.py:
batch_size=12              # Smaller batch size
learning_rate=6e-6         # Lower learning rate
warmup_steps=100           # More warmup steps
```

### **Changing Dataset**
```python
# Modify the dataset name in the configuration:
dataset_name="your-custom-dataset"
```

## 📊 Monitoring and Results

### **Trackio Integration**
- **Real-time metrics**: Loss, learning rate, gradient norm
- **Training curves**: Visual progress tracking
- **Resource usage**: GPU utilization, memory consumption
- **Artifacts**: Model checkpoints, logs

### **Expected Metrics**
- **Training Loss**: Starts ~3.0, ends ~1.5
- **Validation Loss**: Should be close to training loss
- **Learning Rate**: Cosine decay from 8e-6 to 2e-6
- **Gradient Norm**: Should stay below 1.0
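
As a rough illustration of that schedule (the real scheduler lives in the training code; only the 8e-6 peak, 2e-6 floor, and 50 warmup steps come from this guide), a cosine decay with warmup and a floor can be written as:

```python
import math

def cosine_lr(step, total_steps, warmup_steps=50, peak=8e-6, floor=2e-6):
    """Illustrative cosine schedule with linear warmup and a fixed floor."""
    if step < warmup_steps:
        return peak * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0, 5_000), cosine_lr(2_500, 5_000), cosine_lr(5_000, 5_000))
```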

### **Success Indicators**
- **Converging loss**: Steady decrease over time
- **Stable gradients**: Consistent gradient norms
- **Good validation**: Validation loss follows training loss
- **No overfitting**: Validation loss doesn't increase

## 🚨 Troubleshooting

### **Common Issues**

#### **Out of Memory (OOM)**
```python
# Reduce batch size in config:
batch_size=12                  # Instead of 16
gradient_accumulation_steps=6  # Instead of 4
```

#### **Slow Training**
```bash
# Check GPU utilization:
nvidia-smi
# Ensure CUDA is properly installed
python -c "import torch; print(torch.cuda.is_available())"
```

#### **Poor Convergence**
```python
# Try a different learning rate:
learning_rate=6e-6  # Instead of 8e-6
# Or increase warmup:
warmup_steps=100    # Instead of 50
```

#### **Dataset Issues**
```bash
# Check dataset loading:
python -c "from datasets import load_dataset; print(len(load_dataset('legmlai/openhermes-fr')['train']))"
```

### **Performance Tips**

1. **Use H100 if available**: Significantly faster than A100
2. **Monitor GPU memory**: Keep utilization below 90%
3. **Check logs regularly**: Look for convergence issues
4. **Save checkpoints**: Don't lose progress
5. **Use early stopping**: Prevent overfitting

## 📋 Example Workflow

### **Complete H100 Lightweight Training**
```bash
# 1. Setup
python setup_launch.py

# 2. Check requirements
python check_requirements.py

# 3. Run interactive pipeline
./launch.sh

# 4. Select configuration
# Choose: "H100 Lightweight (Rapid)"

# 5. Monitor training
# Watch Trackio Space for real-time progress

# 6. Check results
# Model will be pushed to HF Hub
# Summary in training_summary.md
```

### **Expected Output**
```
✅ Dataset prepared: 80000 train samples, 1000 validation samples
📈 Training started with 5000 total steps
⏱️ Estimated time: 2-4 hours
📊 Monitor progress at: https://huggingface.co/spaces/...
```

## 🎉 Benefits

### **Speed**
- **3-4x faster** than full dataset training
- **Rapid iteration** for research
- **Quick validation** of ideas

### **Efficiency**
- **Reduced costs** (less GPU time)
- **Lower storage** requirements
- **Faster experimentation** cycle

### **Quality**
- **Still high quality** results
- **Good for prototyping**
- **Suitable for many use cases**

## 🔮 Future Enhancements

### **Planned Improvements**
- **Adaptive sampling**: Smart dataset selection
- **Multi-GPU support**: Distributed training
- **Advanced monitoring**: More detailed metrics
- **Auto-tuning**: Automatic hyperparameter optimization

### **Extensibility**
- **Custom datasets**: Easy integration
- **Different models**: Support for other architectures
- **Advanced sampling**: Stratified, balanced sampling

---

**Happy Rapid Training on H100! 🚀**