# H100 Lightweight Training Configuration Guide
This guide explains the new **H100 Lightweight (Rapid)** training configuration, optimized for rapid fine-tuning on H100 GPUs with a small, carefully selected dataset.
## 🎯 Overview
The H100 Lightweight configuration is designed for:
- **Rapid experimentation** on H100 GPUs
- **Efficient training** with 80K carefully selected samples
- **Quick iteration** for research and development
- **Cost-effective** training sessions
## 🚀 Key Features
### **Optimized for H100**
- **Batch Size**: 16 (larger than A100 configs)
- **Gradient Accumulation**: 4 (reduced for faster updates)
- **Learning Rate**: 8e-6 (slightly higher for rapid convergence)
- **Sequence Length**: 8192 (full context window)
### **Dataset Sampling**
- **Source**: OpenHermes-FR dataset
- **Sample Size**: 80,000 random samples
- **Validation**: 1,000 samples (if available)
- **Reproducibility**: Fixed random seed (42); see the sampling sketch below
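The sampling step is conceptually simple. Below is a minimal sketch, using the `datasets` library, of how a reproducible 80K/1K split can be drawn; the actual launch script may structure this differently (for example, filtering columns or applying a chat template).

```python
from datasets import load_dataset

# Load the full OpenHermes-FR training split
dataset = load_dataset("legmlai/openhermes-fr", split="train")

# Shuffle with the fixed seed so the subset is reproducible, then carve out
# 80,000 training samples and 1,000 validation samples.
shuffled = dataset.shuffle(seed=42)
train_subset = shuffled.select(range(80_000))
eval_subset = shuffled.select(range(80_000, 81_000))

print(f"Train: {len(train_subset)}  Validation: {len(eval_subset)}")
```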
### **Training Optimizations**
- **Warmup Steps**: 50 (reduced for rapid training)
- **Evaluation**: Every 50 steps
- **Logging**: Every 5 steps
- **Saving**: Every 200 steps
- **Checkpoints**: Keep only 2 to save storage (see the config sketch below)
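These cadences typically map onto config fields like the ones below, shown in the same style as the snippets in the next section. The attribute names `eval_steps`, `logging_steps`, and `save_steps` are assumptions, so verify them against `config/train_smollm3_h100_lightweight.py`.

```python
# Assumed field names; check config/train_smollm3_h100_lightweight.py for the exact ones
warmup_steps=50        # short warmup for rapid training
eval_steps=50          # evaluate every 50 steps
logging_steps=5        # log metrics every 5 steps
save_steps=200         # write a checkpoint every 200 steps
save_total_limit=2     # keep only the 2 most recent checkpoints
```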
## 📋 Configuration Details
### **Model Configuration**
```python
model_name="HuggingFaceTB/SmolLM3-3B"
max_seq_length=8192
use_flash_attention=True
use_gradient_checkpointing=True
```
### **Training Parameters**
```python
batch_size=16
gradient_accumulation_steps=4
learning_rate=8e-6
warmup_steps=50
max_epochs=1
```
### **H100-Specific Optimizations**
```python
dataloader_num_workers=4
dataloader_pin_memory=True
gradient_clipping=1.0
group_by_length=True
pad_to_multiple_of=8
```
### **Memory Optimizations**
```python
save_total_limit=2
early_stopping_patience=3
max_grad_norm=1.0
warmup_ratio=0.1
```
## 🔧 Usage
### **Interactive Selection**
```bash
./launch.sh
# Select "H100 Lightweight (Rapid)" when prompted
```
### **Expected Training Time**
- **H100**: ~2-4 hours (depending on the exact setup)
- **A100**: ~4-6 hours
- **V100**: ~6-8 hours
### **Memory Requirements**
- **GPU Memory**: 40GB+ (H100 recommended)
- **System RAM**: 32GB+
- **Storage**: 50GB+ for dataset and checkpoints
## 📊 Performance Characteristics
### **Training Speed**
- **Steps per Second**: ~2-3 (on H100)
- **Samples per Second**: ~32-48
- **Effective Batch Size**: 64 (16 × 4; see the arithmetic sketch below)
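For context, these step counts follow from simple arithmetic (assuming single-GPU training on the 80K split). Note that the "5000 total steps" in the example output later in this guide counts micro-batch steps, while optimizer updates are a quarter of that.

```python
batch_size = 16          # per-device micro-batch
grad_accum = 4           # gradient accumulation steps
train_samples = 80_000   # sampled training set size

effective_batch = batch_size * grad_accum      # 64 samples per optimizer update
micro_steps = train_samples // batch_size      # 5,000 forward/backward passes per epoch
optimizer_steps = micro_steps // grad_accum    # 1,250 weight updates per epoch

print(effective_batch, micro_steps, optimizer_steps)  # 64 5000 1250
```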
### **Convergence**
- **Expected Loss**: 1.2-1.8 (after 1 epoch)
- **Evaluation Frequency**: Every 50 steps
- **Early Stopping**: After 3 evaluations without improvement (see the callback sketch below)
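If the training loop is built on the Hugging Face `Trainer`, a patience of 3 evaluations is usually expressed with `EarlyStoppingCallback`. This is an illustrative sketch rather than a guarantee of how the repo wires it up.

```python
from transformers import EarlyStoppingCallback

# Stop once the monitored eval metric has not improved for 3 consecutive evaluations.
# Requires load_best_model_at_end=True and metric_for_best_model in TrainingArguments.
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)

# trainer = Trainer(..., callbacks=[early_stopping])
```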
### **Dataset Efficiency**
- **80K samples**: ~1.3% of full OpenHermes-FR
- **Random sampling**: Ensures diversity
- **Fixed seed**: Reproducible results
## 🎯 Use Cases
### **Perfect For**
- **Rapid prototyping** of new ideas
- **Hyperparameter tuning** experiments
- **Model comparison** studies
- **Research validation** before full training
- **Educational purposes** and learning
### **Not Recommended For**
- **Production models** (use Multiple Passes instead)
- **Competition submissions** (use full dataset)
- **Research papers** (use complete training)
## 📊 Comparison with Other Configurations
| Configuration | Dataset Size | Batch Size | Epochs | Training Time | Use Case |
|---------------|--------------|------------|--------|---------------|----------|
| **Basic Training** | Full SmolTalk | 2 | 3 | 6-8 hours | Learning |
| **H100 Lightweight** | 80K Hermes-FR | 16 | 1 | 2-4 hours | Rapid experiments |
| **A100 Large Scale** | Full Hermes-FR | 8 | 1.3 | 8-12 hours | Serious research |
| **Multiple Passes** | Full Hermes-FR | 6 | 4 | 24-36 hours | Production |
## 🛠️ Customization
### **Modifying Sample Size**
```bash
# In the launch script, you can modify:
DATASET_SAMPLE_SIZE=50000 # For 50K samples
DATASET_SAMPLE_SIZE=100000 # For 100K samples
```
### **Adjusting Training Parameters**
```bash
# Modify in config/train_smollm3_h100_lightweight.py:
batch_size=12 # Smaller batch size
learning_rate=6e-6 # Lower learning rate
warmup_steps=100 # More warmup steps
```
### **Changing Dataset**
```bash
# Modify the dataset name in the configuration:
dataset_name="your-custom-dataset"
```
## 📈 Monitoring and Results
### **Trackio Integration**
- **Real-time metrics**: Loss, learning rate, gradient norm
- **Training curves**: Visual progress tracking
- **Resource usage**: GPU utilization, memory consumption
- **Artifacts**: Model checkpoints, logs (a minimal logging sketch follows)
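Trackio exposes a wandb-style API, so manual logging outside the pipeline might look roughly like the sketch below; the project name is made up, and the actual integration is handled for you by the launch script.

```python
import trackio  # assumes Trackio's wandb-compatible init/log/finish entry points

trackio.init(project="smollm3-h100-lightweight")  # hypothetical project name

# In the real pipeline these values come from the Trainer; dummy numbers here.
for step in range(0, 100, 5):
    trackio.log({"train/loss": 3.0 - 0.01 * step, "lr": 8e-6})

trackio.finish()
```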
### **Expected Metrics**
- **Training Loss**: Starts ~3.0, ends ~1.5
- **Validation Loss**: Should be close to training loss
- **Learning Rate**: Cosine decay from 8e-6 to 2e-6 (see the schedule sketch after this list)
- **Gradient Norm**: Should stay below 1.0
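The 8e-6 to 2e-6 figure corresponds to a cosine decay with a non-zero floor. Here is a self-contained sketch of that curve; the floor value and the warmup behaviour are assumptions for illustration, not values read from the config.

```python
import math

def cosine_lr(step, total_steps, warmup_steps=50, peak_lr=8e-6, min_lr=2e-6):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: LR at a few points over a 1,250-update epoch
for s in (0, 50, 625, 1250):
    print(s, f"{cosine_lr(s, 1250):.2e}")
```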
### **Success Indicators**
- **Converging loss**: Steady decrease over time
- **Stable gradients**: Consistent gradient norms
- **Good validation**: Validation loss follows training loss
- **No overfitting**: Validation loss doesn't increase
## 🚨 Troubleshooting
### **Common Issues**
#### **Out of Memory (OOM)**
```bash
# Reduce batch size in config:
batch_size=12 # Instead of 16
gradient_accumulation_steps=6 # Instead of 4
```
#### **Slow Training**
```bash
# Check GPU utilization:
nvidia-smi
# Ensure CUDA is properly installed
python -c "import torch; print(torch.cuda.is_available())"
```
#### **Poor Convergence**
```bash
# Try different learning rate:
learning_rate=6e-6 # Instead of 8e-6
# Or increase warmup:
warmup_steps=100 # Instead of 50
```
#### **Dataset Issues**
```bash
# Check dataset loading:
python -c "from datasets import load_dataset; print(len(load_dataset('legmlai/openhermes-fr')['train']))"
```
### **Performance Tips**
1. **Use H100 if available**: Significantly faster than A100
2. **Monitor GPU memory**: Keep utilization below 90% (see the check after this list)
3. **Check logs regularly**: Look for convergence issues
4. **Save checkpoints**: Don't lose progress
5. **Use early stopping**: Prevent overfitting
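For tip 2, a quick in-Python check of memory headroom (assuming PyTorch with CUDA), complementary to `nvidia-smi`:

```python
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes free/total on the current device
    used_pct = 100 * (total - free) / total
    print(f"GPU memory in use: {used_pct:.1f}% "
          f"({(total - free) / 1e9:.1f} GB of {total / 1e9:.1f} GB)")
    if used_pct > 90:
        print("Consider lowering batch_size or raising gradient_accumulation_steps.")
else:
    print("CUDA not available")
```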
## 📝 Example Workflow
### **Complete H100 Lightweight Training**
```bash
# 1. Setup
python setup_launch.py
# 2. Check requirements
python check_requirements.py
# 3. Run interactive pipeline
./launch.sh
# 4. Select configuration
# Choose: "H100 Lightweight (Rapid)"
# 5. Monitor training
# Watch Trackio Space for real-time progress
# 6. Check results
# Model will be pushed to HF Hub
# Summary in training_summary.md
```
### **Expected Output**
```
✅ Dataset prepared: 80000 train samples, 1000 validation samples
🚀 Training started with 5000 total steps
⏱️ Estimated time: 2-4 hours
📊 Monitor progress at: https://huggingface.co/spaces/...
```
## 🎉 Benefits
### **Speed**
- **3-4x faster** than full dataset training
- **Rapid iteration** for research
- **Quick validation** of ideas
### **Efficiency**
- **Reduced costs** (less GPU time)
- **Lower storage** requirements
- **Faster experimentation** cycle
### **Quality**
- **Still high quality** results
- **Good for prototyping**
- **Suitable for many use cases**
## 🔮 Future Enhancements
### **Planned Improvements**
- **Adaptive sampling**: Smart dataset selection
- **Multi-GPU support**: Distributed training
- **Advanced monitoring**: More detailed metrics
- **Auto-tuning**: Automatic hyperparameter optimization
### **Extensibility**
- **Custom datasets**: Easy integration
- **Different models**: Support for other architectures
- **Advanced sampling**: Stratified, balanced sampling
---
**Happy Rapid Training on H100! 🚀**