# A100 Large Scale Training Guide
This guide provides configurations and instructions for running full-scale experiments, including multiple passes over the full OpenHermes-FR dataset (800k+ datapoints), on A100 GPUs.
## Available Configurations
### 1. A100 Large Batch Configuration
**File**: `config/train_smollm3_openhermes_fr_a100_large.py`
**Key Features**:
- **Effective Batch Size**: 128 (8 × 16 gradient accumulation)
- **Training Duration**: ~1.3 passes (8,000 steps)
- **Learning Rate**: 5e-6 (optimized for large batches)
- **Mixed Precision**: bf16 (A100 optimized)
- **Sequence Length**: 8192 tokens
- **Memory Optimizations**: No gradient checkpointing for A100 efficiency
**Estimated Training Time**: ~6-8 hours on A100
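For reference, the key settings above might look roughly like this inside the config file (an illustrative sketch; the actual attribute names in `config/train_smollm3_openhermes_fr_a100_large.py` may differ):
```python
# Illustrative sketch only -- field names are assumptions, not the literal config contents.
batch_size = 8                      # per-device micro-batch
gradient_accumulation_steps = 16    # 8 * 16 = effective batch size of 128
learning_rate = 5e-6                # conservative LR for the large effective batch
max_iters = 8000                    # ~1.3 passes over 800k datapoints
max_seq_length = 8192               # long sequences
bf16 = True                         # A100-friendly mixed precision
use_gradient_checkpointing = False  # disabled: A100 80GB has the memory headroom
```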
### 2. Multiple Passes Configuration
**File**: `config/train_smollm3_openhermes_fr_a100_multiple_passes.py`
**Key Features**:
- **Effective Batch Size**: 120 (6 × 20 gradient accumulation)
- **Training Duration**: ~4 passes (25,000 steps)
- **Learning Rate**: 3e-6 (conservative for long training)
- **Warmup Steps**: 2000 (longer warmup for stability)
- **Checkpoint Strategy**: More frequent saves (every 2000 steps)
**Estimated Training Time**: ~20-24 hours on A100
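Relative to the large batch configuration, only a handful of fields change (again an illustrative sketch, not the literal file contents):
```python
# Illustrative sketch only -- field names are assumptions.
batch_size = 6
gradient_accumulation_steps = 20    # 6 * 20 = effective batch size of 120
learning_rate = 3e-6                # lower LR for long training runs
warmup_steps = 2000                 # longer warmup for stability
max_iters = 25000                   # ~4 passes over 800k datapoints
save_steps = 2000                   # more frequent checkpoints
```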
## Training Commands
### Quick Start - Large Batch Experiment
```bash
python run_a100_large_experiment.py \
--config config/train_smollm3_openhermes_fr_a100_large.py \
--experiment-name "smollm3_openhermes_fr_large_batch" \
--output-dir ./outputs/large_batch
```
### Multiple Passes Experiment
```bash
python run_a100_large_experiment.py \
--config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
--experiment-name "smollm3_openhermes_fr_multiple_passes" \
--output-dir ./outputs/multiple_passes
```
### Dry Run (Check Configuration)
```bash
python run_a100_large_experiment.py \
--config config/train_smollm3_openhermes_fr_a100_large.py \
--dry-run
```
### Resume Training
```bash
python run_a100_large_experiment.py \
--config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
--resume ./outputs/multiple_passes/checkpoint-10000 \
--output-dir ./outputs/multiple_passes
```
## Configuration Details
### Memory Usage Optimization
- **Gradient Checkpointing**: Disabled for A100 efficiency
- **Flash Attention**: Enabled for memory efficiency
- **bf16 Mixed Precision**: Better for A100 than fp16
- **Gradient Clipping**: 1.0 for stability
- **Group by Length**: Enabled for better batching
### Data Loading Optimization
- **Num Workers**: 8 for faster data loading
- **Pin Memory**: Enabled for GPU transfer efficiency
- **Prefetch Factor**: 2 for pipeline optimization
### Training Stability
- **Conservative Learning Rate**: Lower LR for large effective batch sizes
- **Longer Warmup**: More warmup steps for stability
- **Higher Beta2**: 0.999 for AdamW stability
- **Gradient Clipping**: Prevents gradient explosion
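If the underlying trainer is the Hugging Face `Trainer` (an assumption; the config files in this repo may use their own field names), the optimization, data-loading, and stability settings above map roughly onto `TrainingArguments` like this:
```python
# Illustrative mapping only -- assumes Hugging Face Trainer; adjust to your own config.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs/multiple_passes",
    per_device_train_batch_size=6,
    gradient_accumulation_steps=20,   # effective batch size 120
    learning_rate=3e-6,               # conservative LR for the large effective batch
    warmup_steps=2000,                # longer warmup for stability
    adam_beta2=0.999,                 # higher beta2 for AdamW stability
    max_grad_norm=1.0,                # gradient clipping
    bf16=True,                        # A100-friendly mixed precision
    gradient_checkpointing=False,     # disabled for A100 efficiency
    group_by_length=True,             # batch similar-length sequences together
    dataloader_num_workers=8,
    dataloader_pin_memory=True,
    dataloader_prefetch_factor=2,
)
# Note: flash attention is enabled at model load time, not here (see Performance Tips below).
```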
## Expected Results
### Large Batch Configuration (1.3 passes)
- **Training Steps**: 8,000
- **Effective Batch Size**: 128
- **Steps per Epoch**: ~6,250
- **Epochs**: ~1.3
- **Expected Loss**: Should converge to ~1.5-2.0
### Multiple Passes Configuration (4 passes)
- **Training Steps**: 25,000
- **Effective Batch Size**: 120
- **Steps per Epoch**: ~6,667
- **Epochs**: ~3.75
- **Expected Loss**: Should converge to ~1.2-1.5
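The step and epoch figures above follow directly from the dataset size and the effective batch sizes; a quick sanity check:
```python
dataset_size = 800_000  # approximate OpenHermes-FR size

# Large batch configuration
steps_per_epoch_large = dataset_size / 128     # 6,250 steps per epoch
epochs_large = 8_000 / steps_per_epoch_large   # ~1.28 passes

# Multiple passes configuration
steps_per_epoch_multi = dataset_size / 120     # ~6,667 steps per epoch
epochs_multi = 25_000 / steps_per_epoch_multi  # ~3.75 passes
```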
## Monitoring and Logging
### Trackio Integration
Both configurations include Trackio monitoring:
- **Metrics Logging**: Every 25-50 steps
- **Artifact Logging**: Model checkpoints
- **Config Logging**: Training configuration
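Trackio exposes a wandb-style API, so metric logging from a training loop would look roughly like the following (a sketch assuming `trackio.init`/`log`/`finish`; the actual integration wired into these configs may differ):
```python
import trackio

trackio.init(project="smollm3_openhermes_fr")

# In a real run these values come from the training loop; dummy values shown here.
for step in range(0, 100, 25):  # log every 25 steps
    trackio.log({"train/loss": 2.0, "train/lr": 3e-6, "step": step})

trackio.finish()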
### Checkpoint Strategy
- **Large Batch**: Save every 1000 steps (8 checkpoints)
- **Multiple Passes**: Save every 2000 steps (12 checkpoints)
- **Best Model**: Automatically load best model at end
## Hardware Requirements
### Minimum Requirements
- **GPU**: A100 80GB (or multiple A100s)
- **RAM**: 64GB+ system RAM
- **Storage**: 100GB+ for checkpoints and logs
- **Network**: Fast internet for dataset download
### Recommended Setup
- **GPU**: 2-4x A100 80GB
- **RAM**: 128GB+ system RAM
- **Storage**: 500GB+ NVMe SSD
- **Network**: 10Gbps+ connection
## Troubleshooting
### Out of Memory (OOM)
If you encounter OOM errors:
1. Reduce `batch_size` from 8 to 6 or 4
2. Increase `gradient_accumulation_steps` to maintain effective batch size
3. Reduce `max_seq_length` from 8192 to 4096
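For example, a memory-reduced variant that keeps the effective batch size at 128 (field names are illustrative, matching the sketches above):
```python
# OOM fallback: smaller micro-batch and shorter sequences, same effective batch size
batch_size = 4
gradient_accumulation_steps = 32   # 4 * 32 = 128
max_seq_length = 4096              # halved from 8192
```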
### Slow Training
If training is too slow:
1. Increase `dataloader_num_workers` to 12-16
2. Ensure you're using bf16 mixed precision
3. Check that gradient checkpointing is disabled
4. Verify flash attention is enabled
### Convergence Issues
If loss doesn't converge:
1. Reduce learning rate by 2x
2. Increase warmup steps
3. Check gradient norms in logs
4. Verify dataset quality
## Customization
### For Different Dataset Sizes
Adjust `max_iters` based on your dataset size:
```python
# For 1M datapoints with an effective batch size of 120
desired_epochs = 4                    # e.g. four passes over the data
steps_per_epoch = 1_000_000 // 120    # ~8,333 steps per epoch
max_iters = steps_per_epoch * desired_epochs
```
### For Different GPU Memory
Adjust batch size and gradient accumulation:
```python
# For 40GB A100
batch_size = 4
gradient_accumulation_steps = 32 # Effective batch size = 128
# For 24GB GPU
batch_size = 2
gradient_accumulation_steps = 64 # Effective batch size = 128
```
## Performance Tips
1. **Use bf16**: Better than fp16 for A100
2. **Disable Gradient Checkpointing**: A100 has enough memory
3. **Use Flash Attention**: Memory efficient attention
4. **Group by Length**: Better batching efficiency
5. **Pin Memory**: Faster GPU transfers
6. **Multiple Workers**: Faster data loading
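If the model is loaded through `transformers` (an assumption; the model ID below is illustrative), tips 1-3 typically translate into the model-loading call:
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",               # illustrative model ID -- substitute your own
    torch_dtype=torch.bfloat16,               # tip 1: bf16 on A100
    attn_implementation="flash_attention_2",  # tip 3: requires the flash-attn package
)
# Tip 2: skip model.gradient_checkpointing_enable() -- an 80GB A100 has the headroom.
```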
## Expected Timeline
- **Large Batch**: 6-8 hours for 1.3 passes
- **Multiple Passes**: 20-24 hours for 4 passes
- **Full Dataset (5+ passes)**: 30+ hours
## Next Steps
After training completes:
1. Evaluate on validation set
2. Test generation quality
3. Push to Hugging Face Hub
4. Deploy for inference
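Step 3 can be done directly from the final checkpoint directory (the repository name below is a placeholder):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_dir = "./outputs/multiple_passes"      # final save location
repo_id = "your-username/smollm3-openhermes-fr"   # placeholder Hub repo name

model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```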
For deployment instructions, see `DEPLOYMENT_GUIDE.md`.