A100 Large Scale Training Guide
This guide provides configurations and instructions for running fully-fledged experiments with multiple passes on the full OpenHermes-FR dataset (800k+ datapoints) using A100 GPUs.
Available Configurations
1. A100 Large Batch Configuration
File: config/train_smollm3_openhermes_fr_a100_large.py
Key Features:
- Effective Batch Size: 128 (8 × 16 gradient accumulation)
- Training Duration: ~1.3 passes (8,000 steps)
- Learning Rate: 5e-6 (optimized for large batches)
- Mixed Precision: bf16 (A100 optimized)
- Sequence Length: 8192 tokens
- Memory Optimizations: No gradient checkpointing for A100 efficiency
Estimated Training Time: ~6-8 hours on A100
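The features above roughly correspond to a configuration like the following. This is only a sketch; the exact field names live in config/train_smollm3_openhermes_fr_a100_large.py and may differ.
# Hedged sketch of the large-batch settings; check the config file for the real field names.
config_large_batch = dict(
    batch_size=8,
    gradient_accumulation_steps=16,   # 8 x 16 = 128 effective batch size
    max_iters=8_000,                  # ~1.3 passes over 800k+ examples
    learning_rate=5e-6,
    max_seq_length=8192,
    mixed_precision="bf16",
    gradient_checkpointing=False,     # disabled for A100 efficiency
)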
2. Multiple Passes Configuration
File: config/train_smollm3_openhermes_fr_a100_multiple_passes.py
Key Features:
- Effective Batch Size: 120 (6 × 20 gradient accumulation)
- Training Duration: ~4 passes (25,000 steps)
- Learning Rate: 3e-6 (conservative for long training)
- Warmup Steps: 2000 (longer warmup for stability)
- Checkpoint Strategy: More frequent saves (every 2000 steps)
Estimated Training Time: ~20-24 hours on A100
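Relative to the large-batch sketch above, the multiple-passes run mainly changes the values below. Field names are again an assumption; config/train_smollm3_openhermes_fr_a100_multiple_passes.py is authoritative.
# Hedged sketch of the multiple-passes deltas.
config_multiple_passes = dict(
    batch_size=6,
    gradient_accumulation_steps=20,   # 6 x 20 = 120 effective batch size
    max_iters=25_000,                 # ~4 passes over the dataset
    learning_rate=3e-6,               # conservative for long training
    warmup_steps=2_000,
    save_steps=2_000,                 # more frequent checkpoints
)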
Training Commands
Quick Start - Large Batch Experiment
python run_a100_large_experiment.py \
--config config/train_smollm3_openhermes_fr_a100_large.py \
--experiment-name "smollm3_openhermes_fr_large_batch" \
--output-dir ./outputs/large_batch
Multiple Passes Experiment
python run_a100_large_experiment.py \
--config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
--experiment-name "smollm3_openhermes_fr_multiple_passes" \
--output-dir ./outputs/multiple_passes
Dry Run (Check Configuration)
python run_a100_large_experiment.py \
--config config/train_smollm3_openhermes_fr_a100_large.py \
--dry-run
Resume Training
python run_a100_large_experiment.py \
--config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
--resume ./outputs/multiple_passes/checkpoint-10000 \
--output-dir ./outputs/multiple_passes
Configuration Details
Memory Usage Optimization
- Gradient Checkpointing: Disabled for A100 efficiency
- Flash Attention: Enabled for memory efficiency
- bf16 Mixed Precision: Better for A100 than fp16
- Gradient Clipping: 1.0 for stability
- Group by Length: Enabled for better batching
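As an illustration of the memory settings above, a model can be loaded with bf16 and flash attention roughly as follows. The checkpoint name is a placeholder and flash-attn must be installed separately; treat this as a sketch, not the exact code in the training scripts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM3-3B"  # placeholder; use the checkpoint you fine-tune
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,               # bf16 is preferred over fp16 on A100
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
model.gradient_checkpointing_disable()        # A100 has enough memory to skip checkpointing
tokenizer = AutoTokenizer.from_pretrained(model_name)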
Data Loading Optimization
- Num Workers: 8 for faster data loading
- Pin Memory: Enabled for GPU transfer efficiency
- Prefetch Factor: 2 for pipeline optimization
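In plain PyTorch terms these options map onto the DataLoader roughly as below; the dataset here is a dummy stand-in for the tokenized OpenHermes-FR split.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the tokenized dataset (1,024 sequences of 8,192 token ids).
train_dataset = TensorDataset(torch.randint(0, 50_000, (1024, 8192)))

train_loader = DataLoader(
    train_dataset,
    batch_size=8,        # per-device batch size from the large-batch config
    shuffle=True,
    num_workers=8,       # parallel workers for faster data loading
    pin_memory=True,     # page-locked host memory for faster GPU transfers
    prefetch_factor=2,   # batches prefetched per worker
)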
Training Stability
- Conservative Learning Rate: Lower LR for large effective batch sizes
- Longer Warmup: More warmup steps for stability
- Higher Beta2: 0.999 for AdamW stability
- Gradient Clipping: Prevents gradient explosion
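A minimal sketch of these stability settings, assuming a model is already loaded (a tiny placeholder is used here so the snippet runs on its own):
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)   # placeholder; substitute the loaded SmolLM3 model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-6,                    # conservative LR for the large effective batch
    betas=(0.9, 0.999),         # higher beta2 for AdamW stability
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_000,     # longer warmup for stability
    num_training_steps=25_000,
)

# In the training loop, clip gradients after loss.backward():
#     torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)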
Expected Results
Large Batch Configuration (1.3 passes)
- Training Steps: 8,000
- Effective Batch Size: 128
- Steps per Epoch: ~6,250
- Epochs: ~1.3
- Expected Loss: Should converge to ~1.5-2.0
Multiple Passes Configuration (4 passes)
- Training Steps: 25,000
- Effective Batch Size: 120
- Steps per Epoch: ~6,667
- Epochs: ~3.75
- Expected Loss: Should converge to ~1.2-1.5
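The step and epoch counts above follow directly from the dataset size and the effective batch sizes:
# Dataset of roughly 800,000 examples.
dataset_size = 800_000

# Large batch: 8 per-device x 16 accumulation = 128 effective.
steps_per_epoch_large = dataset_size // 128    # 6,250
epochs_large = 8_000 / steps_per_epoch_large   # ~1.3 passes

# Multiple passes: 6 per-device x 20 accumulation = 120 effective.
steps_per_epoch_multi = dataset_size // 120    # ~6,667
epochs_multi = 25_000 / steps_per_epoch_multi  # ~3.75 passes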
Monitoring and Logging
Trackio Integration
Both configurations include Trackio monitoring:
- Metrics Logging: Every 25-50 steps
- Artifact Logging: Model checkpoints
- Config Logging: Training configuration
Checkpoint Strategy
- Large Batch: Save every 1000 steps (8 checkpoints)
- Multiple Passes: Save every 2000 steps (12 checkpoints)
- Best Model: Automatically load best model at end
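In Hugging Face TrainingArguments terms, the multiple-passes checkpoint schedule looks roughly like the following; this is a sketch, and best-model loading additionally requires evaluation to be enabled.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./outputs/multiple_passes",
    save_strategy="steps",
    save_steps=2_000,              # multiple-passes schedule (~12 checkpoints)
    save_total_limit=12,
    eval_strategy="steps",         # called evaluation_strategy on older transformers releases
    eval_steps=2_000,
    load_best_model_at_end=True,   # reload the best checkpoint when training ends
)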
Hardware Requirements
Minimum Requirements
- GPU: A100 80GB (or multiple A100s)
- RAM: 64GB+ system RAM
- Storage: 100GB+ for checkpoints and logs
- Network: Fast internet for dataset download
Recommended Setup
- GPU: 2-4x A100 80GB
- RAM: 128GB+ system RAM
- Storage: 500GB+ NVMe SSD
- Network: 10Gbps+ connection
Troubleshooting
Out of Memory (OOM)
If you encounter OOM errors:
- Reduce batch_size from 8 to 6 or 4
- Increase gradient_accumulation_steps to maintain the effective batch size (see the sketch below)
- Reduce max_seq_length from 8192 to 4096
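One way to apply these adjustments while keeping the effective batch size at 128 (the names mirror the config fields, which may differ in the actual files):
batch_size = 4                    # reduced from 8
gradient_accumulation_steps = 32  # raised from 16 so that 4 x 32 = 128
max_seq_length = 4096             # halve from 8192 only if OOM persists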
Slow Training
If training is too slow:
- Increase dataloader_num_workers to 12-16
- Ensure you're using bf16 mixed precision
- Check that gradient checkpointing is disabled
- Verify flash attention is enabled
Convergence Issues
If loss doesn't converge:
- Reduce learning rate by 2x
- Increase warmup steps
- Check gradient norms in logs
- Verify dataset quality
Customization
For Different Dataset Sizes
Adjust max_iters based on your dataset size:
# For 1M datapoints with effective batch size 120
effective_batch_size = 120
desired_epochs = 4                                   # choose how many passes you want
steps_per_epoch = 1_000_000 // effective_batch_size  # ~8,333 steps
max_iters = steps_per_epoch * desired_epochs         # 33,332 steps for 4 passes
For Different GPU Memory
Adjust batch size and gradient accumulation:
# For 40GB A100
batch_size = 4
gradient_accumulation_steps = 32 # Effective batch size = 128
# For 24GB GPU
batch_size = 2
gradient_accumulation_steps = 64 # Effective batch size = 128
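If you prefer to compute the accumulation steps rather than hard-code them, a small helper (not part of the shipped configs) does the arithmetic:
def accumulation_steps(target_effective_batch: int, per_device_batch: int) -> int:
    """Gradient accumulation steps needed to reach the target effective batch size."""
    if target_effective_batch % per_device_batch != 0:
        raise ValueError("target must be divisible by the per-device batch size")
    return target_effective_batch // per_device_batch

print(accumulation_steps(128, 4))  # 32 -> 40GB A100
print(accumulation_steps(128, 2))  # 64 -> 24GB GPU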
Performance Tips
- Use bf16: Better than fp16 for A100
- Disable Gradient Checkpointing: A100 has enough memory
- Use Flash Attention: Memory efficient attention
- Group by Length: Better batching efficiency
- Pin Memory: Faster GPU transfers
- Multiple Workers: Faster data loading
Expected Timeline
- Large Batch: 6-8 hours for 1.3 passes
- Multiple Passes: 20-24 hours for 4 passes
- Full Dataset (5+ passes): 30+ hours
Next Steps
After training completes:
- Evaluate on the validation set
- Test generation quality
- Push to the Hugging Face Hub
- Deploy for inference
For deployment instructions, see DEPLOYMENT_GUIDE.md.