# H100 Lightweight Training Configuration Guide
This guide explains the H100 Lightweight (Rapid) training configuration, optimized for rapid fine-tuning on H100 GPUs with a small, randomly sampled subset of the OpenHermes-FR dataset.
## Overview
The H100 Lightweight configuration is designed for:
- Rapid experimentation on H100 GPUs
- Efficient training on an 80K-sample random subset of OpenHermes-FR
- Quick iteration for research and development
- Cost-effective training sessions
## Key Features

### Optimized for H100
- Batch Size: 16 (larger than A100 configs)
- Gradient Accumulation: 4 (reduced for faster updates)
- Learning Rate: 8e-6 (slightly higher for rapid convergence)
- Sequence Length: 8192 (full context window)
### Dataset Sampling
- Source: OpenHermes-FR dataset
- Sample Size: 80,000 random samples
- Validation: 1,000 samples (if available)
- Reproducibility: Fixed random seed (42)
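
A minimal sketch of what this sampling step could look like with the `datasets` library (illustrative only; the project's data pipeline may implement it differently):

```python
from datasets import load_dataset

SEED = 42
TRAIN_SAMPLES = 80_000
EVAL_SAMPLES = 1_000

# Load the full OpenHermes-FR training split.
full = load_dataset("legmlai/openhermes-fr", split="train")

# Shuffle with a fixed seed so the subset is reproducible,
# then carve out 80K training and 1K validation examples.
full = full.shuffle(seed=SEED)
train_ds = full.select(range(TRAIN_SAMPLES))
eval_ds = full.select(range(TRAIN_SAMPLES, TRAIN_SAMPLES + EVAL_SAMPLES))

print(f"train={len(train_ds)} eval={len(eval_ds)}")
```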
### Training Optimizations
- Warmup Steps: 50 (reduced for rapid training)
- Evaluation: Every 50 steps
- Logging: Every 5 steps
- Saving: Every 200 steps
- Checkpoints: Keep only 2 (save storage)
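
Assuming the pipeline is built on the Hugging Face `Trainer` (an assumption; this guide does not state it explicitly), these intervals would map roughly onto `TrainingArguments` as sketched below:

```python
from transformers import TrainingArguments

# Illustrative mapping only; the project's config file is the source of truth.
args = TrainingArguments(
    output_dir="./outputs/h100-lightweight",  # hypothetical path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    learning_rate=8e-6,
    warmup_steps=50,
    num_train_epochs=1,
    eval_strategy="steps",   # "evaluation_strategy" on older transformers versions
    eval_steps=50,
    logging_steps=5,
    save_steps=200,
    save_total_limit=2,      # keep only the 2 most recent checkpoints
)
```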
## Configuration Details

### Model Configuration

```python
model_name="HuggingFaceTB/SmolLM3-3B"
max_seq_length=8192
use_flash_attention=True
use_gradient_checkpointing=True
```
### Training Parameters

```python
batch_size=16
gradient_accumulation_steps=4
learning_rate=8e-6
warmup_steps=50
max_epochs=1
```
### H100-Specific Optimizations

```python
dataloader_num_workers=4
dataloader_pin_memory=True
gradient_clipping=1.0
group_by_length=True
pad_to_multiple_of=8
```
### Memory Optimizations

```python
save_total_limit=2
early_stopping_patience=3
max_grad_norm=1.0
warmup_ratio=0.1
```
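
For orientation, the fields above could be grouped into a single dataclass roughly like the sketch below. This is a hypothetical reconstruction, not the actual contents of `config/train_smollm3_h100_lightweight.py`:

```python
from dataclasses import dataclass

@dataclass
class H100LightweightConfig:  # hypothetical class name
    # Model
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    max_seq_length: int = 8192
    use_flash_attention: bool = True
    use_gradient_checkpointing: bool = True
    # Training
    batch_size: int = 16
    gradient_accumulation_steps: int = 4
    learning_rate: float = 8e-6
    warmup_steps: int = 50
    max_epochs: int = 1
    # H100-specific
    dataloader_num_workers: int = 4
    dataloader_pin_memory: bool = True
    group_by_length: bool = True
    pad_to_multiple_of: int = 8
    # Memory / stability
    save_total_limit: int = 2
    early_stopping_patience: int = 3
    max_grad_norm: float = 1.0
    warmup_ratio: float = 0.1
```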
## Usage

### Interactive Selection

```bash
./launch.sh
# Select "H100 Lightweight (Rapid)" when prompted
```
### Expected Training Time
- H100: ~2-4 hours (depending on the exact setup)
- A100: ~4-6 hours
- V100: ~6-8 hours
### Memory Requirements
- GPU Memory: 40GB+ (H100 recommended)
- System RAM: 32GB+
- Storage: 50GB+ for dataset and checkpoints
## Performance Characteristics

### Training Speed
- Steps per Second: ~2-3 (on H100)
- Samples per Second: ~32-48
- Effective Batch Size: 64 (16 × 4)
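
A back-of-the-envelope check of where these numbers come from; here "steps" means per-device micro-batches, and optimizer updates happen once every 4 micro-batches:

```python
# Step math for the 80K-sample run (illustrative arithmetic).
samples = 80_000
micro_batch = 16
grad_accum = 4

effective_batch = micro_batch * grad_accum   # 64
micro_steps = samples // micro_batch         # 5000 forward/backward passes
optimizer_steps = micro_steps // grad_accum  # 1250 weight updates

print(effective_batch, micro_steps, optimizer_steps)  # 64 5000 1250
```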
### Convergence
- Expected Loss: 1.2-1.8 (after 1 epoch)
- Evaluation Frequency: Every 50 steps
- Early Stopping: After 3 evaluations without improvement
### Dataset Efficiency
- 80K samples: only a fraction of the full OpenHermes-FR corpus
- Random sampling: Ensures diversity
- Fixed seed: Reproducible results
## Use Cases

### Perfect For
- Rapid prototyping of new ideas
- Hyperparameter tuning experiments
- Model comparison studies
- Research validation before full training
- Educational purposes and learning
### Not Recommended For
- Production models (use Multiple Passes instead)
- Competition submissions (use full dataset)
- Research papers (use complete training)
## Comparison with Other Configurations

| Configuration | Dataset Size | Batch Size | Epochs | Training Time | Use Case |
|---|---|---|---|---|---|
| Basic Training | Full SmolTalk | 2 | 3 | 6-8 hours | Learning |
| H100 Lightweight | 80K Hermes-FR | 16 | 1 | 2-4 hours | Rapid experiments |
| A100 Large Scale | Full Hermes-FR | 8 | 1.3 | 8-12 hours | Serious research |
| Multiple Passes | Full Hermes-FR | 6 | 4 | 24-36 hours | Production |
## Customization

### Modifying Sample Size

```bash
# In the launch script, you can modify:
DATASET_SAMPLE_SIZE=50000   # For 50K samples
DATASET_SAMPLE_SIZE=100000  # For 100K samples
```
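
Assuming `launch.sh` exports `DATASET_SAMPLE_SIZE` as an environment variable (an assumption about the plumbing, not something documented here), the sampling code could pick it up like this:

```python
import os

# Hypothetical: read the sample size exported by launch.sh, defaulting to 80K.
sample_size = int(os.environ.get("DATASET_SAMPLE_SIZE", "80000"))
print(f"Sampling {sample_size} training examples")
```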
### Adjusting Training Parameters

```python
# Modify in config/train_smollm3_h100_lightweight.py:
batch_size=12        # Smaller batch size
learning_rate=6e-6   # Lower learning rate
warmup_steps=100     # More warmup steps
```
### Changing Dataset

```python
# Modify the dataset name in the configuration:
dataset_name="your-custom-dataset"
```
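
When swapping in a custom dataset, the same reproducible sampling can be reused, but column names and formatting usually need adapting; a hedged sketch:

```python
from datasets import load_dataset

# Hypothetical custom dataset ID; replace with your own Hub repo or local path.
custom = load_dataset("your-username/your-custom-dataset", split="train")

# Reuse the same seeded sampling as the default config.
custom = custom.shuffle(seed=42).select(range(min(80_000, len(custom))))
print(custom.column_names)  # check the fields match what the training code expects
```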
## Monitoring and Results

### Trackio Integration
- Real-time metrics: Loss, learning rate, gradient norm
- Training curves: Visual progress tracking
- Resource usage: GPU utilization, memory consumption
- Artifacts: Model checkpoints, logs
### Expected Metrics
- Training Loss: Starts ~3.0, ends ~1.5
- Validation Loss: Should be close to training loss
- Learning Rate: Cosine decay from 8e-6 to 2e-6
- Gradient Norm: Should stay below 1.0
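
For intuition, a cosine schedule with warmup and a floor can be written directly. This is an illustrative formula (the 1250-step horizon is an assumption based on the step math above), not necessarily the exact scheduler the trainer uses:

```python
import math

def lr_at(step, total_steps=1250, warmup_steps=50, peak_lr=8e-6, min_lr=2e-6):
    """Linear warmup, then cosine decay from peak_lr down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Peak after warmup, midpoint, and end of training:
print(lr_at(50), lr_at(650), lr_at(1250))  # ≈ 8e-06, 5e-06, 2e-06
```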
### Success Indicators
- Converging loss: Steady decrease over time
- Stable gradients: Consistent gradient norms
- Good validation: Validation loss follows training loss
- No overfitting: Validation loss doesn't increase
## Troubleshooting

### Common Issues

#### Out of Memory (OOM)

```python
# Reduce batch size in config:
batch_size=12                  # Instead of 16
gradient_accumulation_steps=6  # Instead of 4
```
#### Slow Training

```bash
# Check GPU utilization:
nvidia-smi

# Ensure CUDA is properly installed:
python -c "import torch; print(torch.cuda.is_available())"
```
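
Beyond `nvidia-smi`, a quick PyTorch-level sanity check (standard `torch.cuda` calls) can confirm the device and bf16 support:

```python
import torch

# Sanity-check the training device before a long run.
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))  # H100 reports (9, 0)
    print("bf16 supported:", torch.cuda.is_bf16_supported())
else:
    print("CUDA is not available; training would fall back to CPU and be extremely slow.")
```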
#### Poor Convergence

```python
# Try different learning rate:
learning_rate=6e-6   # Instead of 8e-6

# Or increase warmup:
warmup_steps=100     # Instead of 50
```
#### Dataset Issues

```bash
# Check dataset loading:
python -c "from datasets import load_dataset; print(len(load_dataset('legmlai/openhermes-fr')['train']))"
```
### Performance Tips
- Use H100 if available: Significantly faster than A100
- Monitor GPU memory: Keep utilization below 90%
- Check logs regularly: Look for convergence issues
- Save checkpoints: Don't lose progress
- Use early stopping: Prevent overfitting
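
To act on the "keep GPU memory below 90%" tip, a small helper like this (illustrative; standard `torch.cuda` APIs) can be called periodically during training:

```python
import torch

def gpu_memory_report(device: int = 0) -> None:
    """Print allocated/reserved memory as a fraction of total GPU memory."""
    total = torch.cuda.get_device_properties(device).total_memory
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    print(f"allocated {allocated / total:.1%}, reserved {reserved / total:.1%} of {total / 1e9:.0f} GB")

if torch.cuda.is_available():
    gpu_memory_report()
```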
## Example Workflow

### Complete H100 Lightweight Training

```bash
# 1. Setup
python setup_launch.py

# 2. Check requirements
python check_requirements.py

# 3. Run interactive pipeline
./launch.sh

# 4. Select configuration
#    Choose: "H100 Lightweight (Rapid)"

# 5. Monitor training
#    Watch Trackio Space for real-time progress

# 6. Check results
#    Model will be pushed to HF Hub
#    Summary in training_summary.md
```
### Expected Output

```text
Dataset prepared: 80000 train samples, 1000 validation samples
Training started with 5000 total steps
Estimated time: 2-4 hours
Monitor progress at: https://huggingface.co/spaces/...
```
## Benefits

### Speed
- 3-4x faster than full dataset training
- Rapid iteration for research
- Quick validation of ideas
### Efficiency
- Reduced costs (less GPU time)
- Lower storage requirements
- Faster experimentation cycle
### Quality
- Still produces high-quality results
- Good for prototyping
- Suitable for many use cases
## Future Enhancements

### Planned Improvements
- Adaptive sampling: Smart dataset selection
- Multi-GPU support: Distributed training
- Advanced monitoring: More detailed metrics
- Auto-tuning: Automatic hyperparameter optimization
### Extensibility
- Custom datasets: Easy integration
- Different models: Support for other architectures
- Advanced sampling: Stratified, balanced sampling
Happy Rapid Training on H100!