# H100 Lightweight Training Configuration Guide
This guide explains the new **H100 Lightweight (Rapid)** training configuration, optimized for rapid fine-tuning on H100 GPUs with a small, carefully selected dataset.
## 🎯 Overview
The H100 Lightweight configuration is designed for:
- **Rapid experimentation** on H100 GPUs
- **Efficient training** with 80K carefully selected samples
- **Quick iteration** for research and development
- **Cost-effective** training sessions
## 🚀 Key Features
### **Optimized for H100**
- **Batch Size**: 16 (larger than the A100 configs)
- **Gradient Accumulation**: 4 (reduced for faster updates)
- **Learning Rate**: 8e-6 (slightly higher for rapid convergence)
- **Sequence Length**: 8192 (full context window)
### **Dataset Sampling**
- **Source**: OpenHermes-FR dataset
- **Sample Size**: 80,000 random samples
- **Validation**: 1,000 samples (if available)
- **Reproducibility**: Fixed random seed (42); see the sampling sketch below
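A minimal sketch of this sampling with the `datasets` library (variable names are illustrative; the launch script may implement it differently):
```python
# Illustrative reproduction of the sampling described above.
from datasets import load_dataset

SEED = 42            # fixed seed for reproducibility
TRAIN_SIZE = 80_000  # training samples
VAL_SIZE = 1_000     # validation samples (if available)

full = load_dataset("legmlai/openhermes-fr", split="train")
shuffled = full.shuffle(seed=SEED)
train_ds = shuffled.select(range(TRAIN_SIZE))
val_ds = shuffled.select(range(TRAIN_SIZE, TRAIN_SIZE + VAL_SIZE))
```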
### **Training Optimizations**
- **Warmup Steps**: 50 (reduced for rapid training)
- **Evaluation**: Every 50 steps
- **Logging**: Every 5 steps
- **Saving**: Every 200 steps
- **Checkpoints**: Keep only 2 (to save storage)
## 📋 Configuration Details
### **Model Configuration**
```python
model_name="HuggingFaceTB/SmolLM3-3B"
max_seq_length=8192
use_flash_attention=True
use_gradient_checkpointing=True
```
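A hedged example of how these settings can map onto the `transformers` API (the actual training script may construct the model differently; bf16 is an assumption, chosen because it is well supported on H100):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,               # assumption: bf16 on H100
    attn_implementation="flash_attention_2",  # use_flash_attention=True
)
model.gradient_checkpointing_enable()         # use_gradient_checkpointing=True
```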
### **Training Parameters**
```python
batch_size=16
gradient_accumulation_steps=4
learning_rate=8e-6
warmup_steps=50
max_epochs=1
```
### **H100-Specific Optimizations**
```python
dataloader_num_workers=4
dataloader_pin_memory=True
gradient_clipping=1.0
group_by_length=True
pad_to_multiple_of=8
```
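In the Hugging Face stack, `pad_to_multiple_of` usually takes effect in the data collator rather than in the trainer arguments; a sketch with the standard causal-LM collator (assuming the tokenizer from the snippet above):
```python
from transformers import DataCollatorForLanguageModeling

# Padding to a multiple of 8 keeps tensor shapes friendly to Tensor Cores.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,   # tokenizer loaded earlier
    mlm=False,             # causal LM fine-tuning, not masked LM
    pad_to_multiple_of=8,
)
```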
### **Memory and Stability Settings**
```python
save_total_limit=2
early_stopping_patience=3
max_grad_norm=1.0
warmup_ratio=0.1
```
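One plausible mapping of the values in this guide onto `transformers.TrainingArguments` (the real config file may wire things differently; `output_dir` is a placeholder). Note that in `TrainingArguments`, `warmup_steps` takes precedence over `warmup_ratio`, so only the former is set here:
```python
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="./outputs",            # placeholder path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    learning_rate=8e-6,
    warmup_steps=50,
    num_train_epochs=1,
    max_grad_norm=1.0,                 # gradient_clipping=1.0
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    group_by_length=True,
    eval_strategy="steps",             # "evaluation_strategy" in older versions
    eval_steps=50,
    logging_steps=5,
    save_steps=200,
    save_total_limit=2,                # keep only 2 checkpoints
    load_best_model_at_end=True,       # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

# early_stopping_patience=3 maps to a Trainer callback:
callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
```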
## 🔧 Usage
### **Interactive Selection**
```bash
./launch.sh
# Select "H100 Lightweight (Rapid)" when prompted
```
### **Expected Training Time**
- **H100**: ~2-4 hours
- **A100**: ~4-6 hours
- **V100**: ~6-8 hours
### **Memory Requirements**
- **GPU Memory**: 40GB+ (H100 recommended)
- **System RAM**: 32GB+
- **Storage**: 50GB+ for dataset and checkpoints
## 📊 Performance Characteristics
### **Training Speed**
- **Steps per Second**: ~2-3 (on H100)
- **Samples per Second**: ~32-48
- **Effective Batch Size**: 64 (16 × 4; see the quick check below)
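A quick back-of-the-envelope check of these numbers (it also explains the "5000 total steps" in the example output later in this guide, which appears to count per-device micro-batches rather than optimizer steps):
```python
samples = 80_000
per_device_batch = 16
grad_accum = 4

effective_batch = per_device_batch * grad_accum  # 16 × 4 = 64
optimizer_steps = samples // effective_batch     # 1250 per epoch
micro_batches = samples // per_device_batch      # 5000 per epoch
```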
### **Convergence**
- **Expected Loss**: 1.2-1.8 (after 1 epoch)
- **Evaluation Frequency**: Every 50 steps
- **Early Stopping**: After 3 evaluations without improvement
### **Dataset Efficiency**
- **80K samples**: ~10% of the full OpenHermes-FR corpus
- **Random sampling**: Ensures diversity
- **Fixed seed**: Reproducible results
## 🎯 Use Cases
### **Perfect For**
- **Rapid prototyping** of new ideas
- **Hyperparameter tuning** experiments
- **Model comparison** studies
- **Research validation** before full training
- **Educational purposes** and learning
### **Not Recommended For**
- **Production models** (use Multiple Passes instead)
- **Competition submissions** (use the full dataset)
- **Research papers** (use complete training)
## 📈 Comparison with Other Configurations
| Configuration | Dataset Size | Batch Size | Epochs | Training Time | Use Case |
|---------------|--------------|------------|--------|---------------|----------|
| **Basic Training** | Full SmolTalk | 2 | 3 | 6-8 hours | Learning |
| **H100 Lightweight** | 80K Hermes-FR | 16 | 1 | 2-4 hours | Rapid experiments |
| **A100 Large Scale** | Full Hermes-FR | 8 | 1.3 | 8-12 hours | Serious research |
| **Multiple Passes** | Full Hermes-FR | 6 | 4 | 24-36 hours | Production |
## 🛠️ Customization
### **Modifying Sample Size**
```bash
# In the launch script, you can modify:
DATASET_SAMPLE_SIZE=50000   # For 50K samples
DATASET_SAMPLE_SIZE=100000  # For 100K samples
```
### **Adjusting Training Parameters**
```python
# Modify in config/train_smollm3_h100_lightweight.py:
batch_size=12        # Smaller batch size
learning_rate=6e-6   # Lower learning rate
warmup_steps=100     # More warmup steps
```
### **Changing Dataset**
```python
# Modify the dataset name in the configuration:
dataset_name="your-custom-dataset"
```
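Before launching a run on a custom dataset, it is worth confirming that it loads and exposes the columns your formatting code expects; a small hypothetical check (`your-custom-dataset` is a placeholder):
```python
from datasets import load_dataset

# Placeholder name; substitute your own dataset repo ID.
ds = load_dataset("your-custom-dataset", split="train")
print(ds.column_names)  # verify prompt/response-style columns exist
print(ds[0])            # inspect one example before training
```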
## 📊 Monitoring and Results
### **Trackio Integration**
- **Real-time metrics**: Loss, learning rate, gradient norm
- **Training curves**: Visual progress tracking
- **Resource usage**: GPU utilization, memory consumption
- **Artifacts**: Model checkpoints, logs
### **Expected Metrics**
- **Training Loss**: Starts around 3.0, ends around 1.5
- **Validation Loss**: Should stay close to the training loss
- **Learning Rate**: Cosine decay from 8e-6 to 2e-6 (sketched below)
- **Gradient Norm**: Should stay below 1.0
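A standard cosine schedule decays to zero, so a decay that bottoms out at 2e-6 implies a floor on the schedule. A sketch of such a schedule, with the floor inferred from the metrics above rather than taken from the config:
```python
import math

def lr_at(step, total_steps, warmup_steps=50, peak=8e-6, floor=2e-6):
    """Linear warmup to `peak`, then cosine decay down to `floor`."""
    if step < warmup_steps:
        return peak * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

print(lr_at(0, 1250))     # 0.0 (start of warmup)
print(lr_at(50, 1250))    # 8e-06 (peak after warmup)
print(lr_at(1250, 1250))  # 2e-06 (floor at the end)
```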
### **Success Indicators**
- **Converging loss**: Steady decrease over time
- **Stable gradients**: Consistent gradient norms
- **Good validation**: Validation loss follows training loss
- **No overfitting**: Validation loss doesn't increase
## 🚨 Troubleshooting
### **Common Issues**
#### **Out of Memory (OOM)**
```python
# Reduce the per-device batch size in the config:
batch_size=12                   # Instead of 16
gradient_accumulation_steps=6   # Instead of 4 (keeps the effective batch similar: 12 × 6 = 72)
```
#### **Slow Training**
```bash
# Check GPU utilization:
nvidia-smi
# Ensure CUDA is properly installed:
python -c "import torch; print(torch.cuda.is_available())"
```
#### **Poor Convergence**
```python
# Try a lower learning rate:
learning_rate=6e-6   # Instead of 8e-6
# Or increase warmup:
warmup_steps=100     # Instead of 50
```
#### **Dataset Issues**
```bash
# Check dataset loading:
python -c "from datasets import load_dataset; print(len(load_dataset('legmlai/openhermes-fr')['train']))"
```
### **Performance Tips**
1. **Use an H100 if available**: Significantly faster than an A100
2. **Monitor GPU memory**: Keep utilization below 90%
3. **Check logs regularly**: Look for convergence issues
4. **Save checkpoints**: Don't lose progress
5. **Use early stopping**: Prevent overfitting
## 📝 Example Workflow
### **Complete H100 Lightweight Training**
```bash
# 1. Setup
python setup_launch.py
# 2. Check requirements
python check_requirements.py
# 3. Run the interactive pipeline
./launch.sh
# 4. Select the configuration
#    Choose: "H100 Lightweight (Rapid)"
# 5. Monitor training
#    Watch the Trackio Space for real-time progress
# 6. Check results
#    The model is pushed to the HF Hub;
#    a summary is written to training_summary.md
```
### **Expected Output**
```
✅ Dataset prepared: 80000 train samples, 1000 validation samples
🚀 Training started with 5000 total steps
⏱️ Estimated time: 2-4 hours
📊 Monitor progress at: https://huggingface.co/spaces/...
```
## 🎉 Benefits
### **Speed**
- **3-4x faster** than full-dataset training
- **Rapid iteration** for research
- **Quick validation** of ideas
### **Efficiency**
- **Reduced costs** (less GPU time)
- **Lower storage** requirements
- **Faster experimentation** cycle
### **Quality**
- **Still high-quality** results
- **Good for prototyping**
- **Suitable for many use cases**
## 🔮 Future Enhancements
### **Planned Improvements**
- **Adaptive sampling**: Smart dataset selection
- **Multi-GPU support**: Distributed training
- **Advanced monitoring**: More detailed metrics
- **Auto-tuning**: Automatic hyperparameter optimization
### **Extensibility**
- **Custom datasets**: Easy integration
- **Different models**: Support for other architectures
- **Advanced sampling**: Stratified, balanced sampling
---
**Happy Rapid Training on H100! 🚀**