H100 Lightweight Training Configuration Guide

This guide explains the H100 Lightweight (Rapid) training configuration, designed for fast fine-tuning on H100 GPUs with a small, carefully selected dataset.

🎯 Overview

The H100 Lightweight configuration is designed for:

  • Rapid experimentation on H100 GPUs
  • Efficient training with 80K carefully selected samples
  • Quick iteration for research and development
  • Cost-effective training sessions

🚀 Key Features

Optimized for H100

  • Batch Size: 16 (larger than A100 configs)
  • Gradient Accumulation: 4 (reduced for faster updates)
  • Learning Rate: 8e-6 (slightly higher for rapid convergence)
  • Sequence Length: 8192 tokens

Dataset Sampling

  • Source: OpenHermes-FR dataset
  • Sample Size: 80,000 random samples
  • Validation: 1,000 samples (if available)
  • Reproducibility: Fixed random seed (42)
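
In code, this sampling step amounts to a shuffle-and-select over the dataset. A minimal sketch of the idea, assuming the datasets library (the actual launch script may differ in details, but the seed and split sizes mirror the values above):

from datasets import load_dataset

# Load the full OpenHermes-FR training split
dataset = load_dataset("legmlai/openhermes-fr", split="train")

# Shuffle with the fixed seed for reproducibility, then carve out 80K train + 1K validation samples
dataset = dataset.shuffle(seed=42)
train_ds = dataset.select(range(80_000))
eval_ds = dataset.select(range(80_000, 81_000))

print(f"Train: {len(train_ds)}, Validation: {len(eval_ds)}")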

Training Optimizations

  • Warmup Steps: 50 (reduced for rapid training)
  • Evaluation: Every 50 steps
  • Logging: Every 5 steps
  • Saving: Every 200 steps
  • Checkpoints: Keep only 2 (save storage)

📊 Configuration Details

Model Configuration

model_name="HuggingFaceTB/SmolLM3-3B"
max_seq_length=8192
use_flash_attention=True
use_gradient_checkpointing=True
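
A hedged sketch of how these settings typically map onto a transformers model load (standard transformers/PyTorch calls; the repository's own loader may wire this up differently, and bf16 is an assumption rather than a setting listed above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,               # assumed dtype; bf16 is the usual choice on H100
    attn_implementation="flash_attention_2",  # use_flash_attention=True
)
model.gradient_checkpointing_enable()         # use_gradient_checkpointing=True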

Training Parameters

batch_size=16
gradient_accumulation_steps=4
learning_rate=8e-6
warmup_steps=50
max_epochs=1

H100-Specific Optimizations

dataloader_num_workers=4
dataloader_pin_memory=True
gradient_clipping=1.0
group_by_length=True
pad_to_multiple_of=8

Checkpointing and Regularization

save_total_limit=2
early_stopping_patience=3
max_grad_norm=1.0
warmup_ratio=0.1
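
Taken together, these parameters would live in config/train_smollm3_h100_lightweight.py. A rough, hypothetical sketch of such a config object (the class name is illustrative and the real file may structure things differently; field names and values follow the lists above):

from dataclasses import dataclass

@dataclass
class SmolLM3H100LightweightConfig:
    # Model
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    max_seq_length: int = 8192
    use_flash_attention: bool = True
    use_gradient_checkpointing: bool = True
    # Training
    batch_size: int = 16
    gradient_accumulation_steps: int = 4
    learning_rate: float = 8e-6
    warmup_steps: int = 50
    max_epochs: int = 1
    # H100-specific
    dataloader_num_workers: int = 4
    dataloader_pin_memory: bool = True
    gradient_clipping: float = 1.0
    group_by_length: bool = True
    pad_to_multiple_of: int = 8
    # Checkpointing and regularization
    save_total_limit: int = 2
    early_stopping_patience: int = 3
    max_grad_norm: float = 1.0
    warmup_ratio: float = 0.1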

🔧 Usage

Interactive Selection

./launch.sh
# Select "H100 Lightweight (Rapid)" when prompted

Expected Training Time

  • H100: ~2-4 hours (depending on the exact setup)
  • A100: ~4-6 hours
  • V100: ~6-8 hours

Memory Requirements

  • GPU Memory: 40GB+ (H100 recommended)
  • System RAM: 32GB+
  • Storage: 50GB+ for dataset and checkpoints

📈 Performance Characteristics

Training Speed

  • Steps per Second: ~2-3 (on H100)
  • Samples per Second: ~32-48
  • Effective Batch Size: 64 (16 × 4)

Convergence

  • Expected Loss: 1.2-1.8 (after 1 epoch)
  • Evaluation Frequency: Every 50 steps
  • Early Stopping: After 3 evaluations without improvement
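
If the training loop is built on the Hugging Face Trainer, the evaluation and early-stopping behaviour above maps onto the stock EarlyStoppingCallback. A minimal sketch under that assumption (the repository's trainer may be custom):

from transformers import EarlyStoppingCallback, TrainingArguments

# Stop once eval loss fails to improve for 3 consecutive evaluations (run every 50 steps)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)

args = TrainingArguments(
    output_dir="smollm3-h100-lightweight",
    eval_strategy="steps",          # "evaluation_strategy" on older transformers versions
    eval_steps=50,
    logging_steps=5,
    save_steps=200,
    save_total_limit=2,
    load_best_model_at_end=True,    # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
# trainer = Trainer(..., args=args, callbacks=[early_stopping])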

Dataset Efficiency

  • 80K samples: a small fraction of the full OpenHermes-FR dataset
  • Random sampling: Ensures diversity
  • Fixed seed: Reproducible results

🎯 Use Cases

Perfect For

  • Rapid prototyping of new ideas
  • Hyperparameter tuning experiments
  • Model comparison studies
  • Research validation before full training
  • Educational purposes and learning

Not Recommended For

  • Production models (use Multiple Passes instead)
  • Competition submissions (use full dataset)
  • Research papers (use complete training)

🔄 Comparison with Other Configurations

| Configuration | Dataset Size | Batch Size | Epochs | Training Time | Use Case |
|---|---|---|---|---|---|
| Basic Training | Full SmolTalk | 2 | 3 | 6-8 hours | Learning |
| H100 Lightweight | 80K Hermes-FR | 16 | 1 | 2-4 hours | Rapid experiments |
| A100 Large Scale | Full Hermes-FR | 8 | 1.3 | 8-12 hours | Serious research |
| Multiple Passes | Full Hermes-FR | 6 | 4 | 24-36 hours | Production |

🛠️ Customization

Modifying Sample Size

# In the launch script, you can modify:
DATASET_SAMPLE_SIZE=50000  # For 50K samples
DATASET_SAMPLE_SIZE=100000 # For 100K samples

Adjusting Training Parameters

# Modify in config/train_smollm3_h100_lightweight.py:
batch_size=12              # Smaller batch size
learning_rate=6e-6         # Lower learning rate
warmup_steps=100          # More warmup steps

Changing Dataset

# Modify the dataset name in the configuration:
dataset_name="your-custom-dataset"
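
Before pointing the config at a custom dataset, it is worth checking that it loads and exposes the same columns as OpenHermes-FR, since the data preparation code was written against that schema. A quick hedged check (streaming avoids downloading the full datasets):

from datasets import load_dataset

# Compare the custom dataset's schema against the default one
reference = load_dataset("legmlai/openhermes-fr", split="train", streaming=True)
custom = load_dataset("your-custom-dataset", split="train", streaming=True)

print("Reference columns:", sorted(next(iter(reference)).keys()))
print("Custom columns:   ", sorted(next(iter(custom)).keys()))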

📊 Monitoring and Results

Trackio Integration

  • Real-time metrics: Loss, learning rate, gradient norm
  • Training curves: Visual progress tracking
  • Resource usage: GPU utilization, memory consumption
  • Artifacts: Model checkpoints, logs

Expected Metrics

  • Training Loss: Starts ~3.0, ends ~1.5
  • Validation Loss: Should be close to training loss
  • Learning Rate: Cosine decay from 8e-6 to 2e-6
  • Gradient Norm: Should stay below 1.0
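
The quoted learning-rate curve corresponds to a cosine decay with a floor at 2e-6 rather than decaying all the way to zero. A small sketch purely to illustrate that shape (the total step count is illustrative, and the actual scheduler implementation may differ):

import math

peak_lr, min_lr = 8e-6, 2e-6
warmup_steps, total_steps = 50, 5_000

def lr_at(step):
    if step < warmup_steps:                               # linear warmup to the peak
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))     # decays 1 -> 0
    return min_lr + (peak_lr - min_lr) * cosine           # 8e-6 -> 2e-6

print(lr_at(warmup_steps), lr_at(total_steps))            # ~8e-6 after warmup, 2e-6 at the end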

Success Indicators

  • Converging loss: Steady decrease over time
  • Stable gradients: Consistent gradient norms
  • Good validation: Validation loss follows training loss
  • No overfitting: Validation loss doesn't increase

🚨 Troubleshooting

Common Issues

Out of Memory (OOM)

# Reduce the per-device batch size (and raise gradient accumulation to keep the effective batch size similar):
batch_size=12  # Instead of 16
gradient_accumulation_steps=6  # Instead of 4

Slow Training

# Check GPU utilization:
nvidia-smi
# Ensure CUDA is properly installed
python -c "import torch; print(torch.cuda.is_available())"

Poor Convergence

# Try different learning rate:
learning_rate=6e-6  # Instead of 8e-6
# Or increase warmup:
warmup_steps=100   # Instead of 50

Dataset Issues

# Check dataset loading:
python -c "from datasets import load_dataset; print(len(load_dataset('legmlai/openhermes-fr')['train']))"

Performance Tips

  1. Use H100 if available: Significantly faster than A100
  2. Monitor GPU memory: Keep utilization below 90%
  3. Check logs regularly: Look for convergence issues
  4. Save checkpoints: Don't lose progress
  5. Use early stopping: Prevent overfitting
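
For tip 2, a quick way to check GPU memory pressure from Python (torch.cuda.mem_get_info reports device-wide free/total memory in bytes):

import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    used_fraction = 1 - free / total
    print(f"GPU memory in use: {used_fraction:.1%} of {total / 1e9:.0f} GB")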

📋 Example Workflow

Complete H100 Lightweight Training

# 1. Setup
python setup_launch.py

# 2. Check requirements
python check_requirements.py

# 3. Run interactive pipeline
./launch.sh

# 4. Select configuration
# Choose: "H100 Lightweight (Rapid)"

# 5. Monitor training
# Watch Trackio Space for real-time progress

# 6. Check results
# Model will be pushed to HF Hub
# Summary in training_summary.md

Expected Output

✅ Dataset prepared: 80000 train samples, 1000 validation samples
📈 Training started with 5000 total steps
⏱️ Estimated time: 2-4 hours
📊 Monitor progress at: https://huggingface.co/spaces/...

🎉 Benefits

Speed

  • 3-4x faster than full dataset training
  • Rapid iteration for research
  • Quick validation of ideas

Efficiency

  • Reduced costs (less GPU time)
  • Lower storage requirements
  • Faster experimentation cycle

Quality

  • Still produces high-quality results
  • Good for prototyping
  • Suitable for many use cases

🔮 Future Enhancements

Planned Improvements

  • Adaptive sampling: Smart dataset selection
  • Multi-GPU support: Distributed training
  • Advanced monitoring: More detailed metrics
  • Auto-tuning: Automatic hyperparameter optimization

Extensibility

  • Custom datasets: Easy integration
  • Different models: Support for other architectures
  • Advanced sampling: Stratified, balanced sampling

Happy Rapid Training on H100! 🚀