# Cloud Training Guide for OpenHermes-FR Dataset
This guide provides step-by-step instructions for training SmolLM3 models on cloud instances using the [legmlai/openhermes-fr](https://huggingface.co/datasets/legmlai/openhermes-fr) dataset.
## Overview
The OpenHermes-FR dataset contains 799,875 French instruction-response pairs, perfect for fine-tuning SmolLM3 models for French language tasks. This guide covers:
- **Cloud Instance Setup** - Complete environment configuration
- **Dataset Integration** - Automatic loading and filtering
- **Training Configuration** - Optimized for French instruction tuning
- **Monitoring Integration** - Trackio experiment tracking
- **Model Deployment** - Push to Hugging Face Hub
## Dataset Information
### Schema
```json
{
"prompt": "Explique la différence entre la photosynthèse C3 et C4.",
"accepted_completion": "La photosynthèse C3 utilise… (réponse détaillée)",
"bad_prompt_detected": false,
"bad_response_detected": false,
"bad_entry": false
}
```
### Key Features
- **Size**: 799,875 examples (~1.4GB)
- **Language**: 100% French
- **Quality**: GPT-4o generated responses with automatic filtering
- **License**: ODC-BY 1.0
## Cloud Instance Setup
### 1. Choose Your Cloud Provider
#### **AWS EC2 (Recommended)**
```bash
# Launch instance with GPU
# Recommended: g4dn.xlarge or g5.xlarge
# AMI: Deep Learning AMI (Ubuntu 20.04)
```
#### **Google Cloud Platform**
```bash
# Launch instance with GPU
# Recommended: n1-standard-4 with Tesla T4 or V100
```
#### **Azure**
```bash
# Launch instance with GPU
# Recommended: Standard_NC6s_v3 or Standard_NC12s_v3
```
### 2. Instance Specifications
#### **Minimum Requirements**
- **GPU**: 16GB+ VRAM (Tesla T4, V100, or A100)
- **RAM**: 32GB+ system memory
- **Storage**: 100GB+ SSD
- **CPU**: 8+ cores
#### **Recommended Specifications**
- **GPU**: A100 (40GB) or H100 (80GB)
- **RAM**: 64GB+ system memory
- **Storage**: 200GB+ NVMe SSD
- **CPU**: 16+ cores
### 3. Environment Setup
```bash
# Update system
sudo apt update && sudo apt upgrade -y
# Install CUDA (if not pre-installed)
# Follow NVIDIA CUDA installation guide for your GPU
# Install Python dependencies
sudo apt install python3-pip python3-venv git -y
# Create virtual environment
python3 -m venv smollm3_env
source smollm3_env/bin/activate
# Clone repository
git clone <your-repo-url>
cd <your-repo-directory>
# Install dependencies
pip install -r requirements.txt
# Install additional dependencies for cloud training
pip install accelerate transformers datasets huggingface_hub
```
## Training Configuration
### 1. Use the OpenHermes-FR Config
The repository includes a specialized configuration for the OpenHermes-FR dataset:
```bash
python train.py config/train_smollm3_openhermes_fr.py \
--enable_tracking \
--trackio_url "https://your-space.hf.space" \
--experiment_name "smollm3_fr_openhermes_v1"
```
### 2. Configuration Details
The `config/train_smollm3_openhermes_fr.py` includes:
#### **Dataset Configuration**
```python
dataset_name: str = "legmlai/openhermes-fr"
dataset_split: str = "train"
input_field: str = "prompt"
target_field: str = "accepted_completion"
filter_bad_entries: bool = True
bad_entry_field: str = "bad_entry"
```
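The filtering and formatting these fields describe can be sketched as plain functions; this is an illustrative mirror of the config behavior, not the repository's actual helpers:

```python
# Illustrative field names, mirroring the config values above
INPUT_FIELD = "prompt"
TARGET_FIELD = "accepted_completion"
BAD_ENTRY_FIELD = "bad_entry"

def keep_example(row: dict) -> bool:
    """Mirror of filter_bad_entries: drop rows flagged bad_entry=True."""
    return not row[BAD_ENTRY_FIELD]

def format_example(row: dict) -> dict:
    """Join prompt and completion into one instruction-tuning text."""
    return {"text": f"{row[INPUT_FIELD]}\n\n{row[TARGET_FIELD]}"}

rows = [
    {"prompt": "Bonjour?", "accepted_completion": "Salut!", "bad_entry": False},
    {"prompt": "???", "accepted_completion": "", "bad_entry": True},
]
kept = [format_example(r) for r in rows if keep_example(r)]
print(len(kept))  # 1
```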
#### **Training Optimization**
```python
batch_size: int = 2 # Reduced for French text (longer sequences)
gradient_accumulation_steps: int = 8 # Maintains effective batch size
learning_rate: float = 1e-5 # Lower for instruction tuning
max_iters: int = 2000 # More iterations for large dataset
```
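The comment about maintaining the effective batch size follows from simple arithmetic: the optimizer steps once per `gradient_accumulation_steps` micro-batches, so the effective batch size is the product of the two values (times the GPU count under data parallelism):

```python
def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         num_gpus: int = 1) -> int:
    """Number of examples seen per optimizer step: gradients are
    accumulated over grad_accum_steps micro-batches before stepping."""
    return per_device_batch * grad_accum_steps * num_gpus

# Config above: 2 * 8 = 16 examples per optimizer step on one GPU
print(effective_batch_size(2, 8))  # 16
```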
#### **Monitoring Integration**
```python
enable_tracking: bool = True
experiment_name: str = "smollm3_openhermes_fr"
```
## Training Commands
### Basic Training
```bash
python train.py config/train_smollm3_openhermes_fr.py
```
### Training with Monitoring
```bash
python train.py config/train_smollm3_openhermes_fr.py \
--enable_tracking \
--trackio_url "https://your-trackio-space.hf.space" \
--experiment_name "smollm3_fr_openhermes_v1"
```
### Training with Custom Parameters
```bash
python train.py config/train_smollm3_openhermes_fr.py \
--batch_size 4 \
--learning_rate 2e-5 \
--max_iters 3000 \
--enable_tracking \
--trackio_url "https://your-trackio-space.hf.space" \
--experiment_name "smollm3_fr_high_lr"
```
### Training with Checkpoint Resume
```bash
python train.py config/train_smollm3_openhermes_fr.py \
--init_from resume \
--enable_tracking \
--trackio_url "https://your-trackio-space.hf.space" \
--experiment_name "smollm3_fr_resume"
```
## Dataset Processing
### Automatic Filtering
The training script automatically:
- **Loads** the OpenHermes-FR dataset from Hugging Face
- **Filters** out bad entries (`bad_entry = true`)
- **Splits** data into train/validation/test (98/1/1)
- **Formats** prompts and completions for instruction tuning
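The 98/1/1 split above can be sketched as index partitioning; this is a simplified stand-in for what the training script does (with the `datasets` library the same result would come from chained `train_test_split` calls):

```python
import random

def split_indices(n: int, seed: int = 42):
    """Shuffle indices and carve out 98% train, 1% validation, 1% test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = n // 100   # 1% validation
    n_test = n // 100  # 1% test
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

train, val, test = split_indices(799_875)
print(len(train), len(val), len(test))
```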
### Manual Dataset Inspection
```python
from datasets import load_dataset
# Load dataset
dataset = load_dataset("legmlai/openhermes-fr")
# Check dataset info
print(f"Dataset size: {len(dataset['train'])}")
print(f"Sample columns: {dataset['train'].column_names}")
# Check filtering
bad_entries = dataset['train'].filter(lambda x: x['bad_entry'])
print(f"Bad entries: {len(bad_entries)}")
# Sample data
sample = dataset['train'][0]
print(f"Prompt: {sample['prompt']}")
print(f"Completion: {sample['accepted_completion']}")
```
## Monitoring and Tracking
### Trackio Integration
The training automatically logs:
- **Training metrics**: Loss, accuracy, learning rate
- **System metrics**: GPU memory, CPU usage
- **Dataset info**: Size, filtering statistics
- **Model checkpoints**: Regular saves with metadata
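The per-step values listed above might be bundled into a payload like the following before being sent to the tracker (field names here are illustrative; the actual Trackio schema may differ):

```python
import time

def build_metrics_payload(step: int, loss: float, lr: float,
                          gpu_mem_gb: float) -> dict:
    """Bundle one training step's metrics for the experiment tracker.
    Key names are hypothetical, chosen to group related metrics."""
    return {
        "step": step,
        "timestamp": time.time(),
        "train/loss": loss,
        "train/learning_rate": lr,
        "system/gpu_memory_gb": gpu_mem_gb,
    }

payload = build_metrics_payload(100, 1.87, 1e-5, 12.4)
```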
### View Training Progress
1. **Trackio Space**: Visit your Trackio Space URL
2. **Experiment Details**: Check the "View Experiments" tab
3. **Metrics**: Monitor loss curves and system usage
4. **Logs**: Download training logs for analysis
## Model Deployment
### Push to Hugging Face Hub
After training, deploy your model:
```bash
python push_to_huggingface.py /output-checkpoint username/smollm3-fr-openhermes \
--trackio-url "https://your-trackio-space.hf.space" \
--experiment-name "smollm3_fr_openhermes_v1"
```
### Use Your Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load your fine-tuned model
model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-openhermes")
tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-openhermes")
# Generate French text
prompt = "Expliquez le concept de l'intelligence artificielle."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Performance Optimization
### GPU Memory Management
```bash
# Monitor GPU usage
nvidia-smi -l 1
# Optimize for your GPU
# For 16GB VRAM: batch_size=2, gradient_accumulation_steps=8
# For 24GB VRAM: batch_size=4, gradient_accumulation_steps=4
# For 40GB+ VRAM: batch_size=8, gradient_accumulation_steps=2
```
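The VRAM-to-batch-size mapping above can be captured in a small helper; the thresholds are the rough guidelines from the comments, not measured limits, and each pairing keeps the effective batch size at 16:

```python
def suggest_batch_config(vram_gb: float) -> tuple[int, int]:
    """Return (batch_size, gradient_accumulation_steps) per the
    rough VRAM guidelines above; product stays at 16."""
    if vram_gb >= 40:
        return 8, 2
    if vram_gb >= 24:
        return 4, 4
    return 2, 8

print(suggest_batch_config(16))  # (2, 8)
print(suggest_batch_config(40))  # (8, 2)
```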
### Training Speed
```python
# Use mixed precision (enabled by default)
fp16: bool = True
# Enable gradient checkpointing (enabled by default)
use_gradient_checkpointing: bool = True
# Use flash attention (enabled by default)
use_flash_attention: bool = True
```
## Troubleshooting
### Common Issues
#### 1. **Out of Memory (OOM)**
```bash
# Reduce batch size
python train.py config/train_smollm3_openhermes_fr.py --batch_size 1
# Increase gradient accumulation
# Edit config: gradient_accumulation_steps = 16
```
#### 2. **Slow Training**
```bash
# Check GPU utilization
nvidia-smi
# Verify data loading
# Check if dataset is cached locally
```
#### 3. **Dataset Loading Issues**
```bash
# Clear cache
rm -rf ~/.cache/huggingface/
# Check internet connection
# Verify dataset name: "legmlai/openhermes-fr"
```
#### 4. **Monitoring Connection Issues**
```bash
# Test Trackio connection
curl -I https://your-trackio-space.hf.space
# Check token permissions
# Verify experiment name format
```
### Debug Mode
```bash
# Enable debug logging
export LOG_LEVEL=DEBUG
python train.py config/train_smollm3_openhermes_fr.py
```
## Cost Optimization
### Cloud Provider Tips
#### **AWS EC2**
- Use Spot Instances for cost savings
- Monitor usage with CloudWatch
- Use appropriate instance types
#### **Google Cloud Platform**
- Use Preemptible VMs for non-critical training
- Monitor with Cloud Monitoring
- Use committed use discounts
#### **Azure**
- Use Spot VMs for cost optimization
- Monitor with Azure Monitor
- Use reserved instances for long training
### Training Time Estimates
| GPU Type | Batch Size | Estimated Time |
|----------|------------|----------------|
| Tesla T4 (16GB) | 2 | 8-12 hours |
| V100 (32GB) | 4 | 4-6 hours |
| A100 (40GB) | 8 | 2-3 hours |
| H100 (80GB) | 16 | 1-2 hours |
## Security Best Practices
### Token Management
```bash
# Use environment variables
export HF_TOKEN="your_token_here"
export TRACKIO_TOKEN="your_trackio_token"
# Don't hardcode in scripts
# Use IAM roles when possible
```
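In Python code, reading the token from the environment rather than hardcoding it might look like this sketch (the helper name is illustrative):

```python
import os

def get_hf_token() -> str:
    """Read the Hugging Face token from the environment; fail loudly
    if it is missing rather than falling back to a hardcoded value."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before training")
    return token
```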
### Data Privacy
```bash
# Use private repositories for sensitive models
python push_to_huggingface.py model username/private-model --private
# Secure your cloud instance
# Use VPC and security groups
```
## Complete Workflow Example
### 1. Setup Cloud Instance
```bash
# Launch GPU instance
# Install dependencies
git clone <your-repo>
cd <your-repo>
pip install -r requirements.txt
```
### 2. Train Model
```bash
python train.py config/train_smollm3_openhermes_fr.py \
--enable_tracking \
--trackio_url "https://your-space.hf.space" \
--experiment_name "smollm3_fr_v1"
```
### 3. Deploy Model
```bash
python push_to_huggingface.py /output-checkpoint username/smollm3-fr-v1 \
--trackio-url "https://your-space.hf.space" \
--experiment-name "smollm3_fr_v1"
```
### 4. Test Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-v1")
tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-v1")
# Test French generation
prompt = "Qu'est-ce que l'apprentissage automatique?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Support and Resources
### Documentation
- [OpenHermes-FR Dataset](https://huggingface.co/datasets/legmlai/openhermes-fr)
- [SmolLM3 Model](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- [Trackio Monitoring](https://github.com/Josephrp/trackio)
### Community
- [Hugging Face Forums](https://discuss.huggingface.co/)
- [Transformers Documentation](https://huggingface.co/docs/transformers/)
### Examples
- [French Language Models](https://huggingface.co/models?search=french)
- [Instruction Tuned Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)
## Conclusion
This guide provides everything needed to train SmolLM3 models on the OpenHermes-FR dataset in the cloud:
- **Complete Setup** - From cloud instance to model deployment
- **Optimized Configuration** - Tailored for French instruction tuning
- **Monitoring Integration** - Trackio experiment tracking
- **Cost Optimization** - Tips for efficient cloud usage
- **Troubleshooting** - Solutions for common issues
Start training your French language model today!