Cloud Training Guide for OpenHermes-FR Dataset
This guide provides step-by-step instructions for training SmolLM3 models on cloud instances using the legmlai/openhermes-fr dataset.
Overview
The OpenHermes-FR dataset contains 799,875 French instruction-response pairs, making it well suited for fine-tuning SmolLM3 models on French-language tasks. This guide covers:
- ✅ Cloud Instance Setup - Complete environment configuration
- ✅ Dataset Integration - Automatic loading and filtering
- ✅ Training Configuration - Optimized for French instruction tuning
- ✅ Monitoring Integration - Trackio experiment tracking
- ✅ Model Deployment - Push to Hugging Face Hub
Dataset Information
Schema
{
"prompt": "Explique la différence entre la photosynthèse C3 et C4.",
"accepted_completion": "La photosynthèse C3 utilise… (réponse détaillée)",
"bad_prompt_detected": false,
"bad_response_detected": false,
"bad_entry": false
}
Key Features
- Size: 799,875 examples (~1.4GB)
- Language: 100% French
- Quality: GPT-4o generated responses with automatic filtering
- License: ODC-BY 1.0
Cloud Instance Setup
1. Choose Your Cloud Provider
AWS EC2 (Recommended)
# Launch instance with GPU
# Recommended: g4dn.xlarge or g5.xlarge
# AMI: Deep Learning AMI (Ubuntu 20.04)
Google Cloud Platform
# Launch instance with GPU
# Recommended: n1-standard-4 with Tesla T4 or V100
Azure
# Launch instance with GPU
# Recommended: Standard_NC6s_v3 or Standard_NC12s_v3
2. Instance Specifications
Minimum Requirements
- GPU: 16GB+ VRAM (Tesla T4, V100, or A100)
- RAM: 32GB+ system memory
- Storage: 100GB+ SSD
- CPU: 8+ cores
Recommended Specifications
- GPU: A100 (40GB) or H100 (80GB)
- RAM: 64GB+ system memory
- Storage: 200GB+ NVMe SSD
- CPU: 16+ cores
3. Environment Setup
# Update system
sudo apt update && sudo apt upgrade -y
# Install CUDA (if not pre-installed)
# Follow NVIDIA CUDA installation guide for your GPU
# Install Python dependencies
sudo apt install python3-pip python3-venv git -y
# Create virtual environment
python3 -m venv smollm3_env
source smollm3_env/bin/activate
# Clone repository
git clone <your-repo-url>
cd <your-repo-directory>
# Install dependencies
pip install -r requirements.txt
# Install additional dependencies for cloud training
pip install accelerate transformers datasets huggingface_hub
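Before launching a long training run, it is worth confirming that the GPU is actually visible to PyTorch. A small sanity-check sketch (it assumes `torch` was installed via `requirements.txt`, and degrades gracefully if not):

```python
def cuda_status() -> str:
    """Return a human-readable summary of the CUDA environment."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return f"torch {torch.__version__}, CUDA not available"
    props = torch.cuda.get_device_properties(0)
    return f"{props.name}: {props.total_memory / 1e9:.1f} GB VRAM"

print(cuda_status())
```

If this reports "CUDA not available" on a GPU instance, revisit the CUDA installation step before starting training.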
Training Configuration
1. Use the OpenHermes-FR Config
The repository includes a specialized configuration for the OpenHermes-FR dataset:
python train.py config/train_smollm3_openhermes_fr.py \
--enable_tracking \
--trackio_url "https://your-space.hf.space" \
--experiment_name "smollm3_fr_openhermes_v1"
2. Configuration Details
The config/train_smollm3_openhermes_fr.py configuration includes:
Dataset Configuration
dataset_name: str = "legmlai/openhermes-fr"
dataset_split: str = "train"
input_field: str = "prompt"
target_field: str = "accepted_completion"
filter_bad_entries: bool = True
bad_entry_field: str = "bad_entry"
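The quality filter these fields drive can be sketched in plain Python. The records below are made up for illustration; they only mirror the dataset's schema:

```python
# Sketch of the filter implied by filter_bad_entries / bad_entry_field.
# These records are illustrative, not real dataset rows.
records = [
    {"prompt": "Explique la photosynthèse.", "accepted_completion": "...", "bad_entry": False},
    {"prompt": "???", "accepted_completion": "", "bad_entry": True},
    {"prompt": "Définis l'entropie.", "accepted_completion": "...", "bad_entry": False},
]

bad_entry_field = "bad_entry"
clean = [r for r in records if not r[bad_entry_field]]
print(f"Kept {len(clean)} of {len(records)} records")  # Kept 2 of 3 records
```

On the real dataset the same predicate is applied via the `datasets` library's `filter` method, keeping only rows where `bad_entry` is false.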
Training Optimization
batch_size: int = 2 # Reduced for French text (longer sequences)
gradient_accumulation_steps: int = 8 # Maintains effective batch size
learning_rate: float = 1e-5 # Lower for instruction tuning
max_iters: int = 2000 # More iterations for large dataset
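These two knobs trade per-step memory against effective batch size; what matters for optimization is their product:

```python
# Effective batch size = per-device batch size x accumulation steps.
batch_size = 2
gradient_accumulation_steps = 8
effective_batch_size = batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16
```

Halving `batch_size` while doubling `gradient_accumulation_steps` keeps the effective batch size (and thus the optimization behavior) roughly the same while using less VRAM per step.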
Monitoring Integration
enable_tracking: bool = True
experiment_name: str = "smollm3_openhermes_fr"
Training Commands
Basic Training
python train.py config/train_smollm3_openhermes_fr.py
Training with Monitoring
python train.py config/train_smollm3_openhermes_fr.py \
--enable_tracking \
--trackio_url "https://your-trackio-space.hf.space" \
--experiment_name "smollm3_fr_openhermes_v1"
Training with Custom Parameters
python train.py config/train_smollm3_openhermes_fr.py \
--batch_size 4 \
--learning_rate 2e-5 \
--max_iters 3000 \
--enable_tracking \
--trackio_url "https://your-trackio-space.hf.space" \
--experiment_name "smollm3_fr_high_lr"
Training with Checkpoint Resume
python train.py config/train_smollm3_openhermes_fr.py \
--init_from resume \
--enable_tracking \
--trackio_url "https://your-trackio-space.hf.space" \
--experiment_name "smollm3_fr_resume"
Dataset Processing
Automatic Filtering
The training script automatically:
- ✅ Loads the OpenHermes-FR dataset from Hugging Face
- ✅ Filters out bad entries (bad_entry = true)
- ✅ Splits data into train/validation/test (98/1/1)
- ✅ Formats prompts and completions for instruction tuning
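With 799,875 examples and a 98/1/1 split, the resulting partition sizes work out to roughly:

```python
# Approximate split sizes; the real script splits after bad-entry filtering,
# so actual counts will be somewhat smaller.
total = 799_875
splits = {"train": 0.98, "validation": 0.01, "test": 0.01}

sizes = {name: int(total * frac) for name, frac in splits.items()}
print(sizes)
```

That leaves on the order of 8,000 examples each for validation and test, which is plenty for tracking instruction-tuning loss.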
Manual Dataset Inspection
from datasets import load_dataset
# Load dataset
dataset = load_dataset("legmlai/openhermes-fr")
# Check dataset info
print(f"Dataset size: {len(dataset['train'])}")
print(f"Sample columns: {dataset['train'].column_names}")
# Check filtering
bad_entries = dataset['train'].filter(lambda x: x['bad_entry'])
print(f"Bad entries: {len(bad_entries)}")
# Sample data
sample = dataset['train'][0]
print(f"Prompt: {sample['prompt']}")
print(f"Completion: {sample['accepted_completion']}")
Monitoring and Tracking
Trackio Integration
The training automatically logs:
- Training metrics: Loss, accuracy, learning rate
- System metrics: GPU memory, CPU usage
- Dataset info: Size, filtering statistics
- Model checkpoints: Regular saves with metadata
View Training Progress
- Trackio Space: Visit your Trackio Space URL
- Experiment Details: Check the "View Experiments" tab
- Metrics: Monitor loss curves and system usage
- Logs: Download training logs for analysis
Model Deployment
Push to Hugging Face Hub
After training, deploy your model:
python push_to_huggingface.py /output-checkpoint username/smollm3-fr-openhermes \
--trackio-url "https://your-trackio-space.hf.space" \
--experiment-name "smollm3_fr_openhermes_v1"
Use Your Model
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load your fine-tuned model
model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-openhermes")
tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-openhermes")
# Generate French text
prompt = "Expliquez le concept de l'intelligence artificielle."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Performance Optimization
GPU Memory Management
# Monitor GPU usage
nvidia-smi -l 1
# Optimize for your GPU
# For 16GB VRAM: batch_size=2, gradient_accumulation_steps=8
# For 24GB VRAM: batch_size=4, gradient_accumulation_steps=4
# For 40GB+ VRAM: batch_size=8, gradient_accumulation_steps=2
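The VRAM-to-batch-size mapping above can be captured in a small helper. The thresholds are the rule-of-thumb values from the comments, not measured limits, and the helper name is hypothetical:

```python
def suggest_batch_config(vram_gb: float) -> tuple[int, int]:
    """Return (batch_size, gradient_accumulation_steps) for a VRAM budget.

    Rule-of-thumb tiers; the effective batch size stays at 16 in each.
    """
    if vram_gb >= 40:
        return 8, 2
    if vram_gb >= 24:
        return 4, 4
    return 2, 8  # 16 GB tier

print(suggest_batch_config(16))  # (2, 8)
print(suggest_batch_config(40))  # (8, 2)
```

Note that every tier multiplies out to an effective batch size of 16, so changing tiers changes speed and memory use but not the optimization dynamics.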
Training Speed
# Use mixed precision (enabled by default)
fp16: bool = True
# Enable gradient checkpointing (enabled by default)
use_gradient_checkpointing: bool = True
# Use flash attention (enabled by default)
use_flash_attention: bool = True
Troubleshooting
Common Issues
1. Out of Memory (OOM)
# Reduce batch size
python train.py config/train_smollm3_openhermes_fr.py --batch_size 1
# Increase gradient accumulation
# Edit config: gradient_accumulation_steps = 16
2. Slow Training
# Check GPU utilization
nvidia-smi
# Verify data loading
# Check if dataset is cached locally
3. Dataset Loading Issues
# Clear cache
rm -rf ~/.cache/huggingface/
# Check internet connection
# Verify dataset name: "legmlai/openhermes-fr"
4. Monitoring Connection Issues
# Test Trackio connection
curl -I https://your-trackio-space.hf.space
# Check token permissions
# Verify experiment name format
Debug Mode
# Enable debug logging
export LOG_LEVEL=DEBUG
python train.py config/train_smollm3_openhermes_fr.py
Cost Optimization
Cloud Provider Tips
AWS EC2
- Use Spot Instances for cost savings
- Monitor usage with CloudWatch
- Use appropriate instance types
Google Cloud Platform
- Use Preemptible VMs for non-critical training
- Monitor with Cloud Monitoring
- Use committed use discounts
Azure
- Use Spot VMs for cost optimization
- Monitor with Azure Monitor
- Use reserved instances for long training
Training Time Estimates
| GPU Type | Batch Size | Estimated Time |
|---|---|---|
| Tesla T4 (16GB) | 2 | 8-12 hours |
| V100 (32GB) | 4 | 4-6 hours |
| A100 (40GB) | 8 | 2-3 hours |
| H100 (80GB) | 16 | 1-2 hours |
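As a rough sanity check on the table, total wall time is just steps times seconds per step. The per-step throughputs below are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope wall-time estimate for max_iters training steps.
max_iters = 2000
sec_per_step = {"Tesla T4": 18.0, "A100": 4.5}  # assumed, not measured

for gpu, s in sec_per_step.items():
    hours = max_iters * s / 3600
    print(f"{gpu}: ~{hours:.1f} h")
```

At 18 s/step a T4 lands at about 10 hours, consistent with the 8-12 hour range in the table; your actual step time will vary with sequence length and batch size.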
Security Best Practices
Token Management
# Use environment variables
export HF_TOKEN="your_token_here"
export TRACKIO_TOKEN="your_trackio_token"
# Don't hardcode in scripts
# Use IAM roles when possible
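Scripts can then read these tokens from the environment instead of hardcoding them. A minimal sketch (`require_token` is a hypothetical helper, not part of the repository):

```python
import os

def require_token(name: str) -> str:
    """Fetch a secret from the environment, failing fast if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; export it before training.")
    return value

os.environ.setdefault("HF_TOKEN", "dummy-token-for-demo")  # demo only
print(require_token("HF_TOKEN"))
```

Failing fast with a clear message beats discovering a missing token halfway through a multi-hour training run.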
Data Privacy
# Use private repositories for sensitive models
python push_to_huggingface.py model username/private-model --private
# Secure your cloud instance
# Use VPC and security groups
Complete Workflow Example
1. Setup Cloud Instance
# Launch GPU instance
# Install dependencies
git clone <your-repo>
cd <your-repo>
pip install -r requirements.txt
2. Train Model
python train.py config/train_smollm3_openhermes_fr.py \
--enable_tracking \
--trackio_url "https://your-space.hf.space" \
--experiment_name "smollm3_fr_v1"
3. Deploy Model
python push_to_huggingface.py /output-checkpoint username/smollm3-fr-v1 \
--trackio-url "https://your-space.hf.space" \
--experiment-name "smollm3_fr_v1"
4. Test Model
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-v1")
tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-v1")
# Test French generation
prompt = "Qu'est-ce que l'apprentissage automatique?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Conclusion
This guide provides everything needed to train SmolLM3 models on the OpenHermes-FR dataset in the cloud:
- ✅ Complete Setup - From cloud instance to model deployment
- ✅ Optimized Configuration - Tailored for French instruction tuning
- ✅ Monitoring Integration - Trackio experiment tracking
- ✅ Cost Optimization - Tips for efficient cloud usage
- ✅ Troubleshooting - Solutions for common issues
Start training your French language model today!