Spaces:
Running
Running
# Cloud Training Guide for OpenHermes-FR Dataset | |
This guide provides step-by-step instructions for training SmolLM3 models on cloud instances using the [legmlai/openhermes-fr](https://huggingface.co/datasets/legmlai/openhermes-fr) dataset. | |
## Overview | |
The OpenHermes-FR dataset contains 799,875 French instruction-response pairs, perfect for fine-tuning SmolLM3 models for French language tasks. This guide covers: | |
- ✅ **Cloud Instance Setup** - Complete environment configuration | |
- ✅ **Dataset Integration** - Automatic loading and filtering | |
- ✅ **Training Configuration** - Optimized for French instruction tuning | |
- ✅ **Monitoring Integration** - Trackio experiment tracking | |
- ✅ **Model Deployment** - Push to Hugging Face Hub | |
## Dataset Information | |
### Schema | |
```json | |
{ | |
"prompt": "Explique la différence entre la photosynthèse C3 et C4.", | |
"accepted_completion": "La photosynthèse C3 utilise… (réponse détaillée)", | |
"bad_prompt_detected": false, | |
"bad_response_detected": false, | |
"bad_entry": false | |
} | |
``` | |
### Key Features | |
- **Size**: 799,875 examples (~1.4GB) | |
- **Language**: 100% French | |
- **Quality**: GPT-4o generated responses with automatic filtering | |
- **License**: ODC-BY 1.0 | |
## Cloud Instance Setup | |
### 1. Choose Your Cloud Provider | |
#### **AWS EC2 (Recommended)** | |
```bash | |
# Launch instance with GPU | |
# Recommended: g4dn.xlarge or g5.xlarge | |
# AMI: Deep Learning AMI (Ubuntu 20.04) | |
``` | |
#### **Google Cloud Platform** | |
```bash | |
# Launch instance with GPU | |
# Recommended: n1-standard-4 with Tesla T4 or V100 | |
``` | |
#### **Azure** | |
```bash | |
# Launch instance with GPU | |
# Recommended: Standard_NC6s_v3 or Standard_NC12s_v3 | |
``` | |
### 2. Instance Specifications | |
#### **Minimum Requirements** | |
- **GPU**: 16GB+ VRAM (Tesla T4, V100, or A100) | |
- **RAM**: 32GB+ system memory | |
- **Storage**: 100GB+ SSD | |
- **CPU**: 8+ cores | |
#### **Recommended Specifications** | |
- **GPU**: A100 (40GB) or H100 (80GB) | |
- **RAM**: 64GB+ system memory | |
- **Storage**: 200GB+ NVMe SSD | |
- **CPU**: 16+ cores | |
### 3. Environment Setup | |
```bash | |
# Update system | |
sudo apt update && sudo apt upgrade -y | |
# Install CUDA (if not pre-installed) | |
# Follow NVIDIA CUDA installation guide for your GPU | |
# Install Python dependencies | |
sudo apt install python3-pip python3-venv git -y | |
# Create virtual environment | |
python3 -m venv smollm3_env | |
source smollm3_env/bin/activate | |
# Clone repository | |
git clone <your-repo-url> | |
cd <your-repo-directory> | |
# Install dependencies | |
pip install -r requirements.txt | |
# Install additional dependencies for cloud training | |
pip install accelerate transformers datasets huggingface_hub | |
``` | |
## Training Configuration | |
### 1. Use the OpenHermes-FR Config | |
The repository includes a specialized configuration for the OpenHermes-FR dataset: | |
```bash | |
python train.py config/train_smollm3_openhermes_fr.py \ | |
--enable_tracking \ | |
--trackio_url "https://your-space.hf.space" \ | |
--experiment_name "smollm3_fr_openhermes_v1" | |
``` | |
### 2. Configuration Details | |
The `config/train_smollm3_openhermes_fr.py` includes: | |
#### **Dataset Configuration** | |
```python | |
dataset_name: str = "legmlai/openhermes-fr" | |
dataset_split: str = "train" | |
input_field: str = "prompt" | |
target_field: str = "accepted_completion" | |
filter_bad_entries: bool = True | |
bad_entry_field: str = "bad_entry" | |
``` | |
#### **Training Optimization** | |
```python | |
batch_size: int = 2 # Reduced for French text (longer sequences) | |
gradient_accumulation_steps: int = 8 # Maintains effective batch size | |
learning_rate: float = 1e-5 # Lower for instruction tuning | |
max_iters: int = 2000 # More iterations for large dataset | |
``` | |
#### **Monitoring Integration** | |
```python | |
enable_tracking: bool = True | |
experiment_name: str = "smollm3_openhermes_fr" | |
``` | |
## Training Commands | |
### Basic Training | |
```bash | |
python train.py config/train_smollm3_openhermes_fr.py | |
``` | |
### Training with Monitoring | |
```bash | |
python train.py config/train_smollm3_openhermes_fr.py \ | |
--enable_tracking \ | |
--trackio_url "https://your-trackio-space.hf.space" \ | |
--experiment_name "smollm3_fr_openhermes_v1" | |
``` | |
### Training with Custom Parameters | |
```bash | |
python train.py config/train_smollm3_openhermes_fr.py \ | |
--batch_size 4 \ | |
--learning_rate 2e-5 \ | |
--max_iters 3000 \ | |
--enable_tracking \ | |
--trackio_url "https://your-trackio-space.hf.space" \ | |
--experiment_name "smollm3_fr_high_lr" | |
``` | |
### Training with Checkpoint Resume | |
```bash | |
python train.py config/train_smollm3_openhermes_fr.py \ | |
--init_from resume \ | |
--enable_tracking \ | |
--trackio_url "https://your-trackio-space.hf.space" \ | |
--experiment_name "smollm3_fr_resume" | |
``` | |
## Dataset Processing | |
### Automatic Filtering | |
The training script automatically: | |
- ✅ **Loads** the OpenHermes-FR dataset from Hugging Face | |
- ✅ **Filters** out bad entries (`bad_entry = true`) | |
- ✅ **Splits** data into train/validation/test (98/1/1) | |
- ✅ **Formats** prompts and completions for instruction tuning | |
### Manual Dataset Inspection | |
```python | |
from datasets import load_dataset | |
# Load dataset | |
dataset = load_dataset("legmlai/openhermes-fr") | |
# Check dataset info | |
print(f"Dataset size: {len(dataset['train'])}") | |
print(f"Sample columns: {dataset['train'].column_names}") | |
# Check filtering | |
bad_entries = dataset['train'].filter(lambda x: x['bad_entry']) | |
print(f"Bad entries: {len(bad_entries)}") | |
# Sample data | |
sample = dataset['train'][0] | |
print(f"Prompt: {sample['prompt']}") | |
print(f"Completion: {sample['accepted_completion']}") | |
``` | |
## Monitoring and Tracking | |
### Trackio Integration | |
The training automatically logs: | |
- **Training metrics**: Loss, accuracy, learning rate | |
- **System metrics**: GPU memory, CPU usage | |
- **Dataset info**: Size, filtering statistics | |
- **Model checkpoints**: Regular saves with metadata | |
### View Training Progress | |
1. **Trackio Space**: Visit your Trackio Space URL | |
2. **Experiment Details**: Check the "View Experiments" tab | |
3. **Metrics**: Monitor loss curves and system usage | |
4. **Logs**: Download training logs for analysis | |
## Model Deployment | |
### Push to Hugging Face Hub | |
After training, deploy your model: | |
```bash | |
python push_to_huggingface.py /output-checkpoint username/smollm3-fr-openhermes \ | |
--trackio-url "https://your-trackio-space.hf.space" \ | |
--experiment-name "smollm3_fr_openhermes_v1" | |
``` | |
### Use Your Model | |
```python | |
from transformers import AutoModelForCausalLM, AutoTokenizer | |
# Load your fine-tuned model | |
model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-openhermes") | |
tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-openhermes") | |
# Generate French text | |
prompt = "Expliquez le concept de l'intelligence artificielle." | |
inputs = tokenizer(prompt, return_tensors="pt") | |
outputs = model.generate(**inputs, max_new_tokens=200) | |
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) | |
``` | |
## Performance Optimization | |
### GPU Memory Management | |
```bash | |
# Monitor GPU usage | |
nvidia-smi -l 1 | |
# Optimize for your GPU | |
# For 16GB VRAM: batch_size=2, gradient_accumulation_steps=8 | |
# For 24GB VRAM: batch_size=4, gradient_accumulation_steps=4 | |
# For 40GB+ VRAM: batch_size=8, gradient_accumulation_steps=2 | |
``` | |
### Training Speed | |
```bash | |
# Use mixed precision (enabled by default) | |
fp16: bool = True | |
# Enable gradient checkpointing (enabled by default) | |
use_gradient_checkpointing: bool = True | |
# Use flash attention (enabled by default) | |
use_flash_attention: bool = True | |
``` | |
## Troubleshooting | |
### Common Issues | |
#### 1. **Out of Memory (OOM)** | |
```bash | |
# Reduce batch size | |
python train.py config/train_smollm3_openhermes_fr.py --batch_size 1 | |
# Increase gradient accumulation | |
# Edit config: gradient_accumulation_steps = 16 | |
``` | |
#### 2. **Slow Training** | |
```bash | |
# Check GPU utilization | |
nvidia-smi | |
# Verify data loading | |
# Check if dataset is cached locally | |
``` | |
#### 3. **Dataset Loading Issues** | |
```bash | |
# Clear cache | |
rm -rf ~/.cache/huggingface/ | |
# Check internet connection | |
# Verify dataset name: "legmlai/openhermes-fr" | |
``` | |
#### 4. **Monitoring Connection Issues** | |
```bash | |
# Test Trackio connection | |
curl -I https://your-trackio-space.hf.space | |
# Check token permissions | |
# Verify experiment name format | |
``` | |
### Debug Mode | |
```bash | |
# Enable debug logging | |
export LOG_LEVEL=DEBUG | |
python train.py config/train_smollm3_openhermes_fr.py | |
``` | |
## Cost Optimization | |
### Cloud Provider Tips | |
#### **AWS EC2** | |
- Use Spot Instances for cost savings | |
- Monitor usage with CloudWatch | |
- Use appropriate instance types | |
#### **Google Cloud Platform** | |
- Use Preemptible VMs for non-critical training | |
- Monitor with Cloud Monitoring | |
- Use committed use discounts | |
#### **Azure** | |
- Use Spot VMs for cost optimization | |
- Monitor with Azure Monitor | |
- Use reserved instances for long training | |
### Training Time Estimates | |
| GPU Type | Batch Size | Estimated Time | | |
|----------|------------|----------------| | |
| Tesla T4 (16GB) | 2 | 8-12 hours | | |
| V100 (32GB) | 4 | 4-6 hours | | |
| A100 (40GB) | 8 | 2-3 hours | | |
| H100 (80GB) | 16 | 1-2 hours | | |
## Security Best Practices | |
### Token Management | |
```bash | |
# Use environment variables | |
export HF_TOKEN="your_token_here" | |
export TRACKIO_TOKEN="your_trackio_token" | |
# Don't hardcode in scripts | |
# Use IAM roles when possible | |
``` | |
### Data Privacy | |
```bash | |
# Use private repositories for sensitive models | |
python push_to_huggingface.py model username/private-model --private | |
# Secure your cloud instance | |
# Use VPC and security groups | |
``` | |
## Complete Workflow Example | |
### 1. Setup Cloud Instance | |
```bash | |
# Launch GPU instance | |
# Install dependencies | |
git clone <your-repo> | |
cd <your-repo> | |
pip install -r requirements.txt | |
``` | |
### 2. Train Model | |
```bash | |
python train.py config/train_smollm3_openhermes_fr.py \ | |
--enable_tracking \ | |
--trackio_url "https://your-space.hf.space" \ | |
--experiment_name "smollm3_fr_v1" | |
``` | |
### 3. Deploy Model | |
```bash | |
python push_to_huggingface.py /output-checkpoint username/smollm3-fr-v1 \ | |
--trackio-url "https://your-space.hf.space" \ | |
--experiment-name "smollm3_fr_v1" | |
``` | |
### 4. Test Model | |
```python | |
from transformers import AutoModelForCausalLM, AutoTokenizer | |
model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-v1") | |
tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-v1") | |
# Test French generation | |
prompt = "Qu'est-ce que l'apprentissage automatique?" | |
inputs = tokenizer(prompt, return_tensors="pt") | |
outputs = model.generate(**inputs, max_new_tokens=100) | |
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) | |
``` | |
## Support and Resources | |
### Documentation | |
- [OpenHermes-FR Dataset](https://huggingface.co/datasets/legmlai/openhermes-fr) | |
- [SmolLM3 Model](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) | |
- [Trackio Monitoring](https://github.com/Josephrp/trackio) | |
### Community | |
- [Hugging Face Forums](https://discuss.huggingface.co/) | |
- [Transformers Documentation](https://huggingface.co/docs/transformers/) | |
### Examples | |
- [French Language Models](https://huggingface.co/models?search=french) | |
- [Instruction Tuned Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads) | |
## Conclusion | |
This guide provides everything needed to train SmolLM3 models on the OpenHermes-FR dataset in the cloud: | |
- ✅ **Complete Setup** - From cloud instance to model deployment | |
- ✅ **Optimized Configuration** - Tailored for French instruction tuning | |
- ✅ **Monitoring Integration** - Trackio experiment tracking | |
- ✅ **Cost Optimization** - Tips for efficient cloud usage | |
- ✅ **Troubleshooting** - Solutions for common issues | |
Start training your French language model today! |