# Cloud Training Guide for OpenHermes-FR Dataset

This guide provides step-by-step instructions for training SmolLM3 models on cloud instances using the [legmlai/openhermes-fr](https://huggingface.co/datasets/legmlai/openhermes-fr) dataset.

## Overview

The OpenHermes-FR dataset contains 799,875 French instruction-response pairs, making it well suited for fine-tuning SmolLM3 models on French language tasks. This guide covers:

- ✅ **Cloud Instance Setup** - Complete environment configuration
- ✅ **Dataset Integration** - Automatic loading and filtering
- ✅ **Training Configuration** - Optimized for French instruction tuning
- ✅ **Monitoring Integration** - Trackio experiment tracking
- ✅ **Model Deployment** - Push to Hugging Face Hub
## Dataset Information

### Schema

```json
{
  "prompt": "Explique la différence entre la photosynthèse C3 et C4.",
  "accepted_completion": "La photosynthèse C3 utilise… (réponse détaillée)",
  "bad_prompt_detected": false,
  "bad_response_detected": false,
  "bad_entry": false
}
```

### Key Features

- **Size**: 799,875 examples (~1.4GB)
- **Language**: 100% French
- **Quality**: GPT-4o generated responses with automatic filtering
- **License**: ODC-BY 1.0
## Cloud Instance Setup

### 1. Choose Your Cloud Provider

#### **AWS EC2 (Recommended)**

```bash
# Launch instance with GPU
# Recommended: g4dn.xlarge or g5.xlarge
# AMI: Deep Learning AMI (Ubuntu 20.04)
```

#### **Google Cloud Platform**

```bash
# Launch instance with GPU
# Recommended: n1-standard-4 with Tesla T4 or V100
```

#### **Azure**

```bash
# Launch instance with GPU
# Recommended: Standard_NC6s_v3 or Standard_NC12s_v3
```

### 2. Instance Specifications

#### **Minimum Requirements**

- **GPU**: 16GB+ VRAM (Tesla T4, V100, or A100)
- **RAM**: 32GB+ system memory
- **Storage**: 100GB+ SSD
- **CPU**: 8+ cores

#### **Recommended Specifications**

- **GPU**: A100 (40GB) or H100 (80GB)
- **RAM**: 64GB+ system memory
- **Storage**: 200GB+ NVMe SSD
- **CPU**: 16+ cores
### 3. Environment Setup

```bash
# Update system
sudo apt update && sudo apt upgrade -y

# Install CUDA (if not pre-installed)
# Follow the NVIDIA CUDA installation guide for your GPU

# Install Python dependencies
sudo apt install python3-pip python3-venv git -y

# Create and activate a virtual environment
python3 -m venv smollm3_env
source smollm3_env/bin/activate

# Clone the repository
git clone <your-repo-url>
cd <your-repo-directory>

# Install dependencies
pip install -r requirements.txt

# Install additional dependencies for cloud training
pip install accelerate transformers datasets huggingface_hub
```
## Training Configuration

### 1. Use the OpenHermes-FR Config

The repository includes a specialized configuration for the OpenHermes-FR dataset:

```bash
python train.py config/train_smollm3_openhermes_fr.py \
  --enable_tracking \
  --trackio_url "https://your-space.hf.space" \
  --experiment_name "smollm3_fr_openhermes_v1"
```

### 2. Configuration Details

The `config/train_smollm3_openhermes_fr.py` includes:

#### **Dataset Configuration**

```python
dataset_name: str = "legmlai/openhermes-fr"
dataset_split: str = "train"
input_field: str = "prompt"
target_field: str = "accepted_completion"
filter_bad_entries: bool = True
bad_entry_field: str = "bad_entry"
```
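With `filter_bad_entries` enabled, rows whose `bad_entry` flag is set are dropped before training. The predicate is simple enough to sketch on plain dicts (the in-memory rows below are hypothetical; the real script applies the same predicate via `datasets.Dataset.filter`):

```python
# Hypothetical in-memory rows mirroring the OpenHermes-FR schema
rows = [
    {"prompt": "Explique la photosynthèse.", "accepted_completion": "…", "bad_entry": False},
    {"prompt": "(prompt malformé)", "accepted_completion": "…", "bad_entry": True},
]

# Keep only rows whose bad_entry flag is False
clean = [r for r in rows if not r["bad_entry"]]
print(len(clean))  # 1
```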
#### **Training Optimization**

```python
batch_size: int = 2                    # Reduced for French text (longer sequences)
gradient_accumulation_steps: int = 8   # Maintains effective batch size
learning_rate: float = 1e-5            # Lower for instruction tuning
max_iters: int = 2000                  # More iterations for large dataset
```
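The two batching knobs above trade GPU memory for step granularity while keeping the overall batch size per optimizer update constant:

```python
batch_size = 2
gradient_accumulation_steps = 8

# Gradients are accumulated over 8 micro-batches before each optimizer step,
# so the optimizer effectively sees 2 * 8 = 16 examples per update.
effective_batch_size = batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16
```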
#### **Monitoring Integration**

```python
enable_tracking: bool = True
experiment_name: str = "smollm3_openhermes_fr"
```

## Training Commands

### Basic Training

```bash
python train.py config/train_smollm3_openhermes_fr.py
```

### Training with Monitoring

```bash
python train.py config/train_smollm3_openhermes_fr.py \
  --enable_tracking \
  --trackio_url "https://your-trackio-space.hf.space" \
  --experiment_name "smollm3_fr_openhermes_v1"
```

### Training with Custom Parameters

```bash
python train.py config/train_smollm3_openhermes_fr.py \
  --batch_size 4 \
  --learning_rate 2e-5 \
  --max_iters 3000 \
  --enable_tracking \
  --trackio_url "https://your-trackio-space.hf.space" \
  --experiment_name "smollm3_fr_high_lr"
```

### Training with Checkpoint Resume

```bash
python train.py config/train_smollm3_openhermes_fr.py \
  --init_from resume \
  --enable_tracking \
  --trackio_url "https://your-trackio-space.hf.space" \
  --experiment_name "smollm3_fr_resume"
```
## Dataset Processing

### Automatic Filtering

The training script automatically:

- ✅ **Loads** the OpenHermes-FR dataset from Hugging Face
- ✅ **Filters** out bad entries (`bad_entry = true`)
- ✅ **Splits** data into train/validation/test (98/1/1)
- ✅ **Formats** prompts and completions for instruction tuning
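For the full 799,875-example dataset, a 98/1/1 split works out roughly as follows. The integer-floor arithmetic here is an assumption for illustration; the training script's exact rounding may differ by a few examples:

```python
total = 799_875  # full OpenHermes-FR dataset

train_n = total * 98 // 100       # 98% for training
val_n = total * 1 // 100          # 1% for validation
test_n = total - train_n - val_n  # remainder (~1%) for test

print(train_n, val_n, test_n)  # 783877 7998 8000
```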
### Manual Dataset Inspection

```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("legmlai/openhermes-fr")

# Check dataset info
print(f"Dataset size: {len(dataset['train'])}")
print(f"Sample columns: {dataset['train'].column_names}")

# Check filtering
bad_entries = dataset['train'].filter(lambda x: x['bad_entry'])
print(f"Bad entries: {len(bad_entries)}")

# Sample data
sample = dataset['train'][0]
print(f"Prompt: {sample['prompt']}")
print(f"Completion: {sample['accepted_completion']}")
```
## Monitoring and Tracking

### Trackio Integration

The training run automatically logs:

- **Training metrics**: Loss, accuracy, learning rate
- **System metrics**: GPU memory, CPU usage
- **Dataset info**: Size, filtering statistics
- **Model checkpoints**: Regular saves with metadata

### View Training Progress

1. **Trackio Space**: Visit your Trackio Space URL
2. **Experiment Details**: Check the "View Experiments" tab
3. **Metrics**: Monitor loss curves and system usage
4. **Logs**: Download training logs for analysis
## Model Deployment

### Push to Hugging Face Hub

After training, deploy your model:

```bash
python push_to_huggingface.py /output-checkpoint username/smollm3-fr-openhermes \
  --trackio-url "https://your-trackio-space.hf.space" \
  --experiment-name "smollm3_fr_openhermes_v1"
```

### Use Your Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load your fine-tuned model
model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-openhermes")
tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-openhermes")

# Generate French text
prompt = "Expliquez le concept de l'intelligence artificielle."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Performance Optimization

### GPU Memory Management

```bash
# Monitor GPU usage
nvidia-smi -l 1

# Optimize for your GPU:
# For 16GB VRAM:  batch_size=2, gradient_accumulation_steps=8
# For 24GB VRAM:  batch_size=4, gradient_accumulation_steps=4
# For 40GB+ VRAM: batch_size=8, gradient_accumulation_steps=2
```
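Every rung of the VRAM ladder above keeps the effective batch size at 16; a small helper (hypothetical, not part of the repository) makes that invariant explicit:

```python
def batch_settings(vram_gb: int) -> tuple[int, int]:
    """Return (batch_size, gradient_accumulation_steps) for a given VRAM budget."""
    if vram_gb >= 40:
        return 8, 2
    if vram_gb >= 24:
        return 4, 4
    return 2, 8  # 16GB baseline

# 2 * 8 == 4 * 4 == 8 * 2 == 16 examples per optimizer update
for vram in (16, 24, 40):
    bs, accum = batch_settings(vram)
    assert bs * accum == 16
```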
### Training Speed

These options are enabled by default in the configuration:

```python
fp16: bool = True                        # Mixed precision
use_gradient_checkpointing: bool = True  # Gradient checkpointing
use_flash_attention: bool = True         # Flash attention
```
## Troubleshooting

### Common Issues

#### 1. **Out of Memory (OOM)**

```bash
# Reduce batch size
python train.py config/train_smollm3_openhermes_fr.py --batch_size 1

# Or increase gradient accumulation in the config:
# gradient_accumulation_steps = 16
```

#### 2. **Slow Training**

```bash
# Check GPU utilization
nvidia-smi

# Verify data loading: check whether the dataset is cached locally
```

#### 3. **Dataset Loading Issues**

```bash
# Clear the Hugging Face cache
rm -rf ~/.cache/huggingface/

# Check your internet connection
# Verify the dataset name: "legmlai/openhermes-fr"
```

#### 4. **Monitoring Connection Issues**

```bash
# Test the Trackio connection
curl -I https://your-trackio-space.hf.space

# Check token permissions
# Verify the experiment name format
```

### Debug Mode

```bash
# Enable debug logging
export LOG_LEVEL=DEBUG
python train.py config/train_smollm3_openhermes_fr.py
```
## Cost Optimization

### Cloud Provider Tips

#### **AWS EC2**

- Use Spot Instances for cost savings
- Monitor usage with CloudWatch
- Use appropriate instance types

#### **Google Cloud Platform**

- Use Preemptible VMs for non-critical training
- Monitor with Cloud Monitoring
- Use committed use discounts

#### **Azure**

- Use Spot VMs for cost optimization
- Monitor with Azure Monitor
- Use reserved instances for long training runs

### Training Time Estimates

| GPU Type | Batch Size | Estimated Time |
|----------|------------|----------------|
| Tesla T4 (16GB) | 2 | 8-12 hours |
| V100 (32GB) | 4 | 4-6 hours |
| A100 (40GB) | 8 | 2-3 hours |
| H100 (80GB) | 16 | 1-2 hours |
## Security Best Practices

### Token Management

```bash
# Use environment variables
export HF_TOKEN="your_token_here"
export TRACKIO_TOKEN="your_trackio_token"

# Don't hardcode tokens in scripts
# Use IAM roles when possible
```
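On the Python side, a script can fail fast when a required token is missing instead of hardcoding it (`get_required_token` is a hypothetical helper, not part of the repository):

```python
import os

def get_required_token(name: str) -> str:
    """Read a token from the environment, failing loudly if it is unset."""
    token = os.environ.get(name)
    if not token:
        raise RuntimeError(f"{name} is not set; export it before training")
    return token

# Example: requires `export HF_TOKEN=...` beforehand
# hf_token = get_required_token("HF_TOKEN")
```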
### Data Privacy

```bash
# Use private repositories for sensitive models
python push_to_huggingface.py model username/private-model --private

# Secure your cloud instance
# Use VPC and security groups
```
## Complete Workflow Example

### 1. Setup Cloud Instance

```bash
# Launch a GPU instance, then install dependencies
git clone <your-repo>
cd <your-repo>
pip install -r requirements.txt
```

### 2. Train Model

```bash
python train.py config/train_smollm3_openhermes_fr.py \
  --enable_tracking \
  --trackio_url "https://your-space.hf.space" \
  --experiment_name "smollm3_fr_v1"
```

### 3. Deploy Model

```bash
python push_to_huggingface.py /output-checkpoint username/smollm3-fr-v1 \
  --trackio-url "https://your-space.hf.space" \
  --experiment-name "smollm3_fr_v1"
```

### 4. Test Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-v1")
tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-v1")

# Test French generation
prompt = "Qu'est-ce que l'apprentissage automatique?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Support and Resources

### Documentation

- [OpenHermes-FR Dataset](https://huggingface.co/datasets/legmlai/openhermes-fr)
- [SmolLM3 Model](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- [Trackio Monitoring](https://github.com/Josephrp/trackio)

### Community

- [Hugging Face Forums](https://discuss.huggingface.co/)
- [Transformers Documentation](https://huggingface.co/docs/transformers/)

### Examples

- [French Language Models](https://huggingface.co/models?search=french)
- [Instruction Tuned Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)

## Conclusion

This guide provides everything needed to train SmolLM3 models on the OpenHermes-FR dataset in the cloud:

- ✅ **Complete Setup** - From cloud instance to model deployment
- ✅ **Optimized Configuration** - Tailored for French instruction tuning
- ✅ **Monitoring Integration** - Trackio experiment tracking
- ✅ **Cost Optimization** - Tips for efficient cloud usage
- ✅ **Troubleshooting** - Solutions for common issues

Start training your French language model today!