SmolFactory / docs /MONITORING_INTEGRATION_GUIDE.md
Tonic's picture
adds formatting fix
ebe598e verified
|
raw
history blame
7.04 kB
# πŸ”§ Improved Monitoring Integration Guide
## Overview
The monitoring system has been enhanced to support **Hugging Face Datasets** for persistent experiment storage, making it ideal for deployment on Hugging Face Spaces and other cloud environments.
## πŸš€ Key Improvements
### 1. **HF Datasets Integration**
- βœ… **Persistent Storage**: Experiments are saved to HF Datasets repositories
- βœ… **Environment Variables**: Configurable via `HF_TOKEN` and `TRACKIO_DATASET_REPO`
- βœ… **Fallback Support**: Graceful degradation if HF Datasets unavailable
- βœ… **Automatic Backup**: Local files as backup
### 2. **Enhanced Monitoring Features**
- πŸ“Š **Real-time Metrics**: Training metrics logged to both Trackio and HF Datasets
- πŸ”§ **System Metrics**: GPU memory, CPU usage, and system performance
- πŸ“ˆ **Training Summaries**: Comprehensive experiment summaries
- πŸ›‘οΈ **Error Handling**: Robust error logging and recovery
### 3. **Easy Integration**
- πŸ”Œ **Automatic Setup**: Environment variables automatically detected
- πŸ“ **Configuration**: Simple setup with environment variables
- πŸ”„ **Backward Compatible**: Works with existing Trackio setup
## πŸ“‹ Environment Variables
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `HF_TOKEN` | βœ… Yes | None | Your Hugging Face token |
| `TRACKIO_DATASET_REPO` | ❌ No | `tonic/trackio-experiments` | Dataset repository |
| `TRACKIO_URL` | ❌ No | None | Trackio server URL |
| `TRACKIO_TOKEN` | ❌ No | None | Trackio authentication token |
## πŸ› οΈ Setup Instructions
### 1. **Get Your HF Token**
```bash
# Go to https://huggingface.co/settings/tokens
# Create a new token with "Write" permissions
# Copy the token
```
### 2. **Set Environment Variables**
```bash
# For HF Spaces, add these to your Space settings:
HF_TOKEN=your_hf_token_here
TRACKIO_DATASET_REPO=your-username/your-dataset-name
# For local development:
export HF_TOKEN=your_hf_token_here
export TRACKIO_DATASET_REPO=your-username/your-dataset-name
```
### 3. **Create Dataset Repository**
```bash
# Run the setup script
python setup_hf_dataset.py
# Or manually create a dataset on HF Hub
# Go to https://huggingface.co/datasets
# Create a new dataset repository
```
### 4. **Test Configuration**
```bash
# Test your setup
python configure_trackio.py
# Test dataset access
python test_hf_datasets.py
```
## πŸš€ Usage Examples
### **Basic Training with Monitoring**
```bash
# Train with default monitoring
python train.py config/train_smollm3_openhermes_fr.py
# Train with custom dataset repository
TRACKIO_DATASET_REPO=your-username/smollm3-experiments python train.py config/train_smollm3_openhermes_fr.py
```
### **Advanced Training Configuration**
```bash
# Train with custom experiment name
python train.py config/train_smollm3_openhermes_fr.py \
--experiment_name "smollm3_french_tuning_v2" \
--hf_token your_token_here \
--dataset_repo your-username/french-experiments
```
### **Training Scripts with Monitoring**
```bash
# All training scripts now support monitoring:
python train.py config/train_smollm3_openhermes_fr_a100_balanced.py
python train.py config/train_smollm3_openhermes_fr_a100_large.py
python train.py config/train_smollm3_openhermes_fr_a100_max_performance.py
python train.py config/train_smollm3_openhermes_fr_a100_multiple_passes.py
```
## πŸ“Š What Gets Monitored
### **Training Metrics**
- Loss values (training and validation)
- Learning rate
- Gradient norms
- Training steps and epochs
### **System Metrics**
- GPU memory usage
- GPU utilization
- CPU usage
- Memory usage
### **Experiment Data**
- Configuration parameters
- Model checkpoints
- Evaluation results
- Training summaries
### **Artifacts**
- Configuration files
- Training logs
- Evaluation results
- Model checkpoints
## πŸ” Viewing Results
### **1. Trackio Interface**
- Visit your Trackio Space
- Navigate to "Experiments" tab
- View real-time metrics and plots
### **2. HF Dataset Repository**
- Go to your dataset repository on HF Hub
- Browse experiment data
- Download experiment files
### **3. Local Files**
- Check local backup files
- Review training logs
- Examine configuration files
## πŸ› οΈ Configuration Examples
### **Default Setup**
```python
# Uses default dataset: tonic/trackio-experiments
# Requires only HF_TOKEN
```
### **Personal Dataset**
```bash
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=your-username/trackio-experiments
```
### **Team Dataset**
```bash
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=your-org/team-experiments
```
### **Project-Specific Dataset**
```bash
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=your-username/smollm3-experiments
```
## πŸ”§ Troubleshooting
### **Issue: "HF_TOKEN not found"**
```bash
# Solution: Set your HF token
export HF_TOKEN=your_token_here
# Or add to HF Space environment variables
```
### **Issue: "Failed to load dataset"**
```bash
# Solutions:
# 1. Check token has read access
# 2. Verify dataset repository exists
# 3. Run setup script: python setup_hf_dataset.py
```
### **Issue: "Failed to save experiments"**
```bash
# Solutions:
# 1. Check token has write permissions
# 2. Verify dataset repository exists
# 3. Check network connectivity
```
### **Issue: "Monitoring not working"**
```bash
# Solutions:
# 1. Check environment variables
# 2. Run configuration test: python configure_trackio.py
# 3. Check logs for specific errors
```
## πŸ“ˆ Benefits
### **For HF Spaces Deployment**
- βœ… **Persistent Storage**: Data survives Space restarts
- βœ… **No Local Storage**: No dependency on ephemeral storage
- βœ… **Scalable**: Works with any dataset size
- βœ… **Secure**: Private dataset storage
### **For Experiment Management**
- βœ… **Centralized**: All experiments in one place
- βœ… **Searchable**: Easy to find specific experiments
- βœ… **Versioned**: Dataset versioning for experiments
- βœ… **Collaborative**: Share experiments with team
### **For Development**
- βœ… **Flexible**: Easy to switch between datasets
- βœ… **Configurable**: Environment-based configuration
- βœ… **Robust**: Fallback mechanisms
- βœ… **Debuggable**: Comprehensive logging
## 🎯 Next Steps
1. **Set up your HF token and dataset repository**
2. **Test the configuration with `python configure_trackio.py`**
3. **Run a training experiment to verify monitoring**
4. **Check your HF Dataset repository for experiment data**
5. **View results in your Trackio interface**
## πŸ“š Related Files
- `monitoring.py` - Enhanced monitoring with HF Datasets support
- `train.py` - Updated training script with monitoring integration
- `configure_trackio.py` - Configuration and testing script
- `setup_hf_dataset.py` - Dataset repository setup
- `test_hf_datasets.py` - Dataset access testing
- `ENVIRONMENT_VARIABLES.md` - Environment variable reference
- `HF_DATASETS_GUIDE.md` - Detailed HF Datasets guide
---
**πŸŽ‰ Your experiments are now persistently stored and easily accessible!**