# Improved Monitoring Integration Guide
## Overview
The monitoring system has been enhanced to support **Hugging Face Datasets** for persistent experiment storage, making it ideal for deployment on Hugging Face Spaces and other cloud environments.
## Key Improvements
### 1. **HF Datasets Integration**
- **Persistent Storage**: Experiments are saved to HF Datasets repositories
- **Environment Variables**: Configurable via `HF_TOKEN` and `TRACKIO_DATASET_REPO`
- **Fallback Support**: Graceful degradation if HF Datasets is unavailable
- **Automatic Backup**: Local files are kept as a backup
### 2. **Enhanced Monitoring Features**
- **Real-time Metrics**: Training metrics are logged to both Trackio and HF Datasets
- **System Metrics**: GPU memory, CPU usage, and system performance
- **Training Summaries**: Comprehensive experiment summaries
- **Error Handling**: Robust error logging and recovery
### 3. **Easy Integration**
- **Automatic Setup**: Environment variables are detected automatically
- **Configuration**: Simple setup with environment variables
- **Backward Compatible**: Works with an existing Trackio setup
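
The fallback and backup behaviour listed above can be sketched as follows. This is a minimal illustration, not the code in `monitoring.py`: the helper name `save_with_fallback`, the `upload` callable, and the backup path are all hypothetical stand-ins.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict


def save_with_fallback(
    record: Dict[str, Any],
    upload: Callable[[Dict[str, Any]], None],
    backup_path: Path,
) -> str:
    """Write a local backup, then try the remote upload.

    Returns "remote" if the upload succeeded and "local" if only the
    backup file was written. `upload` is a hypothetical stand-in for
    whatever pushes the record to the HF Datasets repository.
    """
    # Always keep a local copy first, so a failed upload loses nothing.
    backup_path.parent.mkdir(parents=True, exist_ok=True)
    backup_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    try:
        upload(record)
    except Exception as exc:  # degrade gracefully instead of stopping training
        print(f"Remote save failed, keeping local backup only: {exc}")
        return "local"
    return "remote"
```

Writing the backup before attempting the upload means a network failure never costs more than the remote copy.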
## Environment Variables
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `HF_TOKEN` | Yes | None | Your Hugging Face token |
| `TRACKIO_DATASET_REPO` | No | `tonic/trackio-experiments` | Dataset repository |
| `TRACKIO_URL` | No | None | Trackio server URL |
| `TRACKIO_TOKEN` | No | None | Trackio authentication token |
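
A minimal sketch of how a script might read these variables. The `MonitoringConfig` class and `load_monitoring_config` helper are hypothetical names used for illustration; only the variable names and the default repository come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MonitoringConfig:
    hf_token: str
    dataset_repo: str
    trackio_url: Optional[str]
    trackio_token: Optional[str]


def load_monitoring_config(env: Mapping[str, str] = os.environ) -> MonitoringConfig:
    """Build a config from environment variables (hypothetical helper)."""
    token = env.get("HF_TOKEN")
    if not token:
        # HF_TOKEN is the only variable marked as required.
        raise RuntimeError("HF_TOKEN not found")
    return MonitoringConfig(
        hf_token=token,
        # Optional variables fall back to the documented defaults.
        dataset_repo=env.get("TRACKIO_DATASET_REPO", "tonic/trackio-experiments"),
        trackio_url=env.get("TRACKIO_URL"),
        trackio_token=env.get("TRACKIO_TOKEN"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests.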
## Setup Instructions
### 1. **Get Your HF Token**
```bash
# Go to https://huggingface.co/settings/tokens
# Create a new token with "Write" permissions
# Copy the token
```
### 2. **Set Environment Variables**
```bash
# For HF Spaces, add these to your Space settings:
HF_TOKEN=your_hf_token_here
TRACKIO_DATASET_REPO=your-username/your-dataset-name
# For local development:
export HF_TOKEN=your_hf_token_here
export TRACKIO_DATASET_REPO=your-username/your-dataset-name
```
### 3. **Create Dataset Repository**
```bash
# Run the setup script
python setup_hf_dataset.py
# Or manually create a dataset on HF Hub
# Go to https://huggingface.co/datasets
# Create a new dataset repository
```
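
The contents of `setup_hf_dataset.py` are not shown in this guide. As a rough sketch of the same step done programmatically, assuming the `huggingface_hub` package is installed, the repository could be created with `create_repo`. The helper name `ensure_dataset_repo` and the choice of a private repository are assumptions.

```python
def ensure_dataset_repo(repo_id: str, token: str, private: bool = True) -> str:
    """Create the dataset repository if it does not exist yet; return its URL."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import create_repo

    url = create_repo(
        repo_id,
        token=token,
        repo_type="dataset",  # a dataset repository, not a model or Space
        private=private,      # assumption: keep experiment data private
        exist_ok=True,        # do not fail when the repository already exists
    )
    return str(url)
```

A typical call would pass the values of `TRACKIO_DATASET_REPO` and `HF_TOKEN`, for example `ensure_dataset_repo("your-username/your-dataset-name", token)`. The token needs write permission for this to succeed.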
### 4. **Test Configuration**
```bash
# Test your setup
python configure_trackio.py
# Test dataset access
python test_hf_datasets.py
```
## Usage Examples
### **Basic Training with Monitoring**
```bash
# Train with default monitoring
python train.py config/train_smollm3_openhermes_fr.py
# Train with custom dataset repository
TRACKIO_DATASET_REPO=your-username/smollm3-experiments python train.py config/train_smollm3_openhermes_fr.py
```
### **Advanced Training Configuration**
```bash
# Train with custom experiment name
python train.py config/train_smollm3_openhermes_fr.py \
  --experiment_name "smollm3_french_tuning_v2" \
  --hf_token your_token_here \
  --dataset_repo your-username/french-experiments
```
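
How `train.py` parses these flags is not shown here. One plausible arrangement, sketched with `argparse`, lets each flag default to the corresponding environment variable. The flag names come from the command above; the fallback behaviour is an assumption.

```python
import argparse
import os
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse training arguments; flags override environment variables."""
    parser = argparse.ArgumentParser(description="Training with monitoring (sketch)")
    parser.add_argument("config", help="Path to the training config file")
    parser.add_argument("--experiment_name", default=None)
    # Assumed behaviour: fall back to the documented environment variables.
    parser.add_argument("--hf_token", default=os.environ.get("HF_TOKEN"))
    parser.add_argument(
        "--dataset_repo",
        default=os.environ.get("TRACKIO_DATASET_REPO", "tonic/trackio-experiments"),
    )
    return parser.parse_args(argv)
```

With this layout, exporting `HF_TOKEN` once is enough, and the token stays out of shell history because it never appears on the command line.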
### **Training Scripts with Monitoring**
```bash
# All training scripts now support monitoring:
python train.py config/train_smollm3_openhermes_fr_a100_balanced.py
python train.py config/train_smollm3_openhermes_fr_a100_large.py
python train.py config/train_smollm3_openhermes_fr_a100_max_performance.py
python train.py config/train_smollm3_openhermes_fr_a100_multiple_passes.py
```
## What Gets Monitored
### **Training Metrics**
- Loss values (training and validation)
- Learning rate
- Gradient norms
- Training steps and epochs
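
To make the shape of a logged training record concrete, here is a small sketch that appends one JSON line per step to a local file. The `log_step` helper, the file layout, and the metric key names are hypothetical; the actual format used by `monitoring.py` may differ.

```python
import json
import time
from pathlib import Path
from typing import Any, Dict


def log_step(path: Path, step: int, metrics: Dict[str, float]) -> Dict[str, Any]:
    """Append one step's metrics as a JSON line and return the stored record."""
    record: Dict[str, Any] = {"step": step, "timestamp": time.time(), **metrics}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

A call such as `log_step(Path("metrics.jsonl"), 100, {"loss": 2.31, "learning_rate": 2e-5, "grad_norm": 0.8})` would record the metrics listed above for step 100.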
### **System Metrics**
- GPU memory usage
- GPU utilization
- CPU usage
- Memory usage
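
A minimal sketch of collecting some of these values, assuming `psutil` and PyTorch are available. Both are treated as optional here, and the function name and dictionary keys are illustrative. GPU utilization is left out because it usually needs an extra NVML binding.

```python
import time
from typing import Any, Dict


def collect_system_metrics() -> Dict[str, Any]:
    """Collect whatever system metrics are available on this machine."""
    metrics: Dict[str, Any] = {"timestamp": time.time()}
    try:
        import psutil  # optional dependency

        metrics["cpu_percent"] = psutil.cpu_percent(interval=None)
        metrics["memory_percent"] = psutil.virtual_memory().percent
    except ImportError:
        pass  # CPU and memory metrics are skipped without psutil
    try:
        import torch  # optional dependency

        if torch.cuda.is_available():
            gib = 1024 ** 3
            metrics["gpu_memory_allocated_gb"] = torch.cuda.memory_allocated() / gib
            metrics["gpu_memory_reserved_gb"] = torch.cuda.memory_reserved() / gib
    except ImportError:
        pass  # GPU metrics are skipped without PyTorch
    return metrics
```

Missing dependencies simply yield fewer keys, which matches the guide's emphasis on graceful degradation.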
### **Experiment Data**
- Configuration parameters
- Model checkpoints
- Evaluation results
- Training summaries
### **Artifacts**
- Configuration files
- Training logs
- Evaluation results
- Model checkpoints
## Viewing Results
### **1. Trackio Interface**
- Visit your Trackio Space
- Navigate to the "Experiments" tab
- View real-time metrics and plots
### **2. HF Dataset Repository**
- Go to your dataset repository on HF Hub
- Browse experiment data
- Download experiment files
### **3. Local Files**
- Check local backup files
- Review training logs
- Examine configuration files
## Configuration Examples
### **Default Setup**
```python
# Uses default dataset: tonic/trackio-experiments
# Requires only HF_TOKEN
```
### **Personal Dataset**
```bash
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=your-username/trackio-experiments
```
### **Team Dataset**
```bash
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=your-org/team-experiments
```
### **Project-Specific Dataset**
```bash
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=your-username/smollm3-experiments
```
## Troubleshooting
### **Issue: "HF_TOKEN not found"**
```bash
# Solution: Set your HF token
export HF_TOKEN=your_token_here
# Or add to HF Space environment variables
```
### **Issue: "Failed to load dataset"**
```bash
# Solutions:
# 1. Check token has read access
# 2. Verify dataset repository exists
# 3. Run setup script: python setup_hf_dataset.py
```
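
The first two checks can also be run from Python. The sketch below assumes `huggingface_hub` is installed and uses its `HfApi.whoami` and `HfApi.repo_info` methods; the helper name `check_dataset_access` and its messages are illustrative only.

```python
from typing import Tuple


def check_dataset_access(repo_id: str, token: str) -> Tuple[bool, str]:
    """Return (ok, message) describing whether the token can see the dataset repo."""
    from huggingface_hub import HfApi  # lazy import; requires huggingface_hub

    api = HfApi(token=token)
    try:
        # Fails if the token is missing, malformed, or revoked.
        user = api.whoami()["name"]
    except Exception as exc:
        return False, f"Token check failed: {exc}"
    try:
        # Fails if the repository does not exist or the token cannot read it.
        api.repo_info(repo_id, repo_type="dataset")
    except Exception as exc:
        return False, f"Cannot access dataset {repo_id!r} as {user}: {exc}"
    return True, f"{user} can access {repo_id}"
```

A failed second check usually points to a wrong repository name or a private repository the token cannot read.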
### **Issue: "Failed to save experiments"**
```bash
# Solutions:
# 1. Check token has write permissions
# 2. Verify dataset repository exists
# 3. Check network connectivity
```
### **Issue: "Monitoring not working"**
```bash
# Solutions:
# 1. Check environment variables
# 2. Run configuration test: python configure_trackio.py
# 3. Check logs for specific errors
```
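
For the environment-variable check, a quick offline sanity test can catch the most common mistakes before any network call. This is a hypothetical helper, not what `configure_trackio.py` does; the `owner/name` pattern is a simplified approximation of a valid repository ID.

```python
import os
import re
from typing import List, Mapping

# Simplified approximation of an "owner/name" repository ID.
REPO_ID_PATTERN = re.compile(r"^[\w.-]+/[\w.-]+$")


def diagnose(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of human-readable configuration problems (empty if none)."""
    problems: List[str] = []
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN not found: export it or add it to the Space settings")
    repo = env.get("TRACKIO_DATASET_REPO", "tonic/trackio-experiments")
    if not REPO_ID_PATTERN.match(repo):
        problems.append(
            f"TRACKIO_DATASET_REPO should look like 'owner/name', got {repo!r}"
        )
    return problems
```

An empty list means the variables look plausible; it does not prove that the token is valid or that the repository exists.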
## Benefits
### **For HF Spaces Deployment**
- **Persistent Storage**: Data survives Space restarts
- **No Local Storage**: No dependency on ephemeral storage
- **Scalable**: Works with any dataset size
- **Secure**: Private dataset storage
### **For Experiment Management**
- **Centralized**: All experiments in one place
- **Searchable**: Easy to find specific experiments
- **Versioned**: Dataset versioning for experiments
- **Collaborative**: Share experiments with team
### **For Development**
- **Flexible**: Easy to switch between datasets
- **Configurable**: Environment-based configuration
- **Robust**: Fallback mechanisms
- **Debuggable**: Comprehensive logging
## Next Steps
1. **Set up your HF token and dataset repository**
2. **Test the configuration with `python configure_trackio.py`**
3. **Run a training experiment to verify monitoring**
4. **Check your HF Dataset repository for experiment data**
5. **View results in your Trackio interface**
## Related Files
- `monitoring.py` - Enhanced monitoring with HF Datasets support
- `train.py` - Updated training script with monitoring integration
- `configure_trackio.py` - Configuration and testing script
- `setup_hf_dataset.py` - Dataset repository setup
- `test_hf_datasets.py` - Dataset access testing
- `ENVIRONMENT_VARIABLES.md` - Environment variable reference
- `HF_DATASETS_GUIDE.md` - Detailed HF Datasets guide
---
**Your experiments are now persistently stored and easily accessible!**