Spaces:

Tonic
/

SmolFactory

Running

File size: 9,106 Bytes

ebe598e

# 🚀 Push to Hugging Face Script Guide

## Overview

The `push_to_huggingface.py` script has been enhanced to integrate with **HF Datasets** for experiment tracking and provides complete model deployment with persistent experiment storage.

## 🚀 Key Improvements

### **1. HF Datasets Integration**
- ✅ **Dataset Repository Support**: Configurable dataset repository for experiment storage
- ✅ **Environment Variables**: Automatic detection of `HF_TOKEN` and `TRACKIO_DATASET_REPO`
- ✅ **Enhanced Logging**: Logs push actions to both Trackio and HF Datasets
- ✅ **Model Card Integration**: Includes dataset repository information in model cards

### **2. Enhanced Configuration**
- ✅ **Flexible Token Input**: Multiple ways to provide HF token
- ✅ **Dataset Repository Tracking**: Links models to their experiment datasets
- ✅ **Environment Variable Support**: Fallback to environment variables
- ✅ **Command Line Arguments**: New arguments for HF Datasets integration

### **3. Improved Model Cards**
- ✅ **Dataset Repository Info**: Shows which dataset contains experiment data
- ✅ **Experiment Tracking Section**: Explains how to access training data
- ✅ **Enhanced Documentation**: Better model cards with experiment links

## 📋 Usage Examples

### **Basic Usage**
```bash
# Push model with default settings
python push_to_huggingface.py /path/to/model username/repo-name
```

### **With HF Datasets Integration**
```bash
# Push model with custom dataset repository
python push_to_huggingface.py /path/to/model username/repo-name \
  --dataset-repo username/experiments
```

### **With Custom Token**
```bash
# Push model with custom HF token
python push_to_huggingface.py /path/to/model username/repo-name \
  --hf-token your_token_here
```

### **Complete Example**
```bash
# Push model with all options
python push_to_huggingface.py /path/to/model username/repo-name \
  --dataset-repo username/experiments \
  --hf-token your_token_here \
  --private \
  --experiment-name "smollm3_finetune_v2"
```

## 🔧 Command Line Arguments

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `model_path` | ✅ Yes | None | Path to trained model directory |
| `repo_name` | ✅ Yes | None | HF repository name (username/repo-name) |
| `--token` | ❌ No | `HF_TOKEN` env | Hugging Face token |
| `--hf-token` | ❌ No | `HF_TOKEN` env | HF token (alternative to --token) |
| `--private` | ❌ No | False | Make repository private |
| `--trackio-url` | ❌ No | None | Trackio Space URL for logging |
| `--experiment-name` | ❌ No | None | Experiment name for Trackio |
| `--dataset-repo` | ❌ No | `TRACKIO_DATASET_REPO` env | HF Dataset repository |

## 🛠️ Configuration Methods

### **Method 1: Command Line Arguments**
```bash
python push_to_huggingface.py model_path repo_name \
  --dataset-repo username/experiments \
  --hf-token your_token_here
```

### **Method 2: Environment Variables**
```bash
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=username/experiments
python push_to_huggingface.py model_path repo_name
```

### **Method 3: Hybrid Approach**
```bash
# Set defaults via environment variables
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=username/experiments

# Override specific values via command line
python push_to_huggingface.py model_path repo_name \
  --dataset-repo username/specific-experiments
```

## 📊 What Gets Pushed

### **Model Files**
- ✅ **Model Weights**: `pytorch_model.bin`
- ✅ **Configuration**: `config.json`
- ✅ **Tokenizer**: `tokenizer.json`, `tokenizer_config.json`
- ✅ **All Other Files**: Any additional files in model directory

### **Documentation**
- ✅ **Model Card**: Comprehensive README.md with model information
- ✅ **Training Configuration**: JSON configuration used for training
- ✅ **Training Results**: JSON results and metrics
- ✅ **Training Logs**: Text logs from training process

### **Experiment Data**
- ✅ **Dataset Repository**: Links to HF Dataset containing experiment data
- ✅ **Training Metrics**: All training metrics stored in dataset
- ✅ **Configuration**: Training configuration stored in dataset
- ✅ **Artifacts**: Training artifacts and logs

## 🔍 Enhanced Model Cards

The improved script creates enhanced model cards that include:

### **Model Information**
- Base model and architecture
- Training date and model size
- **Dataset repository** for experiment data

### **Training Configuration**
- Complete training parameters
- Hardware information
- Training duration and steps

### **Experiment Tracking**
- Links to HF Dataset repository
- Instructions for accessing experiment data
- Training metrics and results

### **Usage Examples**
- Code examples for loading and using the model
- Generation examples
- Performance information

## 📈 Logging Integration

### **Trackio Logging**
- ✅ **Push Actions**: Logs model push events
- ✅ **Model Information**: Repository name, size, configuration
- ✅ **Training Data**: Links to experiment dataset

### **HF Datasets Logging**
- ✅ **Experiment Summary**: Final training summary
- ✅ **Push Metadata**: Model repository and push date
- ✅ **Configuration**: Complete training configuration

### **Dual Storage**
- ✅ **Trackio**: Real-time monitoring and visualization
- ✅ **HF Datasets**: Persistent experiment storage
- ✅ **Synchronized**: Both systems updated together

## 🚨 Troubleshooting

### **Issue: "Missing required files"**
**Solutions**:
1. Check model directory contains required files
2. Ensure model was saved correctly during training
3. Verify file permissions

### **Issue: "Failed to create repository"**
**Solutions**:
1. Check HF token has write permissions
2. Verify repository name format: `username/repo-name`
3. Ensure repository doesn't already exist (or use `--private`)

### **Issue: "Failed to upload files"**
**Solutions**:
1. Check network connectivity
2. Verify HF token is valid
3. Ensure repository was created successfully

### **Issue: "Dataset repository not found"**
**Solutions**:
1. Check dataset repository exists
2. Verify HF token has read access
3. Use `--dataset-repo` to specify correct repository

## 📋 Workflow Integration

### **Complete Training Workflow**
1. **Train Model**: Use training scripts with monitoring
2. **Monitor Progress**: View metrics in Trackio interface
3. **Push Model**: Use improved push script
4. **Access Data**: View experiments in HF Dataset repository

### **Example Workflow**
```bash
# 1. Train model with monitoring
python train.py config/train_smollm3_openhermes_fr.py \
  --experiment_name "smollm3_french_v2"

# 2. Push model to HF Hub
python push_to_huggingface.py outputs/model username/smollm3-french \
  --dataset-repo username/experiments \
  --experiment-name "smollm3_french_v2"

# 3. View results
# - Model: https://huggingface.co/username/smollm3-french
# - Experiments: https://huggingface.co/datasets/username/experiments
# - Trackio: Your Trackio Space interface
```

## 🎯 Benefits

### **For Model Deployment**
- ✅ **Complete Documentation**: Enhanced model cards with experiment links
- ✅ **Persistent Storage**: Experiment data stored in HF Datasets
- ✅ **Easy Access**: Direct links to training data and metrics
- ✅ **Reproducibility**: Complete training configuration included

### **For Experiment Management**
- ✅ **Centralized Storage**: All experiments in HF Dataset repository
- ✅ **Version Control**: Model versions linked to experiment data
- ✅ **Collaboration**: Share experiments and models easily
- ✅ **Searchability**: Easy to find specific experiments

### **For Development**
- ✅ **Flexible Configuration**: Multiple ways to set parameters
- ✅ **Backward Compatible**: Works with existing setups
- ✅ **Error Handling**: Clear error messages and troubleshooting
- ✅ **Integration**: Works with existing monitoring system

## 📊 Testing Results

All push script tests passed:
- ✅ **HuggingFacePusher Initialization**: Works with new parameters
- ✅ **Model Card Creation**: Includes HF Datasets integration
- ✅ **Logging Integration**: Logs to both Trackio and HF Datasets
- ✅ **Argument Parsing**: Handles new command line arguments
- ✅ **Environment Variables**: Proper fallback handling

## 🔄 Migration Guide

### **From Old Script**
```bash
# Old way
python push_to_huggingface.py model_path repo_name --token your_token

# New way (same functionality)
python push_to_huggingface.py model_path repo_name --hf-token your_token

# New way with HF Datasets
python push_to_huggingface.py model_path repo_name \
  --hf-token your_token \
  --dataset-repo username/experiments
```

### **Environment Variables**
```bash
# Set environment variables for automatic detection
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=username/experiments

# Then use simple command
python push_to_huggingface.py model_path repo_name
```

---

**🎉 Your push script is now fully integrated with HF Datasets for complete experiment tracking and model deployment!**