🚀 Push to Hugging Face Script Guide

Overview

The push_to_huggingface.py script now integrates with HF Datasets for experiment tracking: it deploys the model to the Hugging Face Hub and stores the associated experiment data persistently in a dataset repository.

🚀 Key Improvements

1. HF Datasets Integration

  • ✅ Dataset Repository Support: Configurable dataset repository for experiment storage
  • ✅ Environment Variables: Automatic detection of HF_TOKEN and TRACKIO_DATASET_REPO
  • ✅ Enhanced Logging: Logs push actions to both Trackio and HF Datasets
  • ✅ Model Card Integration: Includes dataset repository information in model cards

2. Enhanced Configuration

  • ✅ Flexible Token Input: Multiple ways to provide HF token
  • ✅ Dataset Repository Tracking: Links models to their experiment datasets
  • ✅ Environment Variable Support: Fallback to environment variables
  • ✅ Command Line Arguments: New arguments for HF Datasets integration

3. Improved Model Cards

  • ✅ Dataset Repository Info: Shows which dataset contains experiment data
  • ✅ Experiment Tracking Section: Explains how to access training data
  • ✅ Enhanced Documentation: Better model cards with experiment links

📋 Usage Examples

Basic Usage

# Push model with default settings
python push_to_huggingface.py /path/to/model username/repo-name

With HF Datasets Integration

# Push model with custom dataset repository
python push_to_huggingface.py /path/to/model username/repo-name \
  --dataset-repo username/experiments

With Custom Token

# Push model with custom HF token
python push_to_huggingface.py /path/to/model username/repo-name \
  --hf-token your_token_here

Complete Example

# Push model with all options
python push_to_huggingface.py /path/to/model username/repo-name \
  --dataset-repo username/experiments \
  --hf-token your_token_here \
  --private \
  --experiment-name "smollm3_finetune_v2"

🔧 Command Line Arguments

| Argument | Required | Default | Description |
|---|---|---|---|
| model_path | ✅ Yes | None | Path to trained model directory |
| repo_name | ✅ Yes | None | HF repository name (username/repo-name) |
| --token | ❌ No | HF_TOKEN env | Hugging Face token |
| --hf-token | ❌ No | HF_TOKEN env | HF token (alternative to --token) |
| --private | ❌ No | False | Make repository private |
| --trackio-url | ❌ No | None | Trackio Space URL for logging |
| --experiment-name | ❌ No | None | Experiment name for Trackio |
| --dataset-repo | ❌ No | TRACKIO_DATASET_REPO env | HF Dataset repository |
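
To make the argument surface concrete, here is a minimal argparse sketch that mirrors the table. It is an illustration, not the script's actual parser: the function name `build_parser` is hypothetical, the environment-variable defaults are left out (every optional value defaults to None here), and the rule that `--hf-token` wins over `--token` when both are given is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the argument table (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Push a trained model to the Hugging Face Hub"
    )
    # Positional, required arguments
    parser.add_argument("model_path", help="Path to trained model directory")
    parser.add_argument("repo_name", help="HF repository name (username/repo-name)")
    # Optional arguments; environment fallbacks are intentionally not modelled here
    parser.add_argument("--token", default=None, help="Hugging Face token")
    parser.add_argument("--hf-token", default=None, help="HF token (alternative to --token)")
    parser.add_argument("--private", action="store_true", help="Make repository private")
    parser.add_argument("--trackio-url", default=None, help="Trackio Space URL for logging")
    parser.add_argument("--experiment-name", default=None, help="Experiment name for Trackio")
    parser.add_argument("--dataset-repo", default=None, help="HF Dataset repository")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["outputs/model", "username/repo-name", "--private",
     "--dataset-repo", "username/experiments"]
)

# Assumption: --hf-token takes priority if both token flags are supplied.
token = args.hf_token or args.token

print(args.model_path, args.repo_name, args.private, args.dataset_repo, token)
```

Note that argparse converts dashes in option names to underscores, so `--dataset-repo` is read back as `args.dataset_repo`.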

🛠️ Configuration Methods

Method 1: Command Line Arguments

python push_to_huggingface.py model_path repo_name \
  --dataset-repo username/experiments \
  --hf-token your_token_here

Method 2: Environment Variables

export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=username/experiments
python push_to_huggingface.py model_path repo_name

Method 3: Hybrid Approach

# Set defaults via environment variables
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=username/experiments

# Override specific values via command line
python push_to_huggingface.py model_path repo_name \
  --dataset-repo username/specific-experiments
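
The precedence implied by the hybrid approach (a command line value overrides the environment variable, which in turn is the fallback) can be sketched as below. The helper name `resolve_setting` is hypothetical and the real script may resolve values differently; the environment is passed in as a plain mapping so the behaviour is easy to check.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    cli_value: Optional[str],
    env_name: str,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the CLI value if given, else the environment variable, else None."""
    if cli_value:
        return cli_value
    # Treat an empty environment variable the same as an unset one.
    return environ.get(env_name) or None


# Placeholder environment matching the export lines in this guide.
env = {
    "HF_TOKEN": "your_token_here",
    "TRACKIO_DATASET_REPO": "username/experiments",
}

# The command line value overrides the environment default.
dataset_repo = resolve_setting("username/specific-experiments", "TRACKIO_DATASET_REPO", env)
# No command line value was given, so the environment variable is used.
token = resolve_setting(None, "HF_TOKEN", env)

print(dataset_repo)
print(token)
```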

📊 What Gets Pushed

Model Files

  • ✅ Model Weights: pytorch_model.bin
  • ✅ Configuration: config.json
  • ✅ Tokenizer: tokenizer.json, tokenizer_config.json
  • ✅ All Other Files: Any additional files in model directory

Documentation

  • ✅ Model Card: Comprehensive README.md with model information
  • ✅ Training Configuration: JSON configuration used for training
  • ✅ Training Results: JSON results and metrics
  • ✅ Training Logs: Text logs from training process

Experiment Data

  • ✅ Dataset Repository: Links to HF Dataset containing experiment data
  • ✅ Training Metrics: All training metrics stored in dataset
  • ✅ Configuration: Training configuration stored in dataset
  • ✅ Artifacts: Training artifacts and logs

🔍 Enhanced Model Cards

The improved script creates enhanced model cards that include:

Model Information

  • Base model and architecture
  • Training date and model size
  • Dataset repository for experiment data

Training Configuration

  • Complete training parameters
  • Hardware information
  • Training duration and steps

Experiment Tracking

  • Links to HF Dataset repository
  • Instructions for accessing experiment data
  • Training metrics and results
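
As a rough illustration of how an experiment tracking section could be rendered into a model card, the sketch below builds a small Markdown fragment. The function name, the section title, and the wording are hypothetical; only the dataset URL pattern (huggingface.co/datasets/username/experiments) is taken from this guide.

```python
def experiment_tracking_section(dataset_repo: str, experiment_name: str = "") -> str:
    """Render a Markdown fragment that links a model card to its experiment data."""
    lines = [
        "## Experiment Tracking",
        "",
        f"Experiment data is stored in the HF Dataset repository `{dataset_repo}`:",
        f"https://huggingface.co/datasets/{dataset_repo}",
    ]
    if experiment_name:
        # Only mention the experiment when a name was provided.
        lines.append("")
        lines.append(f"Experiment name: `{experiment_name}`")
    return "\n".join(lines) + "\n"


section = experiment_tracking_section("username/experiments", "smollm3_finetune_v2")
print(section)
```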

Usage Examples

  • Code examples for loading and using the model
  • Generation examples
  • Performance information

📈 Logging Integration

Trackio Logging

  • ✅ Push Actions: Logs model push events
  • ✅ Model Information: Repository name, size, configuration
  • ✅ Training Data: Links to experiment dataset

HF Datasets Logging

  • ✅ Experiment Summary: Final training summary
  • ✅ Push Metadata: Model repository and push date
  • ✅ Configuration: Complete training configuration

Dual Storage

  • ✅ Trackio: Real-time monitoring and visualization
  • ✅ HF Datasets: Persistent experiment storage
  • ✅ Synchronized: Both systems updated together
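
One way to keep the two stores consistent is to build a single push record and send the same record to both. The sketch below covers only the record itself; the function name, the field names, and the sample configuration value are all hypothetical, and the actual upload to Trackio or HF Datasets is not shown.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_push_record(
    model_repo: str,
    dataset_repo: str,
    experiment_name: Optional[str] = None,
    config: Optional[Dict[str, Any]] = None,
    pushed_at: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Assemble push metadata: model repository, push date, and configuration."""
    when = pushed_at or datetime.now(timezone.utc)
    return {
        "event": "model_push",
        "model_repo": model_repo,
        "dataset_repo": dataset_repo,
        "experiment_name": experiment_name,
        "pushed_at": when.isoformat(),
        # Serialize the config so the record is flat and easy to store as a row.
        "config": json.dumps(config or {}, sort_keys=True),
    }


record = build_push_record(
    "username/smollm3-french",
    "username/experiments",
    experiment_name="smollm3_french_v2",
    config={"max_steps": 1000},  # placeholder value, not from a real run
    pushed_at=datetime(2025, 1, 1, tzinfo=timezone.utc),  # fixed date for a stable example
)
print(json.dumps(record, indent=2))
```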

🚨 Troubleshooting

Issue: "Missing required files"

Solutions:

  1. Check model directory contains required files
  2. Ensure model was saved correctly during training
  3. Verify file permissions
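
A quick pre-flight check can show which files are missing before the push is attempted. This is a sketch under one assumption: the required file list is taken from the "Model Files" section of this guide, and the script itself may require a different set. The helper name `find_missing_files` is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List

# Assumption: taken from the "Model Files" section; the real script may differ.
ASSUMED_REQUIRED_FILES = (
    "config.json",
    "pytorch_model.bin",
    "tokenizer.json",
    "tokenizer_config.json",
)


def find_missing_files(
    model_dir: str, required: Iterable[str] = ASSUMED_REQUIRED_FILES
) -> List[str]:
    """Return the names of required files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


# Demonstrate on a temporary directory that contains only config.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    missing = find_missing_files(tmp)

print(missing)  # ['pytorch_model.bin', 'tokenizer.json', 'tokenizer_config.json']
```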

Issue: "Failed to create repository"

Solutions:

  1. Check HF token has write permissions
  2. Verify repository name format: username/repo-name
  3. Ensure the repository doesn't already exist under that name (note that --private only makes the repository private; it does not affect name conflicts)
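
The repository name format can be checked locally before calling the script. The pattern below is a simplified approximation of the username/repo-name form and is an assumption: the Hub applies further rules (for example on length) that are not checked here, and `is_valid_repo_name` is a hypothetical helper.

```python
import re

# Simplified pattern: "<owner>/<name>" made of letters, digits, '.', '_' and '-',
# each part starting with a letter or digit. The Hub enforces additional rules.
_REPO_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*")


def is_valid_repo_name(repo_name: str) -> bool:
    """Return True if repo_name has the form username/repo-name."""
    return _REPO_NAME.fullmatch(repo_name) is not None


for candidate in ("username/repo-name", "repo-name", "user/name/extra"):
    print(candidate, is_valid_repo_name(candidate))
```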

Issue: "Failed to upload files"

Solutions:

  1. Check network connectivity
  2. Verify HF token is valid
  3. Ensure repository was created successfully

Issue: "Dataset repository not found"

Solutions:

  1. Check dataset repository exists
  2. Verify the HF token can access the dataset repository (logging push data to it requires write access)
  3. Use --dataset-repo to specify correct repository

📋 Workflow Integration

Complete Training Workflow

  1. Train Model: Use training scripts with monitoring
  2. Monitor Progress: View metrics in Trackio interface
  3. Push Model: Use improved push script
  4. Access Data: View experiments in HF Dataset repository

Example Workflow

# 1. Train model with monitoring
python train.py config/train_smollm3_openhermes_fr.py \
  --experiment_name "smollm3_french_v2"

# 2. Push model to HF Hub
python push_to_huggingface.py outputs/model username/smollm3-french \
  --dataset-repo username/experiments \
  --experiment-name "smollm3_french_v2"

# 3. View results
# - Model: https://huggingface.co/username/smollm3-french
# - Experiments: https://huggingface.co/datasets/username/experiments
# - Trackio: Your Trackio Space interface
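
If the push step is triggered from a Python training pipeline rather than a shell, the command can be assembled as an argument list. This is a minimal sketch: `build_push_command` is a hypothetical helper, it uses only flags listed in this guide, and it assumes push_to_huggingface.py is in the current working directory.

```python
import shlex
import sys
from typing import List, Optional


def build_push_command(
    model_path: str,
    repo_name: str,
    dataset_repo: Optional[str] = None,
    experiment_name: Optional[str] = None,
    private: bool = False,
) -> List[str]:
    """Build the argv list for invoking push_to_huggingface.py."""
    cmd = [sys.executable, "push_to_huggingface.py", model_path, repo_name]
    if dataset_repo:
        cmd += ["--dataset-repo", dataset_repo]
    if experiment_name:
        cmd += ["--experiment-name", experiment_name]
    if private:
        cmd.append("--private")
    return cmd


cmd = build_push_command(
    "outputs/model",
    "username/smollm3-french",
    dataset_repo="username/experiments",
    experiment_name="smollm3_french_v2",
)

# Print a shell-quoted form for logging; to actually run it, pass the list to
# subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Passing the command as a list avoids shell quoting issues with experiment names that contain spaces.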

🎯 Benefits

For Model Deployment

  • βœ… Complete Documentation: Enhanced model cards with experiment links
  • βœ… Persistent Storage: Experiment data stored in HF Datasets
  • βœ… Easy Access: Direct links to training data and metrics
  • βœ… Reproducibility: Complete training configuration included

For Experiment Management

  • βœ… Centralized Storage: All experiments in HF Dataset repository
  • βœ… Version Control: Model versions linked to experiment data
  • βœ… Collaboration: Share experiments and models easily
  • βœ… Searchability: Easy to find specific experiments

For Development

  • βœ… Flexible Configuration: Multiple ways to set parameters
  • βœ… Backward Compatible: Works with existing setups
  • βœ… Error Handling: Clear error messages and troubleshooting
  • βœ… Integration: Works with existing monitoring system

📊 Testing Results

All push script tests passed:

  • ✅ HuggingFacePusher Initialization: Works with new parameters
  • ✅ Model Card Creation: Includes HF Datasets integration
  • ✅ Logging Integration: Logs to both Trackio and HF Datasets
  • ✅ Argument Parsing: Handles new command line arguments
  • ✅ Environment Variables: Proper fallback handling

🔄 Migration Guide

From Old Script

# Old way
python push_to_huggingface.py model_path repo_name --token your_token

# New way (same functionality)
python push_to_huggingface.py model_path repo_name --hf-token your_token

# New way with HF Datasets
python push_to_huggingface.py model_path repo_name \
  --hf-token your_token \
  --dataset-repo username/experiments

Environment Variables

# Set environment variables for automatic detection
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=username/experiments

# Then use simple command
python push_to_huggingface.py model_path repo_name

🎉 Your push script is now fully integrated with HF Datasets for complete experiment tracking and model deployment!