---
title: SmolFactory
emoji: 🏭
colorFrom: blue
colorTo: pink
sdk: gradio
sdk_version: 5.42.0
app_file: interface.py
pinned: false
short_description: SmolFactory is an e2e model maker
---

🤗 Hugging Face | 🤖 demo | 📑 Blog | 🖥️ Model Monitoring | Join us on Discord | Dataset

# 🤏🏻🏭 SmolFactory

SmolFactory helps you train, monitor and deploy your SmolLM3 and GPT-OSS fine-tunes, and more!


Train and deploy your model with one simple command!

## 🤖 Automatically Push Models, Spaces, Datasets & Monitoring

- **Automatic Deployment**: Spaces created and configured automatically during the pipeline
- **Trackio Monitoring Space**: Real-time training metrics, loss curves, and resource utilization
- **Demo Spaces**: Instant web interfaces for model testing and demonstration
- **Real-time Metrics**: Live training loss, learning rate, gradient norms, and GPU utilization
- **Custom Dashboards**: Tailored visualizations for SmolLM3 and GPT-OSS fine-tuning
- **Artifact Logging**: Model checkpoints, configuration files, and training logs
- **Experiment Comparison**: Side-by-side analysis of different training runs
- **Alert System**: Notifications for training issues or completion
- **Integration**: Seamless connection with HF Spaces for public monitoring
- **Experiment Tracking**: All training data, metrics, and artifacts stored in HF Datasets
- **Reproducibility**: Complete experiment history with configuration snapshots
- **Collaboration**: Easy sharing of training results and model comparisons
- **Version Control**: Track dataset changes and model performance over time
- **GPT-OSS Support**: Specialized configurations for OpenAI's GPT-OSS-20B model with LoRA and multilingual reasoning

## 🚀 Quick Start

### Interactive Pipeline (Recommended)

The easiest way to get started is using the interactive pipeline:

```bash
./launch.sh
```

This script will:

  1. Authenticate with Hugging Face (write + read tokens)
  2. Configure training parameters interactively (SmolLM3 or GPT-OSS)
  3. Deploy Trackio Space for monitoring
  4. Setup HF Dataset for experiment tracking
  5. Execute training with your chosen configuration
  6. Push model to HF Hub with comprehensive documentation
  7. Deploy demo space for testing (optional)

### Manual Setup

For advanced users who want to customize the pipeline:

```bash
# 1. Install dependencies
pip install -r requirements/requirements_core.txt

# 2. Configure your training
python scripts/training/train.py \
    --config config/train_smollm3_h100_lightweight.py \
    --experiment-name "my-experiment" \
    --output-dir ./outputs \
    --trackio-url "https://huggingface.co/spaces/username/trackio-monitoring"

# 3. Push model to HF Hub
python scripts/model_tonic/push_to_huggingface.py \
    ./outputs username/model-name \
    --token YOUR_HF_TOKEN
```

πŸ—οΈ Repository Architecture

```mermaid
graph LR
    Entry_Point["Entry Point"]
    Configuration_Management["Configuration Management"]
    Data_Pipeline["Data Pipeline"]
    Model_Abstraction["Model Abstraction"]
    Training_Orchestrator["Training Orchestrator"]
    Entry_Point -- "Initializes and Uses" --> Configuration_Management
    Entry_Point -- "Initializes" --> Data_Pipeline
    Entry_Point -- "Initializes" --> Model_Abstraction
    Entry_Point -- "Initializes and Invokes" --> Training_Orchestrator
    Configuration_Management -- "Provides Configuration To" --> Model_Abstraction
    Configuration_Management -- "Provides Configuration To" --> Data_Pipeline
    Configuration_Management -- "Provides Configuration To" --> Training_Orchestrator
    Data_Pipeline -- "Provides Data To" --> Training_Orchestrator
    Model_Abstraction -- "Provides Model To" --> Training_Orchestrator
    click Entry_Point href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Entry_Point.md" "Details"
    click Configuration_Management href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Configuration_Management.md" "Details"
    click Data_Pipeline href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Data_Pipeline.md" "Details"
    click Model_Abstraction href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Model_Abstraction.md" "Details"
    click Training_Orchestrator href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Training_Orchestrator.md" "Details"
```

## 🔧 Core Components

### Configuration System (`config/`)

All training configurations inherit from `SmolLM3Config`:

```python
# config/my_config.py
from config.train_smollm3 import SmolLM3Config

config = SmolLM3Config(
    model_name="HuggingFaceTB/SmolLM3-3B",
    max_seq_length=8192,
    batch_size=8,
    learning_rate=5e-6,
    trainer_type="sft",  # or "dpo"
    enable_tracking=True,
    trackio_url="https://huggingface.co/spaces/username/trackio-monitoring"
)
```

### Dataset Processing (`src/data.py`)

The `SmolLM3Dataset` class handles multiple dataset formats:

```python
from src.data import SmolLM3Dataset

# Supports multiple formats:
# 1. Chat format (recommended)
# 2. Instruction format
# 3. User-Assistant format
# 4. Hugging Face datasets

dataset = SmolLM3Dataset(
    data_path="my_dataset",
    tokenizer=tokenizer,
    max_seq_length=4096,
    use_chat_template=True,
    sample_size=80000  # For lightweight training
)
```
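
For orientation, here is what a single chat-format record might look like. This is an illustrative sketch assuming the common Hugging Face `messages` convention; check `src/data.py` for the exact fields SmolFactory expects:

```python
# Illustrative chat-format record (assumed messages-style schema).
example_record = {
    "messages": [
        {"role": "user", "content": "What is SmolLM3?"},
        {"role": "assistant", "content": "SmolLM3 is a 3B-parameter language model from Hugging Face."},
    ]
}
```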

### Training Orchestration (`src/train.py`)

The main training script coordinates all components:

```python
from src.train import main
from src.model import SmolLM3Model
from src.trainer import SmolLM3Trainer, SmolLM3DPOTrainer

# SFT Training
trainer = SmolLM3Trainer(
    model=model,
    dataset=dataset,
    config=config,
    output_dir="./outputs"
)

# DPO Training
dpo_trainer = SmolLM3DPOTrainer(
    model=model,
    dataset=dataset,
    config=config,
    output_dir="./dpo-outputs"
)
```
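
With a trainer constructed, training is kicked off in the usual way. A minimal sketch, assuming `SmolLM3Trainer` exposes a `train()` entry point in the style of Hugging Face trainers (verify against `src/trainer.py`):

```python
# Assumed entry point; verify against src/trainer.py.
trainer.train()
```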

## 🎯 Training Types

### Supervised Fine-tuning (SFT)

Standard instruction tuning for improving model capabilities:

```bash
python scripts/training/train.py \
    --config config/train_smollm3.py \
    --trainer-type sft \
    --experiment-name "sft-experiment"
```

### Direct Preference Optimization (DPO)

Preference-based training for alignment:

```bash
python scripts/training/train.py \
    --config config/train_smollm3_dpo.py \
    --trainer-type dpo \
    --experiment-name "dpo-experiment"
```
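
Unlike SFT, DPO trains on preference pairs rather than plain instructions. A minimal sketch of one record, assuming the conventional `prompt`/`chosen`/`rejected` fields used by TRL-style DPO trainers (not a confirmed SmolFactory schema; check `src/data.py`):

```python
# Illustrative DPO preference record (assumed prompt/chosen/rejected schema).
preference_record = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "Sunlight scatters off air molecules, and shorter blue wavelengths scatter the most.",
    "rejected": "Because it reflects the ocean.",
}
```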

## 📊 Monitoring & Tracking

### Trackio Integration

The pipeline includes comprehensive monitoring:

```python
from src.monitoring import create_monitor_from_config

monitor = create_monitor_from_config(config)
monitor.log_metrics({
    "train_loss": loss,
    "learning_rate": lr,
    "gradient_norm": grad_norm
})
```
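
Inside a training loop, the same call can be made periodically. A hypothetical sketch, where `dataloader` and `compute_loss` are placeholders and only `monitor.log_metrics` comes from the snippet above:

```python
# Hypothetical loop: dataloader and compute_loss are placeholders.
for step, batch in enumerate(dataloader):
    loss = compute_loss(batch)
    if step % 10 == 0:  # throttle logging to every 10 steps
        monitor.log_metrics({"train_loss": float(loss), "step": step})
```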

### HF Dataset Integration

Experiment data is automatically saved to HF Datasets:

```python
# Automatically configured in launch.sh
dataset_repo = "username/trackio-experiments"
```
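
Because the experiment store is a regular dataset repository, logged runs can be pulled back with the standard `datasets` library (the repo name is the placeholder from above):

```python
from datasets import load_dataset

# Pull logged experiment data back from the Hub for offline analysis.
experiments = load_dataset("username/trackio-experiments")
print(experiments)
```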

## 🔄 Model Management

### Pushing to HF Hub

```bash
python scripts/model_tonic/push_to_huggingface.py \
    ./outputs username/model-name \
    --token YOUR_HF_TOKEN \
    --trackio-url "https://huggingface.co/spaces/username/trackio-monitoring" \
    --experiment-name "my-experiment"
```

### Model Quantization

Create optimized versions for deployment:

```bash
# Quantize and push to HF Hub
python scripts/model_tonic/quantize_standalone.py \
    ./outputs username/model-name \
    --quant-type int8_weight_only \
    --token YOUR_HF_TOKEN

# Quantize for CPU deployment
python scripts/model_tonic/quantize_standalone.py \
    ./outputs username/model-name \
    --quant-type int4_weight_only \
    --device cpu \
    --save-only
```

## 🛠️ Customization Guide

### Adding New Training Configurations

1. Create a new config file in `config/`:

   ```python
   # config/train_smollm3_custom.py
   from config.train_smollm3 import SmolLM3Config

   config = SmolLM3Config(
       model_name="HuggingFaceTB/SmolLM3-3B-Instruct",
       max_seq_length=16384,
       batch_size=4,
       learning_rate=1e-5,
       max_iters=2000,
       trainer_type="sft"
   )
   ```

2. Add the new config to the mapping in `scripts/training/train.py`:

   ```python
   config_map = {
       # ... existing configs ...
       "config/train_smollm3_custom.py": get_custom_config,
   }
   ```
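
The mapping above points at an accessor function. A minimal sketch of what `get_custom_config` could look like, assuming `config_map` expects a zero-argument callable that returns the config object (verify the exact contract in `scripts/training/train.py`):

```python
# config/train_smollm3_custom.py (continued)
# Hypothetical accessor; assumes config_map expects a zero-argument callable.
def get_custom_config():
    return config
```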

### Custom Dataset Formats

Extend `src/data.py` to support new formats:

```python
def _load_custom_format(self, data_path: str) -> Dataset:
    """Load custom dataset format"""
    # Your custom loading logic here
    pass
```
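
As one possible implementation, a minimal sketch for a JSON Lines file (an assumed format; adapt the parsing to your actual data):

```python
import json
from datasets import Dataset

def _load_custom_format(self, data_path: str) -> Dataset:
    """Illustrative loader: expects one JSON object per line."""
    with open(data_path, "r", encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    return Dataset.from_list(records)
```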

### Custom Training Loops

Extend `src/trainer.py` for specialized training:

```python
class SmolLM3CustomTrainer(SmolLM3Trainer):
    def training_step(self, batch):
        # Custom training logic
        pass
```
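
For example, a hypothetical subclass could wrap the parent step with extra logging. This assumes `training_step` returns the batch loss; verify the actual signature in `src/trainer.py`:

```python
import logging

logger = logging.getLogger(__name__)

class SmolLM3LoggingTrainer(SmolLM3Trainer):
    """Hypothetical trainer that logs the loss of every step."""

    def training_step(self, batch):
        loss = super().training_step(batch)  # assumed to return the loss
        logger.debug("step loss: %s", loss)
        return loss
```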

## 🔧 Development & Contributing

### Project Structure

- `src/`: Core training modules
- `config/`: Training configurations
- `scripts/`: Utility scripts and automation
- `docs/`: Comprehensive documentation
- `tests/`: Test files and debugging tools

### Adding New Features

1. Configuration: Add to the `config/` directory
2. Core Logic: Extend modules in `src/`
3. Scripts: Add utility scripts to `scripts/`
4. Documentation: Update relevant docs in `docs/`
5. Tests: Add test files to `tests/`

### Testing Your Changes

```bash
# Run basic tests
python tests/test_config.py
python tests/test_dataset.py
python tests/test_training.py

# Test specific components
python tests/test_monitoring.py
python tests/test_model_push.py
```

### Code Style

- Follow PEP 8 for Python code
- Use type hints for all functions
- Add comprehensive docstrings
- Include error handling for external APIs
- Use structured logging with consistent field names

## 🚨 Troubleshooting

### Common Issues

1. Out of Memory (OOM)

   ```python
   # Reduce the per-device batch size in your config
   batch_size=2  # instead of 8
   gradient_accumulation_steps=16  # raise this so the effective batch size (batch_size x accumulation steps) stays comparable
   ```

2. Token Validation Errors

   ```bash
   # Validate your HF token
   python scripts/validate_hf_token.py YOUR_TOKEN
   ```

3. Dataset Loading Issues

   ```bash
   # Check dataset format
   python tests/test_dataset_loading.py
   ```

### Debug Mode

Enable detailed logging:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
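
To keep the output focused, you can also raise verbosity for a single module. The sketch below assumes the monitoring module's logger follows the standard `logging.getLogger(__name__)` convention and is therefore named `src.monitoring`:

```python
import logging

# Keep the root logger at INFO, but debug the monitoring module.
logging.basicConfig(level=logging.INFO)
logging.getLogger("src.monitoring").setLevel(logging.DEBUG)
```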

## 🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes following the code style
  4. Add tests for new functionality
  5. Update documentation
  6. Submit a pull request

## 📄 License

This project follows the same license as the SmolLM3 model. Please refer to the Hugging Face model page for licensing information.

## 🔗 Resources