File size: 12,013 Bytes
93d0081
 
 
 
 
 
7aa5e49
93d0081
 
68b0972
93d0081
 
ce0d824
78a7472
ce0d824
 
 
 
 
 
 
 
3c37508
d8dd7a1
fcf2981
d8dd7a1
26641fd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a8275b3
 
3c37508
d8dd7a1
a8275b3
3c37508
 
 
fcf2981
3c37508
 
 
 
 
 
 
 
fcf2981
d8dd7a1
3c37508
d8dd7a1
3c37508
d8dd7a1
3c37508
d8dd7a1
3c37508
 
d8dd7a1
 
3c37508
 
fcf2981
3c37508
 
 
 
 
d8dd7a1
3c37508
d8dd7a1
3c37508
d8dd7a1
3c37508
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d8dd7a1
 
 
3c37508
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a552387
 
 
d8dd7a1
 
 
3c37508
d8dd7a1
3c37508
d8dd7a1
3c37508
d8dd7a1
 
 
 
 
 
3c37508
d8dd7a1
3c37508
 
 
 
 
d8dd7a1
 
 
3c37508
d8dd7a1
3c37508
d8dd7a1
 
3c37508
 
 
 
 
 
 
 
 
 
 
 
 
 
d8dd7a1
 
 
3c37508
d8dd7a1
3c37508
d8dd7a1
 
3c37508
 
 
d8dd7a1
3c37508
 
d8dd7a1
 
 
3c37508
d8dd7a1
 
3c37508
 
 
 
 
 
 
d8dd7a1
 
3c37508
d8dd7a1
3c37508
d8dd7a1
3c37508
d8dd7a1
3c37508
 
 
 
 
 
d8dd7a1
3c37508
d8dd7a1
3c37508
d8dd7a1
3c37508
 
 
 
 
 
d8dd7a1
3c37508
d8dd7a1
3c37508
d8dd7a1
3c37508
d8dd7a1
 
3c37508
 
 
 
 
 
 
 
d8dd7a1
 
3c37508
d8dd7a1
3c37508
d8dd7a1
 
3c37508
 
40fd629
 
3c37508
40fd629
3c37508
40fd629
 
3c37508
 
 
 
 
40fd629
 
3c37508
40fd629
3c37508
40fd629
 
3c37508
 
 
40fd629
 
 
3c37508
 
 
40fd629
 
 
 
 
3c37508
 
 
 
 
40fd629
 
3c37508
 
40fd629
3c37508
 
 
 
 
 
 
40fd629
3c37508
40fd629
3c37508
 
 
 
 
 
 
40fd629
 
3c37508
40fd629
3c37508
40fd629
3c37508
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
40fd629
3c37508
40fd629
3c37508
40fd629
3c37508
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d8dd7a1
 
3c37508
 
 
 
 
 
 
 
d8dd7a1
 
3c37508
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d8dd7a1
 
3c37508
d8dd7a1
3c37508
 
 
 
 
 
d8dd7a1
3c37508
d8dd7a1
3c37508
5fe83da
3c37508
5fe83da
3c37508
 
 
7aa5e49
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
---
title: SmolFactory
emoji: 🏭
colorFrom: blue
colorTo: pink
sdk: gradio
sdk_version: 5.42.0
app_file: interface.py
pinned: false
short_description: SmolFactory is a e2e model maker
---

<p align="center">
        πŸ€— <a href="https://hf.co/spaces/Tonic/smolfactory">Hugging Face</a>&nbsp&nbsp | &nbsp&nbspπŸ€– <a href="https://huggingface.co/spaces/Tonic/Petite-LLM-3">demo</a>&nbsp&nbsp | &nbsp&nbsp πŸ“‘ <a href="https://huggingface.co/blog/Tonic/SmolFactory">Blog</a> &nbsp&nbsp | &nbsp&nbspπŸ–₯️ <a href="https://huggingface.co/Tonic/petite-elle-L-aime-3-sft">Model</a>
<br>
<a href="https://huggingface.co/spaces/Tonic/Track-Tonic">Monitoring</a>&nbsp&nbsp | &nbsp&nbsp
<a href="https://discord.gg/qdfnvSPcqP">
  <img src="https://img.shields.io/discord/1109943800132010065?label=Discord&logo=discord&style=flat-square" alt="Join us on Discord">
</a>&nbsp&nbsp |  &nbsp&nbsp<a href="https://huggingface.co/datasets/Tonic/trackio-experiments">Dataset</a> 
</p>


# 🀏🏻🏭SmolFactory

SmolFactory helps you train, monitor and deploy your SmolLM3 and GPT-OSS fine-tunes, and more!

<table>
  <tr>
    <td>
      <img width="100%" src="https://github.com/user-attachments/assets/42d5df5f-acaa-40dc-ac4a-34153b5c9675" alt="Screenshot 1" />
    </td>
    <td>
      <img width="100%" src="https://github.com/user-attachments/assets/ed3c99b3-1335-4ebd-807e-25db795e751b" alt="Screenshot 2" />
    </td>
    <td>
      <img width="100%" src="https://github.com/user-attachments/assets/c557500a-1a08-4efa-9b17-1afe1101d71a" alt="Screenshot 3" />
    </td>
  </tr>
</table>


Train and deploy your model with one simple command !

## πŸ€– Automatically Push Model, Spaces, Datasets & Monitoring

- **Automatic Deployment**: Spaces created and configured automatically during the pipeline
- **Trackio Monitoring Space**: Real-time training metrics, loss curves, and resource utilization
- **Demo Spaces**: Instant web interfaces for model testing and demonstration
- **Real-time Metrics**: Live training loss, learning rate, gradient norms, and GPU utilization
- **Custom Dashboards**: Tailored visualizations for SmolLM3 and GPT-OSS fine-tuning
- **Artifact Logging**: Model checkpoints, configuration files, and training logs
- **Experiment Comparison**: Side-by-side analysis of different training runs
- **Alert System**: Notifications for training issues or completion
- **Integration**: Seamless connection with HF Spaces for public monitoring
- **Experiment Tracking**: All training data, metrics, and artifacts stored in HF Datasets
- **Reproducibility**: Complete experiment history with configuration snapshots
- **Collaboration**: Easy sharing of training results and model comparisons
- **Version Control**: Track dataset changes and model performance over time
- **GPT-OSS Support**: Specialized configurations for OpenAI's GPT-OSS-20B model with LoRA and multilingual reasoning

## πŸš€ Quick Start

### Interactive Pipeline (Recommended)

The easiest way to get started is using the interactive pipeline:

```bash
./launch.sh
```

This script will:
1. **Authenticate** with Hugging Face (write + read tokens)
2. **Configure** training parameters interactively (SmolLM3 or GPT-OSS)
3. **Deploy** Trackio Space for monitoring
4. **Setup** HF Dataset for experiment tracking
5. **Execute** training with your chosen configuration
6. **Push** model to HF Hub with comprehensive documentation
7. **Deploy** demo space for testing (optional)

### Manual Setup

For advanced users who want to customize the pipeline:

```bash
# 1. Install dependencies
pip install -r requirements/requirements_core.txt

# 2. Configure your training
python scripts/training/train.py \
    --config config/train_smollm3_h100_lightweight.py \
    --experiment-name "my-experiment" \
    --output-dir ./outputs \
    --trackio-url "https://huggingface.co/spaces/username/trackio-monitoring"

# 3. Push model to HF Hub
python scripts/model_tonic/push_to_huggingface.py \
    ./outputs username/model-name \
    --token YOUR_HF_TOKEN
```


## πŸ—οΈ Repository Architecture

```mermaid
graph LR
    Entry_Point["Entry Point"]
    Configuration_Management["Configuration Management"]
    Data_Pipeline["Data Pipeline"]
    Model_Abstraction["Model Abstraction"]
    Training_Orchestrator["Training Orchestrator"]
    Entry_Point -- "Initializes and Uses" --> Configuration_Management
    Entry_Point -- "Initializes" --> Data_Pipeline
    Entry_Point -- "Initializes" --> Model_Abstraction
    Entry_Point -- "Initializes and Invokes" --> Training_Orchestrator
    Configuration_Management -- "Provides Configuration To" --> Model_Abstraction
    Configuration_Management -- "Provides Configuration To" --> Data_Pipeline
    Configuration_Management -- "Provides Configuration To" --> Training_Orchestrator
    Data_Pipeline -- "Provides Data To" --> Training_Orchestrator
    Model_Abstraction -- "Provides Model To" --> Training_Orchestrator
    click Entry_Point href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Entry_Point.md" "Details"
    click Configuration_Management href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Configuration_Management.md" "Details"
    click Data_Pipeline href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Data_Pipeline.md" "Details"
    click Model_Abstraction href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Model_Abstraction.md" "Details"
    click Training_Orchestrator href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Training_Orchestrator.md" "Details"
```


## πŸ”§ Core Components

### Configuration System (`config/`)

All training configurations inherit from `SmolLM3Config`:

```python
# config/my_config.py
from config.train_smollm3 import SmolLM3Config

config = SmolLM3Config(
    model_name="HuggingFaceTB/SmolLM3-3B",
    max_seq_length=8192,
    batch_size=8,
    learning_rate=5e-6,
    trainer_type="sft",  # or "dpo"
    enable_tracking=True,
    trackio_url="https://huggingface.co/spaces/username/trackio-monitoring"
)
```

### Dataset Processing (`src/data.py`)

The `SmolLM3Dataset` class handles multiple dataset formats:

```python
from src.data import SmolLM3Dataset

# Supports multiple formats:
# 1. Chat format (recommended)
# 2. Instruction format
# 3. User-Assistant format
# 4. Hugging Face datasets

dataset = SmolLM3Dataset(
    data_path="my_dataset",
    tokenizer=tokenizer,
    max_seq_length=4096,
    use_chat_template=True,
    sample_size=80000  # For lightweight training
)
```

### Training Orchestration (`src/train.py`)

The main training script coordinates all components:

```python
from src.train import main
from src.model import SmolLM3Model
from src.trainer import SmolLM3Trainer, SmolLM3DPOTrainer

# SFT Training
trainer = SmolLM3Trainer(
    model=model,
    dataset=dataset,
    config=config,
    output_dir="./outputs"
)

# DPO Training
dpo_trainer = SmolLM3DPOTrainer(
    model=model,
    dataset=dataset,
    config=config,
    output_dir="./dpo-outputs"
)
```

## 🎯 Training Types

### Supervised Fine-tuning (SFT)

Standard instruction tuning for improving model capabilities:

```bash
python scripts/training/train.py \
    --config config/train_smollm3.py \
    --trainer-type sft \
    --experiment-name "sft-experiment"
```

### Direct Preference Optimization (DPO)

Preference-based training for alignment:

```bash
python scripts/training/train.py \
    --config config/train_smollm3_dpo.py \
    --trainer-type dpo \
    --experiment-name "dpo-experiment"
```

## πŸ“Š Monitoring & Tracking

### Trackio Integration

The pipeline includes comprehensive monitoring:

```python
from src.monitoring import create_monitor_from_config

monitor = create_monitor_from_config(config)
monitor.log_metrics({
    "train_loss": loss,
    "learning_rate": lr,
    "gradient_norm": grad_norm
})
```

### HF Dataset Integration

Experiment data is automatically saved to HF Datasets:

```python
# Automatically configured in launch.sh
dataset_repo = "username/trackio-experiments"
```

## πŸ”„ Model Management

### Pushing to HF Hub

```bash
python scripts/model_tonic/push_to_huggingface.py \
    ./outputs username/model-name \
    --token YOUR_HF_TOKEN \
    --trackio-url "https://huggingface.co/spaces/username/trackio-monitoring" \
    --experiment-name "my-experiment"
```

### Model Quantization

Create optimized versions for deployment:

```bash
# Quantize and push to HF Hub
python scripts/model_tonic/quantize_standalone.py \
    ./outputs username/model-name \
    --quant-type int8_weight_only \
    --token YOUR_HF_TOKEN

# Quantize for CPU deployment
python scripts/model_tonic/quantize_standalone.py \
    ./outputs username/model-name \
    --quant-type int4_weight_only \
    --device cpu \
    --save-only
```

## πŸ› οΈ Customization Guide

### Adding New Training Configurations

1. Create a new config file in `config/`:

```python
# config/train_smollm3_custom.py
from config.train_smollm3 import SmolLM3Config

config = SmolLM3Config(
    model_name="HuggingFaceTB/SmolLM3-3B-Instruct",
    max_seq_length=16384,
    batch_size=4,
    learning_rate=1e-5,
    max_iters=2000,
    trainer_type="sft"
)
```

2. Add to the training script mapping in `scripts/training/train.py`:

```python
config_map = {
    # ... existing configs ...
    "config/train_smollm3_custom.py": get_custom_config,
}
```

### Custom Dataset Formats

Extend `src/data.py` to support new formats:

```python
def _load_custom_format(self, data_path: str) -> Dataset:
    """Load custom dataset format"""
    # Your custom loading logic here
    pass
```

### Custom Training Loops

Extend `src/trainer.py` for specialized training:

```python
class SmolLM3CustomTrainer(SmolLM3Trainer):
    def training_step(self, batch):
        # Custom training logic
        pass
```

## πŸ”§ Development & Contributing

### Project Structure

- **`src/`**: Core training modules
- **`config/`**: Training configurations
- **`scripts/`**: Utility scripts and automation
- **`docs/`**: Comprehensive documentation
- **`tests/`**: Test files and debugging tools

### Adding New Features

1. **Configuration**: Add to `config/` directory
2. **Core Logic**: Extend modules in `src/`
3. **Scripts**: Add utility scripts to `scripts/`
4. **Documentation**: Update relevant docs in `docs/`
5. **Tests**: Add test files to `tests/`

### Testing Your Changes

```bash
# Run basic tests
python tests/test_config.py
python tests/test_dataset.py
python tests/test_training.py

# Test specific components
python tests/test_monitoring.py
python tests/test_model_push.py
```

### Code Style

- Follow PEP 8 for Python code
- Use type hints for all functions
- Add comprehensive docstrings
- Include error handling for external APIs
- Use structured logging with consistent field names

## 🚨 Troubleshooting

### Common Issues

1. **Out of Memory (OOM)**
   ```bash
   # Reduce batch size in config
   batch_size=2  # instead of 8
   gradient_accumulation_steps=16  # increase to compensate
   ```

2. **Token Validation Errors**
   ```bash
   # Validate your HF token
   python scripts/validate_hf_token.py YOUR_TOKEN
   ```

3. **Dataset Loading Issues**
   ```bash
   # Check dataset format
   python tests/test_dataset_loading.py
   ```

### Debug Mode

Enable detailed logging:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

## 🀝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes following the code style
4. Add tests for new functionality
5. Update documentation
6. Submit a pull request

## πŸ“„ License

This project follows the same license as the SmolLM3 model. Please refer to the Hugging Face model page for licensing information.

## πŸ”— Resources

- [SmolLM3 Blog Post](https://huggingface.co/blog/smollm3)
- [Model Repository](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- [GitHub Repository](https://github.com/huggingface/smollm)
- [SmolTalk Dataset](https://huggingface.co/datasets/HuggingFaceTB/smoltalk)