# SmolLM3 Fine-tuning

This repository provides a complete setup for fine-tuning SmolLM3 models using the FlexAI console, following the nanoGPT structure but adapted for modern transformer models.

## Overview

SmolLM3 is a 3B-parameter transformer decoder model optimized for efficiency, long-context reasoning, and multilingual support. This setup allows you to fine-tune SmolLM3 for various tasks, including:

- **Supervised Fine-tuning (SFT)**: Adapt the model for instruction following
- **Direct Preference Optimization (DPO)**: Improve model alignment
- **Long-context fine-tuning**: Support for up to 128k tokens
- **Tool calling**: Fine-tune for function calling capabilities

## Quick Start

### 1. Repository Setup

The repository follows the FlexAI console structure with the following key files:

- `train.py`: Main entry point script
- `config/train_smollm3.py`: Default configuration
- `model.py`: Model wrapper and loading
- `data.py`: Dataset handling and preprocessing
- `trainer.py`: Training loop and trainer setup
- `requirements.txt`: Dependencies
### 2. FlexAI Console Configuration

When setting up a Fine Tuning Job in the FlexAI console, use these settings:

#### Basic Configuration

- **Name**: `smollm3-finetune`
- **Cluster**: Your organization's designated cluster
- **Checkpoint**: (Optional) Previous training job checkpoint
- **Node Count**: 1
- **Accelerator Count**: 1-8 (depending on your needs)

#### Repository Settings

- **Repository URL**: `https://github.com/your-username/flexai-finetune`
- **Repository Revision**: `main`

#### Dataset Configuration

- **Datasets**: Your dataset (mounted under `/input`)
- **Mount Directory**: `my_dataset`

#### Entry Point

```
train.py config/train_smollm3.py --dataset_dir=my_dataset --init_from=resume --out_dir=/input-checkpoint --max_iters=1500
```
### 3. Dataset Format

The script supports multiple dataset formats:

#### Chat Format (Recommended)

```json
[
  {
    "messages": [
      {"role": "user", "content": "What is machine learning?"},
      {"role": "assistant", "content": "Machine learning is a subset of AI..."}
    ]
  }
]
```

#### Instruction Format

```json
[
  {
    "instruction": "What is machine learning?",
    "output": "Machine learning is a subset of AI..."
  }
]
```

#### User-Assistant Format

```json
[
  {
    "user": "What is machine learning?",
    "assistant": "Machine learning is a subset of AI..."
  }
]
```
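All three formats reduce to the same chat-message structure. A minimal normalization sketch (a hypothetical helper for illustration; the actual preprocessing lives in `data.py`, and the `train.json` filename is an assumption):

```python
import json

def to_messages(example: dict) -> list[dict]:
    """Normalize a raw example into the chat-messages format."""
    if "messages" in example:  # chat format
        return example["messages"]
    if "instruction" in example:  # instruction format
        return [
            {"role": "user", "content": example["instruction"]},
            {"role": "assistant", "content": example["output"]},
        ]
    if "user" in example:  # user-assistant format
        return [
            {"role": "user", "content": example["user"]},
            {"role": "assistant", "content": example["assistant"]},
        ]
    raise ValueError(f"Unrecognized example keys: {list(example)}")

with open("my_dataset/train.json") as f:
    examples = [to_messages(ex) for ex in json.load(f)]
```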
### 4. Configuration Options

The default configuration in `config/train_smollm3.py` includes:

```python
from dataclasses import dataclass

@dataclass
class SmolLM3Config:
    # Model configuration
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    max_seq_length: int = 4096
    use_flash_attention: bool = True

    # Training configuration
    batch_size: int = 4
    gradient_accumulation_steps: int = 4
    learning_rate: float = 2e-5
    max_iters: int = 1000

    # Mixed precision
    fp16: bool = True
    bf16: bool = False
```
### 5. Command Line Arguments

The `train.py` script accepts various arguments:

```bash
# Basic usage
python train.py config/train_smollm3.py

# With custom parameters
python train.py config/train_smollm3.py \
    --dataset_dir=my_dataset \
    --out_dir=/output-checkpoint \
    --init_from=resume \
    --max_iters=1500 \
    --batch_size=8 \
    --learning_rate=1e-5 \
    --max_seq_length=8192
```
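Following the nanoGPT convention, the config file supplies defaults and each `--key=value` flag overrides one field. A simplified sketch of that override mechanism (illustrative only; the real parsing lives in `train.py`, and this assumes every overridable key is a field on the config dataclass):

```python
import sys
from dataclasses import replace

from config.train_smollm3 import SmolLM3Config

def parse_overrides(argv: list[str]) -> dict:
    """Turn --key=value flags into a dict, casting booleans and numbers."""
    overrides = {}
    for arg in argv:
        assert arg.startswith("--") and "=" in arg, f"unexpected argument: {arg}"
        key, raw = arg[2:].split("=", 1)
        if raw in ("True", "False"):
            overrides[key] = raw == "True"
        else:
            try:
                overrides[key] = int(raw)
            except ValueError:
                try:
                    overrides[key] = float(raw)
                except ValueError:
                    overrides[key] = raw  # keep as string
    return overrides

# argv[1] is the config file; the remaining flags override its defaults
config = replace(SmolLM3Config(), **parse_overrides(sys.argv[2:]))
```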
## Advanced Usage

### 1. Custom Configuration

Create a custom configuration file:

```python
# config/my_config.py
from config.train_smollm3 import SmolLM3Config

config = SmolLM3Config(
    model_name="HuggingFaceTB/SmolLM3-3B-Instruct",
    max_seq_length=8192,
    batch_size=2,
    learning_rate=1e-5,
    max_iters=2000,
    use_flash_attention=True,
    fp16=True
)
```
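Then pass it as the entry point's first argument in place of the default: `python train.py config/my_config.py`.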
### 2. Long-Context Fine-tuning

For long-context tasks (up to 128k tokens):

```python
config = SmolLM3Config(
    max_seq_length=131072,  # 128k tokens
    model_name="HuggingFaceTB/SmolLM3-3B",
    use_flash_attention=True,
    gradient_checkpointing=True
)
```
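Even with Flash Attention and gradient checkpointing, 128k-token sequences are memory-hungry; expect to drop `batch_size` to 1 and rely on `gradient_accumulation_steps` to maintain a useful effective batch size.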
### 3. DPO Training

For preference optimization, use the DPO trainer:

```python
from trainer import SmolLM3DPOTrainer

dpo_trainer = SmolLM3DPOTrainer(
    model=model,
    dataset=dataset,
    config=config,
    output_dir="./dpo-output"
)
dpo_trainer.train()
```
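DPO trains on preference pairs rather than single completions. The exact schema expected by `SmolLM3DPOTrainer` depends on its implementation, but preference datasets conventionally pair a prompt with a preferred and a rejected response, for example:

```json
[
  {
    "prompt": "What is machine learning?",
    "chosen": "Machine learning is a subset of AI...",
    "rejected": "Machine learning is when computers learn stuff."
  }
]
```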
### 4. Tool Calling Fine-tuning

Include tool calling examples in your dataset:

```json
[
  {
    "messages": [
      {"role": "user", "content": "What's the weather in New York?"},
      {"role": "assistant", "content": "<tool_call>\n<invoke name=\"get_weather\">\n<parameter name=\"location\">New York</parameter>\n</invoke>\n</tool_call>"},
      {"role": "tool", "content": "The weather in New York is 72°F and sunny."},
      {"role": "assistant", "content": "The weather in New York is currently 72°F and sunny."}
    ]
  }
]
```
## Model Variants

SmolLM3 comes in several variants:

- **SmolLM3-3B-Base**: Base model for general fine-tuning
- **SmolLM3-3B**: Instruction-tuned model
- **SmolLM3-3B-Instruct**: Enhanced instruction model
- **Quantized versions**: Available for deployment

## Hardware Requirements

### Minimum Requirements

- **GPU**: 16GB+ VRAM (for the 3B model)
- **RAM**: 32GB+ system memory
- **Storage**: 50GB+ free space

### Recommended

- **GPU**: A100/H100 or similar
- **RAM**: 64GB+ system memory
- **Storage**: 100GB+ SSD
## Troubleshooting

### Common Issues

1. **Out of Memory (OOM)**
   - Reduce `batch_size`
   - Increase `gradient_accumulation_steps`
   - Enable `gradient_checkpointing`
   - Use `fp16` or `bf16` (see the combined example after this list)
2. **Slow Training**
   - Enable `flash_attention`
   - Use mixed precision (`fp16`/`bf16`)
   - Increase `dataloader_num_workers`
3. **Dataset Loading Issues**
   - Check the dataset format
   - Ensure proper JSON structure
   - Verify file permissions
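Combining the OOM mitigations above, a memory-lean variant of the default configuration might look like this (a sketch; `gradient_checkpointing` is assumed to be a supported field, as in the long-context example earlier):

```python
config = SmolLM3Config(
    batch_size=1,                    # smallest per-device batch
    gradient_accumulation_steps=16,  # preserve the effective batch size
    fp16=True,                       # halve activation memory
    gradient_checkpointing=True,     # trade compute for memory
)
```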
### Debug Mode

Enable debug logging:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
## Evaluation

After training, evaluate your model:

```python
from transformers import pipeline

pipe = pipeline(
    task="text-generation",
    model="./output-checkpoint",
    device=0,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7
)

# Test the model
messages = [{"role": "user", "content": "Explain gravity in simple terms."}]
outputs = pipe(messages)
print(outputs[0]["generated_text"][-1]["content"])
```
## Deployment

### Using vLLM

```bash
vllm serve ./output-checkpoint --enable-auto-tool-choice
```
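Once the server is running, vLLM exposes an OpenAI-compatible API (port 8000 by default). A quick smoke test with the `openai` client, assuming the model name matches the path passed to `vllm serve`:

```python
from openai import OpenAI

# vLLM's local server accepts any placeholder API key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="./output-checkpoint",
    messages=[{"role": "user", "content": "Explain gravity in simple terms."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```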
### Using llama.cpp

```bash
# Convert to GGUF format with llama.cpp's conversion script
python convert_hf_to_gguf.py ./output-checkpoint --outfile model.gguf
```
## Resources

- [SmolLM3 Blog Post](https://huggingface.co/blog/smollm3)
- [Model Repository](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- [GitHub Repository](https://github.com/huggingface/smollm)
- [SmolTalk Dataset](https://huggingface.co/datasets/HuggingFaceTB/smoltalk)

## License

This project follows the same license as the SmolLM3 model. Please refer to the Hugging Face model page for licensing information.
| { | |
| "id": "exp_20250718_195852", | |
| "name": "petit-elle-l-aime-3", | |
| "description": "SmolLM3 fine-tuning experiment", | |
| "created_at": "2025-07-18T19:58:52.689087", | |
| "status": "running", | |
| "metrics": [], | |
| "parameters": {}, | |
| "artifacts": [], | |
| "logs": [] | |
| } |