# ZamAI Pashto LLM Fine-tuning Workflow
This guide provides a comprehensive workflow for fine-tuning a Mistral-7B model on Pashto data using Hugging Face AutoTrain.
## Overview of the Workflow
1. **Prepare the Dataset**: Standardize column names and split into train/test
2. **Check Dataset Format**: Validate dataset compatibility with AutoTrain
3. **Upload to Hugging Face**: Share dataset on the Hub for use with AutoTrain
4. **Start Fine-tuning**: Launch a training job with AutoTrain
5. **Evaluate Results**: Check model performance and iterate if needed
## Step 1: Prepare Your Data
Use the `prepare_for_autotrain.py` script to format your data for AutoTrain:
```bash
# Activate your virtual environment
source venv/bin/activate
# Install dependencies if needed
pip install datasets huggingface_hub tqdm
# Format data - choose either "text" or "instruction-response" format
python prepare_for_autotrain.py \
--input_file zamai_pashto_dataset.json \
--format_type instruction-response \
--push_to_hub \
--hub_dataset_name ZamAI_Pashto_Training \
  --hf_token YOUR_HF_TOKEN_HERE
```
This script:
- Converts your data to a standardized format
- Splits into train/test sets
- Uploads to the Hugging Face Hub
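For reference, a single record in the instruction-response format might look like the example below. This is a hypothetical illustration; the exact field names (`instruction`, `response`) are an assumption based on the column names used later in this guide:

```python
import json

# Hypothetical single-record example of the instruction-response format.
example_record = {
    "instruction": "ستاسو نوم څه دی؟",   # "What is your name?"
    "response": "زما نوم ZamAI دی.",      # "My name is ZamAI." (illustrative)
}

with open("example_record.json", "w", encoding="utf-8") as f:
    json.dump([example_record], f, ensure_ascii=False, indent=2)
```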
## Step 2: Verify Dataset Format

Use the `check_dataset_format.py` script to validate that your dataset is ready for training:

```bash
# Check the uploaded dataset format
python check_dataset_format.py --dataset_path YOUR_USERNAME/ZamAI_Pashto_Training
```

Ensure that:

- The dataset has the expected columns (`text` or `instruction`/`response`)
- Both train and test splits are available
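If you want to spot-check these properties yourself without the helper script, a minimal check with the `datasets` library might look like this (the repository name is a placeholder):

```python
from datasets import load_dataset

# Placeholder repository name; substitute your own username/dataset.
dataset = load_dataset("YOUR_USERNAME/ZamAI_Pashto_Training")

# Confirm both splits exist and inspect their columns and sizes.
for split in ("train", "test"):
    assert split in dataset, f"Missing split: {split}"
    print(split, dataset[split].column_names, len(dataset[split]))
```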
## Step 3: Fine-tune with AutoTrain

Use the `autotrain_finetune.py` script to launch a fine-tuning job on AutoTrain:

```bash
python autotrain_finetune.py \
  --project_name "ZamAI-Mistral-7B-Pashto" \
  --model "mistralai/Mistral-7B-v0.1" \
  --dataset "tasal9/ZamAI_Pashto_Dataset" \
  --text_column "instruction" \
  --lr 2e-4 \
  --epochs 3 \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --hf_token "YOUR_HF_TOKEN_HERE"
```
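Before launching a long training job, it can help to confirm that your token is valid. This is a small optional check with `huggingface_hub`, not part of `autotrain_finetune.py`:

```python
from huggingface_hub import HfApi

# Sanity-check the token before starting training; prints the account name.
api = HfApi(token="YOUR_HF_TOKEN_HERE")
print(api.whoami()["name"])
```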
## Step 4: Monitor Training Progress
Visit the AutoTrain dashboard to monitor:
- Training logs
- Loss and evaluation metrics
- Resource utilization
## Step 5: Evaluate and Iterate

After training, evaluate your model's performance using our evaluation script:

```bash
# Evaluate your fine-tuned model
python evaluate_and_iterate.py \
  --model_name "tasal9/ZamAI-Mistral-7B-Pashto" \
  --dataset_file "zamai_pashto_dataset.json" \
  --num_samples 20 \
  --output_file "evaluation_results.json"
```
The script will:
- Select diverse samples from different categories in your dataset
- Generate responses using your fine-tuned model
- Analyze performance and provide recommendations for improvement
- Save detailed results for further analysis
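For context, the core of such an evaluation pass looks roughly like the following. This is a sketch of the general approach, not the actual contents of `evaluate_and_iterate.py`, and it assumes the dataset file is a JSON list of instruction/response records:

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed names taken from the command above.
model_name = "tasal9/ZamAI-Mistral-7B-Pashto"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

with open("zamai_pashto_dataset.json", encoding="utf-8") as f:
    samples = json.load(f)[:20]  # small evaluation subset

results = []
for sample in samples:
    inputs = tokenizer(sample["instruction"], return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    results.append({
        "instruction": sample["instruction"],
        "reference": sample.get("response"),
        "generated": tokenizer.decode(outputs[0], skip_special_tokens=True),
    })

with open("evaluation_results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```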
Based on the evaluation:
- Identify areas for improvement in your dataset or training setup
- Update your dataset by adding more examples where needed
- Iterate by adjusting:
  - Dataset composition and cleaning
  - Model hyperparameters (learning rate, epochs, etc.)
  - LoRA parameters for more efficient fine-tuning
### Parameters to Experiment With

| Parameter | Range to Try | Impact |
|---|---|---|
| Learning rate | 1e-5 to 5e-4 | Lower values are more stable but slower |
| Epochs | 2-5 | More epochs may improve performance but risk overfitting |
| LoRA rank (r) | 8-64 | Higher ranks give more capacity but use more memory |
| LoRA alpha | 16-64 | Controls adaptation strength |
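To make the LoRA rows concrete, they map roughly onto a `peft` LoRA configuration like the one below. This is a sketch for reference only; the `target_modules` shown are a common choice for Mistral-style models and are an assumption, not a value confirmed by this guide:

```python
from peft import LoraConfig

# Sketch of the LoRA settings used in the Step 3 command.
lora_config = LoraConfig(
    r=16,               # LoRA rank: adapter capacity vs. memory use
    lora_alpha=32,      # scaling factor controlling adaptation strength
    lora_dropout=0.05,  # dropout applied to adapter layers
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed modules
    task_type="CAUSAL_LM",
)
```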
## Using Your Fine-tuned Model

After training completes, your model will be available at `YOUR_USERNAME/ZamAI-Mistral-7B-Pashto`. Use it in your applications:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model_name = "YOUR_USERNAME/ZamAI-Mistral-7B-Pashto"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# For LoRA models, you may need to load and merge the adapter (optional):
# from peft import PeftModel
# model = PeftModel.from_pretrained(model, model_name)
# model = model.merge_and_unload()

# Generate text
prompt = "ستاسو نوم څه دی؟"  # What is your name?
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Troubleshooting
If you encounter issues:
- Dataset formatting problems: Check column names and dataset splits
- Out-of-memory errors: Reduce batch size or use LoRA with smaller rank
- Poor performance: Try cleaning your dataset, adding more examples, or using different training parameters
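If memory is also a constraint when loading the fine-tuned model locally (for example during the evaluation step), a quantized load can help. This is a sketch assuming `bitsandbytes` and `accelerate` are installed; it is not part of the scripts above:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the model in 4-bit precision to reduce GPU memory use.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "YOUR_USERNAME/ZamAI-Mistral-7B-Pashto",  # placeholder repo name
    quantization_config=quant_config,
    device_map="auto",
)
```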
## Pushing Files to GitHub

After fine-tuning, you'll want to organize and push your model files to GitHub. Use our automated scripts:

```bash
# Update the GitHub repository with fine-tuning files
./update_github.sh
```

This script will:

- Create the proper directory structure in `models/text-generation/ZamAI-Mistral-7B/`
- Copy all relevant files (scripts, documentation, configs) to this directory
- Commit and push changes to GitHub
You can also use the combined update script to update both GitHub and Hugging Face Hub:
```bash
# Update both GitHub and Hugging Face Hub
./update_repos.sh
```
### Manual Organization (if needed)

If you prefer to organize files manually:

1. Create the model directory if it doesn't exist:

   ```bash
   mkdir -p models/text-generation/ZamAI-Mistral-7B
   ```

2. Copy relevant files to the model directory:

   ```bash
   cp autotrain_finetune.py evaluate_and_iterate.py FINETUNING_WORKFLOW.md models/text-generation/ZamAI-Mistral-7B/
   ```

3. Add a model-specific README:

   ```bash
   cp README.md models/text-generation/ZamAI-Mistral-7B/MODEL_README.md
   ```

4. Commit and push to GitHub:

   ```bash
   git add models/
   git commit -m "Add ZamAI-Mistral-7B fine-tuning files"
   git push origin main
   ```