Fine-tuning Gemma 3n E4B on MacBook M1 (Apple Silicon) with Unsloth
This project supports local fine-tuning of the Gemma 3n E4B model using Unsloth, PEFT/LoRA, and export to GGUF Q4_K_XL for efficient inference. The workflow is optimized for Apple Silicon (M1/M2/M3) and avoids CUDA/bitsandbytes dependencies.
Prerequisites
- Python 3.10+
- macOS with Apple Silicon (M1/M2/M3)
- PyTorch with MPS backend (install via `pip install torch`)
- All dependencies in `requirements.txt` (install with `pip install -r requirements.txt`)
Training Script Usage
Run the training script with your dataset (JSON/JSONL or Hugging Face format):
python training/train_gemma_unsloth.py \
--job-id myjob \
--output-dir training_runs/myjob \
--dataset sample_data/train.jsonl \
--prompt-field prompt --response-field response \
--epochs 1 --batch-size 1 --gradient-accumulation 8 \
--use-fp16 \
--grpo --cpt \
--export-gguf --gguf-out training_runs/myjob/adapter-gguf-q4_k_xl
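Each line of a JSONL dataset is one JSON object whose keys match `--prompt-field` and `--response-field`; the rows below are purely illustrative:
{"prompt": "Summarize: The quick brown fox jumps over the lazy dog.", "response": "A fox jumps over a dog."}
{"prompt": "Translate 'good morning' to French.", "response": "Bonjour."}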
Flags:
- `--grpo`: Enable GRPO (if supported by Unsloth)
- `--cpt`: Enable CPT (if supported by Unsloth)
- `--export-gguf`: Export to GGUF Q4_K_XL after training
- `--gguf-out`: Path to save the GGUF file
Notes:
- On Mac, bitsandbytes/xformers are disabled automatically.
- Training is slower than on CUDA GPUs; use small batch sizes and gradient accumulation.
- If Unsloth's GGUF export is unavailable, follow the printed instructions to use llama.cpp's `convert-hf-to-gguf.py`.
Troubleshooting
- If you see errors about missing CUDA or bitsandbytes, ensure you are running on Apple Silicon and have the latest Unsloth/Transformers.
- For memory errors, reduce `--batch-size` or `--cutoff-len`.
- For best results, use datasets formatted to match the official Gemma 3n chat template (see the sketch below).
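For chat-style data, one way to pre-format prompt/response pairs with the model's own chat template is sketched below; the repo id and column names are assumptions, not requirements of the training script.
# Sketch: render prompt/response pairs through the Gemma chat template before training.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-3n-E4B-it")  # assumed repo id

def format_example(example):
    messages = [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["response"]},
    ]
    # apply_chat_template inserts the Gemma turn markers so the model sees its native format.
    example["text"] = tokenizer.apply_chat_template(messages, tokenize=False)
    return example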
Example: Manual GGUF Export with llama.cpp
If the script prints a message about manual conversion, run:
python convert-hf-to-gguf.py --outtype q4_k_xl --outfile training_runs/myjob/adapter-gguf-q4_k_xl training_runs/myjob/adapter
Note: the accepted `--outtype` values vary by llama.cpp version; if yours does not accept `q4_k_xl` directly, convert to f16 first and produce the quantized file in a second step with llama.cpp's `llama-quantize` tool.
---
title: Multimodal AI Backend Service
colorFrom: yellow
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
---
firstAI - Multimodal AI Backend
A powerful AI backend service with multimodal capabilities and advanced deployment support, covering both text generation and image analysis through transformers pipelines.
Features
Configurable AI Models
- Default Text Model: Microsoft DialoGPT-medium (deployment-friendly)
- Advanced Models: Support for quantized models (Unsloth, 4-bit, GGUF)
- Environment Configuration: Runtime model selection via environment variables
- Quantization Support: Automatic 4-bit quantization with fallback mechanisms
Multimodal Support
- Process text-only messages
- Analyze images from URLs
- Combined image + text conversations
- OpenAI Vision API compatible format
Production Ready
- Enhanced Deployment: Multi-level fallback for quantized models
- Environment Flexibility: Works in constrained deployment environments
- Error Resilience: Comprehensive error handling with graceful degradation
- FastAPI backend with automatic docs
- Health checks and monitoring
- PyTorch with MPS acceleration (Apple Silicon)
Model Configuration
Configure models via environment variables:
# Set custom text model (optional)
export AI_MODEL="microsoft/DialoGPT-medium"
# Set custom vision model (optional)
export VISION_MODEL="Salesforce/blip-image-captioning-base"
# For private models (optional)
export HF_TOKEN="your_huggingface_token"
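As an illustration of how these variables are typically consumed, a minimal sketch follows (not necessarily how backend_service.py resolves them):
# Sketch: resolve model ids from the environment, falling back to the documented defaults.
import os

text_model = os.environ.get("AI_MODEL", "microsoft/DialoGPT-medium")
vision_model = os.environ.get("VISION_MODEL", "Salesforce/blip-image-captioning-base")
hf_token = os.environ.get("HF_TOKEN")  # only needed for private models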
Supported Model Types:
- Standard models: `microsoft/DialoGPT-medium`, `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`
- Quantized models: `unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit`
- GGUF models: `unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF`
Quick Start
1. Install Dependencies
pip install -r requirements.txt
2. Start the Service
python backend_service.py
3. Test Multimodal Capabilities
python test_final.py
The service will start on http://localhost:8001 with both text and vision models loaded.
Usage Examples
Text-Only Chat
curl -X POST http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/DialoGPT-medium",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Image Analysis
curl -X POST http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Salesforce/blip-image-captioning-base",
"messages": [
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://example.com/image.jpg"
}
]
}
]
}'
Multimodal (Image + Text)
curl -X POST http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Salesforce/blip-image-captioning-base",
"messages": [
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://example.com/image.jpg"
},
{
"type": "text",
"text": "What do you see in this image?"
}
]
}
]
}'
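The same request can be sent from Python; a minimal sketch with the requests library against the endpoint above (the image URL is a placeholder):
# Sketch: send an image + text chat request to the local service.
import requests

payload = {
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://example.com/image.jpg"},  # placeholder URL
                {"type": "text", "text": "What do you see in this image?"},
            ],
        }
    ],
}
response = requests.post("http://localhost:8001/v1/chat/completions", json=payload, timeout=120)
print(response.json())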
Technical Details
Architecture
- FastAPI web framework
- Transformers pipeline for AI models
- PyTorch backend with GPU/MPS support
- Pydantic for request/response validation
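To make the bullets above concrete, here is a heavily simplified sketch of a chat-completions endpoint wrapping a transformers pipeline with Pydantic validation; it is illustrative only and not the actual backend_service.py.
# Sketch: minimal FastAPI wrapper around a transformers text-generation pipeline.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="microsoft/DialoGPT-medium")

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    model: str
    messages: list[Message]

@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest):
    # Concatenate message contents into one prompt; the real service does more than this.
    prompt = "\n".join(m.content for m in req.messages)
    output = generator(prompt, max_new_tokens=100)[0]["generated_text"]
    return {
        "object": "chat.completion",
        "model": req.model,
        "choices": [{"index": 0, "message": {"role": "assistant", "content": output}}],
    }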
Models
- Text: microsoft/DialoGPT-medium
- Vision: Salesforce/blip-image-captioning-base
API Endpoints
- `GET /` - Service information
- `GET /health` - Health check
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completions (text/multimodal)
- `GET /docs` - Interactive API documentation
Deployment
Environment Variables
# Optional: Custom models
export AI_MODEL="microsoft/DialoGPT-medium"
export VISION_MODEL="Salesforce/blip-image-captioning-base"
export HF_TOKEN="your_token_here" # For private models
Production Deployment
The service includes enhanced deployment capabilities:
- Quantized Model Support: Automatic handling of 4-bit and GGUF models
- Fallback Mechanisms: Multi-level fallback for constrained environments
- Error Resilience: Graceful degradation when quantization libraries are unavailable
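A sketch of the fallback idea, assuming a bitsandbytes 4-bit load is attempted first and a plain load is the last resort; the service's actual logic may differ.
# Sketch: try 4-bit quantized loading and degrade gracefully if it is unavailable.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_model(model_id: str):
    try:
        quant_config = BitsAndBytesConfig(load_in_4bit=True)
        return AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant_config)
    except (ImportError, ValueError, RuntimeError):
        # bitsandbytes missing or unsupported on this hardware: fall back to a standard load.
        return AutoModelForCausalLM.from_pretrained(model_id)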
Docker Deployment
# Build and run with Docker
docker build -t firstai .
docker run -p 8000:8000 firstai
Testing Deployment
# Test quantization detection and fallbacks
python test_deployment_fallbacks.py
# Test health endpoint
curl http://localhost:8000/health
For comprehensive deployment guidance, see DEPLOYMENT_ENHANCEMENTS.md.
Testing
Run the comprehensive test suite:
python test_final.py
Test individual components:
python test_multimodal.py # Basic multimodal tests
python test_pipeline.py # Pipeline compatibility
Dependencies
Key packages:
- `fastapi` - Web framework
- `transformers` - AI model pipelines
- `torch` - PyTorch backend
- `Pillow` - Image processing
- `accelerate` - Model acceleration
- `requests` - HTTP client
Integration Complete
This project successfully integrates:
- Transformers image-text-to-text pipeline
- OpenAI Vision API compatibility
- Multimodal message processing
- Production-ready FastAPI service
See MULTIMODAL_INTEGRATION_COMPLETE.md for detailed integration documentation.
---
title: AI Backend Service
colorFrom: yellow
colorTo: purple
sdk: fastapi
sdk_version: 0.100.0
app_file: backend_service.py
pinned: false
---
AI Backend Service
Status: CONVERSION COMPLETE!
Successfully converted from a non-functioning Gradio HuggingFace app to a production-ready FastAPI backend service with OpenAI-compatible API endpoints.
Quick Start
1. Setup Environment
# Activate the virtual environment
source gradio_env/bin/activate
# Install dependencies (already done)
pip install -r requirements.txt
2. Start the Backend Service
python backend_service.py --port 8000 --reload
3. Test the API
# Run comprehensive tests
python test_api.py
# Or try usage examples
python usage_examples.py
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Service information |
| `/health` | GET | Health check |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completion (OpenAI compatible) |
| `/v1/completions` | POST | Text completion |
Example Usage
Chat Completion
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/DialoGPT-medium",
"messages": [
{"role": "user", "content": "Hello! How are you?"}
],
"max_tokens": 150,
"temperature": 0.7
}'
Streaming Chat
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/DialoGPT-medium",
"messages": [
{"role": "user", "content": "Tell me a joke"}
],
"stream": true
}'
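Assuming the stream is delivered as OpenAI-style server-sent event lines (an assumption about the framing), it can be consumed from Python roughly like this:
# Sketch: read a streaming chat completion chunk by chunk.
import requests

payload = {
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream": True,
}
with requests.post("http://localhost:8000/v1/chat/completions", json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line:
            print(line.decode("utf-8"))  # each non-empty line is one streamed chunk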
Files
- `app.py` - Original Gradio ChatInterface (still functional)
- `backend_service.py` - New FastAPI backend service
- `test_api.py` - Comprehensive API testing
- `usage_examples.py` - Simple usage examples
- `requirements.txt` - Updated dependencies
- `CONVERSION_COMPLETE.md` - Detailed conversion documentation
Features
- OpenAI-Compatible API - Drop-in replacement for the OpenAI API
- Async FastAPI - High-performance async architecture
- Streaming Support - Real-time response streaming
- Error Handling - Robust error handling with fallbacks
- Production Ready - CORS, logging, health checks
- Docker Ready - Easy containerization
- Auto-reload - Development-friendly auto-reload
- Type Safety - Full type hints with Pydantic validation
Service URLs
- Backend Service: http://localhost:8000
- API Documentation: http://localhost:8000/docs
- OpenAPI Spec: http://localhost:8000/openapi.json
Model Information
- Current Model: `microsoft/DialoGPT-medium`
- Type: Conversational AI model
- Provider: HuggingFace Inference API
- Capabilities: Text generation, chat completion
Architecture
Client Request (OpenAI format) --> FastAPI Backend (backend_service) --> HuggingFace API (DialoGPT-medium)
                                              |
                                              v
                                   OpenAI Response (JSON/Streaming)
Development
The service includes:
- Auto-reload for development
- Comprehensive logging for debugging
- Type checking for code quality
- Test suite for reliability
- Error handling for robustness
Production Deployment
Ready for production with:
- Environment variables for configuration
- Health check endpoints for monitoring
- CORS support for web applications
- Docker compatibility for containerization
- Structured logging for observability
Conversion Status: COMPLETE!
Successfully transformed from broken Gradio app to production-ready AI backend service.
For detailed conversion documentation, see CONVERSION_COMPLETE.md.
Gemma 3n GGUF FastAPI Backend (Hugging Face Space)
This Space provides an OpenAI-compatible chat API for Gemma 3n GGUF models, powered by FastAPI.
Note: On Hugging Face Spaces, the backend runs in `DEMO_MODE` (no model loaded) for demonstration and endpoint testing. For real inference, run locally with a GGUF model and llama-cpp-python.
Endpoints
- `/health` - Health check
- `/v1/chat/completions` - OpenAI-style chat completions (returns demo response)
- `/train/start` - Start a (demo) training job
- `/train/status/{job_id}` - Check training job status
- `/train/logs/{job_id}` - Get training logs
Usage
- Clone this repo or create a Hugging Face Space (type: FastAPI).
- All dependencies are in `requirements.txt`.
- The Space will start in demo mode (no model download required).
Local Inference (with GGUF)
To run with a real model locally:
- Download a Gemma 3n GGUF model (e.g. from https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF).
- Set `AI_MODEL` to the local path or repo.
- Unset `DEMO_MODE`.
- Run:
pip install -r requirements.txt
uvicorn gemma_gguf_backend:app --host 0.0.0.0 --port 8000
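For a quick local sanity check outside the FastAPI service, llama-cpp-python can load the GGUF file directly; the file name below is a placeholder for whichever quantization you downloaded.
# Sketch: load a local Gemma 3n GGUF file and run a single chat turn.
from llama_cpp import Llama

llm = Llama(model_path="gemma-3n-E4B-it-Q4_K_XL.gguf", n_ctx=4096)  # placeholder file name
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(result["choices"][0]["message"]["content"])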
License
Apache 2.0