Fine-tuning Gemma 3n E4B on MacBook M1 (Apple Silicon) with Unsloth
This project supports local fine-tuning of the Gemma 3n E4B model using Unsloth, PEFT/LoRA, and export to GGUF Q4_K_XL for efficient inference. The workflow is optimized for Apple Silicon (M1/M2/M3) and avoids CUDA/bitsandbytes dependencies.
Prerequisites
- Python 3.10+
- macOS with Apple Silicon (M1/M2/M3)
- PyTorch with MPS backend (install via `pip install torch`)
- All dependencies in `requirements.txt` (install with `pip install -r requirements.txt`)
Training Script Usage
Run the training script with your dataset (JSON/JSONL or Hugging Face format):
python training/train_gemma_unsloth.py \
--job-id myjob \
--output-dir training_runs/myjob \
--dataset sample_data/train.jsonl \
--prompt-field prompt --response-field response \
--epochs 1 --batch-size 1 --gradient-accumulation 8 \
--use-fp16 \
--grpo --cpt \
--export-gguf --gguf-out training_runs/myjob/adapter-gguf-q4_k_xl
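Each line of a JSONL dataset is one JSON object whose keys match `--prompt-field` and `--response-field`; the rows below are purely illustrative:
{"prompt": "Summarize: The quick brown fox jumps over the lazy dog.", "response": "A fox jumps over a dog."}
{"prompt": "Translate 'good morning' to French.", "response": "Bonjour."}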
Flags:
- `--grpo`: Enable GRPO (if supported by Unsloth)
- `--cpt`: Enable CPT (if supported by Unsloth)
- `--export-gguf`: Export to GGUF Q4_K_XL after training
- `--gguf-out`: Path to save the GGUF file
Notes:
- On Mac, bitsandbytes/xformers are disabled automatically.
- Training is slower than on CUDA GPUs; use small batch sizes and gradient accumulation.
- If Unsloth's GGUF export is unavailable, follow the printed instructions to use llama.cpp's `convert-hf-to-gguf.py`.
Troubleshooting
- If you see errors about missing CUDA or bitsandbytes, ensure you are running on Apple Silicon and have the latest Unsloth/Transformers.
- For memory errors, reduce `--batch-size` or `--cutoff-len`.
- For best results, use datasets formatted to match the official Gemma 3n chat template (see the sketch below).
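For chat-style data, one way to pre-format prompt/response pairs with the model's own chat template is sketched below; the repo id and column names are assumptions, not requirements of the training script.
# Sketch: render prompt/response pairs through the Gemma chat template before training.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-3n-E4B-it")  # assumed repo id

def format_example(example):
    messages = [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["response"]},
    ]
    # apply_chat_template inserts the Gemma turn markers so the model sees its native format.
    example["text"] = tokenizer.apply_chat_template(messages, tokenize=False)
    return example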
Example: Manual GGUF Export with llama.cpp
If the script prints a message about manual conversion, run:
python convert-hf-to-gguf.py --outtype q4_k_xl --outfile training_runs/myjob/adapter-gguf-q4_k_xl training_runs/myjob/adapter
Note: the accepted `--outtype` values vary by llama.cpp version; if yours does not accept `q4_k_xl` directly, convert to f16 first and produce the quantized file in a second step with llama.cpp's `llama-quantize` tool.
---
title: Multimodal AI Backend Service
colorFrom: yellow
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
---
firstAI - Multimodal AI Backend
A powerful AI backend service with multimodal capabilities and advanced deployment support, covering both text generation and image analysis through transformers pipelines.
Features
Configurable AI Models
- Default Text Model: Microsoft DialoGPT-medium (deployment-friendly)
- Advanced Models: Support for quantized models (Unsloth, 4-bit, GGUF)
- Environment Configuration: Runtime model selection via environment variables
- Quantization Support: Automatic 4-bit quantization with fallback mechanisms
Multimodal Support
- Process text-only messages
- Analyze images from URLs
- Combined image + text conversations
- OpenAI Vision API compatible format
Production Ready
- Enhanced Deployment: Multi-level fallback for quantized models
- Environment Flexibility: Works in constrained deployment environments
- Error Resilience: Comprehensive error handling with graceful degradation
- FastAPI backend with automatic docs
- Health checks and monitoring
- PyTorch with MPS acceleration (Apple Silicon)
Model Configuration
Configure models via environment variables:
# Set custom text model (optional)
export AI_MODEL="microsoft/DialoGPT-medium"
# Set custom vision model (optional)
export VISION_MODEL="Salesforce/blip-image-captioning-base"
# For private models (optional)
export HF_TOKEN="your_huggingface_token"
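As an illustration of how these variables are typically consumed, a minimal sketch follows (not necessarily how backend_service.py resolves them):
# Sketch: resolve model ids from the environment, falling back to the documented defaults.
import os

text_model = os.environ.get("AI_MODEL", "microsoft/DialoGPT-medium")
vision_model = os.environ.get("VISION_MODEL", "Salesforce/blip-image-captioning-base")
hf_token = os.environ.get("HF_TOKEN")  # only needed for private models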
Supported Model Types:
- Standard models: `microsoft/DialoGPT-medium`, `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`
- Quantized models: `unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit`
- GGUF models: `unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF`
Quick Start
1. Install Dependencies
pip install -r requirements.txt
2. Start the Service
python backend_service.py
3. Test Multimodal Capabilities
python test_final.py
The service will start on http://localhost:8001 with both text and vision models loaded.
Usage Examples
Text-Only Chat
curl -X POST http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/DialoGPT-medium",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Image Analysis
curl -X POST http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Salesforce/blip-image-captioning-base",
"messages": [
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://example.com/image.jpg"
}
]
}
]
}'
Multimodal (Image + Text)
curl -X POST http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Salesforce/blip-image-captioning-base",
"messages": [
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://example.com/image.jpg"
},
{
"type": "text",
"text": "What do you see in this image?"
}
]
}
]
}'
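The same request can be sent from Python; a minimal sketch with the requests library against the endpoint above (the image URL is a placeholder):
# Sketch: send an image + text chat request to the local service.
import requests

payload = {
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://example.com/image.jpg"},  # placeholder URL
                {"type": "text", "text": "What do you see in this image?"},
            ],
        }
    ],
}
response = requests.post("http://localhost:8001/v1/chat/completions", json=payload, timeout=120)
print(response.json())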
Technical Details
Architecture
- FastAPI web framework
- Transformers pipeline for AI models
- PyTorch backend with GPU/MPS support
- Pydantic for request/response validation
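To make the bullets above concrete, here is a heavily simplified sketch of a chat-completions endpoint wrapping a transformers pipeline with Pydantic validation; it is illustrative only and not the actual backend_service.py.
# Sketch: minimal FastAPI wrapper around a transformers text-generation pipeline.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="microsoft/DialoGPT-medium")

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    model: str
    messages: list[Message]

@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest):
    # Concatenate message contents into one prompt; the real service does more than this.
    prompt = "\n".join(m.content for m in req.messages)
    output = generator(prompt, max_new_tokens=100)[0]["generated_text"]
    return {
        "object": "chat.completion",
        "model": req.model,
        "choices": [{"index": 0, "message": {"role": "assistant", "content": output}}],
    }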
Models
- Text: microsoft/DialoGPT-medium
- Vision: Salesforce/blip-image-captioning-base
API Endpoints
- `GET /` - Service information
- `GET /health` - Health check
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completions (text/multimodal)
- `GET /docs` - Interactive API documentation
Deployment
Environment Variables
# Optional: Custom models
export AI_MODEL="microsoft/DialoGPT-medium"
export VISION_MODEL="Salesforce/blip-image-captioning-base"
export HF_TOKEN="your_token_here" # For private models
Production Deployment
The service includes enhanced deployment capabilities:
- Quantized Model Support: Automatic handling of 4-bit and GGUF models
- Fallback Mechanisms: Multi-level fallback for constrained environments
- Error Resilience: Graceful degradation when quantization libraries are unavailable
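A sketch of the fallback idea, assuming a bitsandbytes 4-bit load is attempted first and a plain load is the last resort; the service's actual logic may differ.
# Sketch: try 4-bit quantized loading and degrade gracefully if it is unavailable.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_model(model_id: str):
    try:
        quant_config = BitsAndBytesConfig(load_in_4bit=True)
        return AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant_config)
    except (ImportError, ValueError, RuntimeError):
        # bitsandbytes missing or unsupported on this hardware: fall back to a standard load.
        return AutoModelForCausalLM.from_pretrained(model_id)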
Docker Deployment
# Build and run with Docker
docker build -t firstai .
docker run -p 8000:8000 firstai
Testing Deployment
# Test quantization detection and fallbacks
python test_deployment_fallbacks.py
# Test health endpoint
curl http://localhost:8000/health
For comprehensive deployment guidance, see DEPLOYMENT_ENHANCEMENTS.md.
Testing
Run the comprehensive test suite:
python test_final.py
Test individual components:
python test_multimodal.py # Basic multimodal tests
python test_pipeline.py # Pipeline compatibility
Dependencies
Key packages:
- `fastapi` - Web framework
- `transformers` - AI model pipelines
- `torch` - PyTorch backend
- `Pillow` - Image processing
- `accelerate` - Model acceleration
- `requests` - HTTP client
Integration Complete
This project successfully integrates:
- Transformers image-text-to-text pipeline
- OpenAI Vision API compatibility
- Multimodal message processing
- Production-ready FastAPI service
See MULTIMODAL_INTEGRATION_COMPLETE.md for detailed integration documentation.
---
title: AI Backend Service
colorFrom: yellow
colorTo: purple
sdk: fastapi
sdk_version: 0.100.0
app_file: backend_service.py
pinned: false
---
AI Backend Service
Status: CONVERSION COMPLETE!
Successfully converted from a non-functioning Gradio HuggingFace app to a production-ready FastAPI backend service with OpenAI-compatible API endpoints.
Quick Start
1. Setup Environment
# Activate the virtual environment
source gradio_env/bin/activate
# Install dependencies (already done)
pip install -r requirements.txt
2. Start the Backend Service
python backend_service.py --port 8000 --reload
3. Test the API
# Run comprehensive tests
python test_api.py
# Or try usage examples
python usage_examples.py
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Service information |
| `/health` | GET | Health check |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completion (OpenAI compatible) |
| `/v1/completions` | POST | Text completion |
Example Usage
Chat Completion
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/DialoGPT-medium",
"messages": [
{"role": "user", "content": "Hello! How are you?"}
],
"max_tokens": 150,
"temperature": 0.7
}'
Streaming Chat
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/DialoGPT-medium",
"messages": [
{"role": "user", "content": "Tell me a joke"}
],
"stream": true
}'
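Assuming the stream is delivered as OpenAI-style server-sent event lines (an assumption about the framing), it can be consumed from Python roughly like this:
# Sketch: read a streaming chat completion chunk by chunk.
import requests

payload = {
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream": True,
}
with requests.post("http://localhost:8000/v1/chat/completions", json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line:
            print(line.decode("utf-8"))  # each non-empty line is one streamed chunk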
Files
- `app.py` - Original Gradio ChatInterface (still functional)
- `backend_service.py` - New FastAPI backend service
- `test_api.py` - Comprehensive API testing
- `usage_examples.py` - Simple usage examples
- `requirements.txt` - Updated dependencies
- `CONVERSION_COMPLETE.md` - Detailed conversion documentation
Features
- OpenAI-Compatible API - Drop-in replacement for the OpenAI API
- Async FastAPI - High-performance async architecture
- Streaming Support - Real-time response streaming
- Error Handling - Robust error handling with fallbacks
- Production Ready - CORS, logging, health checks
- Docker Ready - Easy containerization
- Auto-reload - Development-friendly auto-reload
- Type Safety - Full type hints with Pydantic validation
Service URLs
- Backend Service: http://localhost:8000
- API Documentation: http://localhost:8000/docs
- OpenAPI Spec: http://localhost:8000/openapi.json
Model Information
- Current Model: `microsoft/DialoGPT-medium`
- Type: Conversational AI model
- Provider: HuggingFace Inference API
- Capabilities: Text generation, chat completion
Architecture
Client Request (OpenAI format) --> FastAPI Backend (backend_service) --> HuggingFace API (DialoGPT-medium)
                                              |
                                              v
                                   OpenAI Response (JSON/Streaming)
Development
The service includes:
- Auto-reload for development
- Comprehensive logging for debugging
- Type checking for code quality
- Test suite for reliability
- Error handling for robustness
Production Deployment
Ready for production with:
- Environment variables for configuration
- Health check endpoints for monitoring
- CORS support for web applications
- Docker compatibility for containerization
- Structured logging for observability
Conversion Status: COMPLETE!
Successfully transformed from broken Gradio app to production-ready AI backend service.
For detailed conversion documentation, see CONVERSION_COMPLETE.md.
Gemma 3n GGUF FastAPI Backend (Hugging Face Space)
This Space provides an OpenAI-compatible chat API for Gemma 3n GGUF models, powered by FastAPI.
Note: On Hugging Face Spaces, the backend runs in `DEMO_MODE` (no model loaded) for demonstration and endpoint testing. For real inference, run locally with a GGUF model and llama-cpp-python.
Endpoints
- `/health` - Health check
- `/v1/chat/completions` - OpenAI-style chat completions (returns demo response)
- `/train/start` - Start a (demo) training job
- `/train/status/{job_id}` - Check training job status
- `/train/logs/{job_id}` - Get training logs
Usage
- Clone this repo or create a Hugging Face Space (type: FastAPI).
- All dependencies are in `requirements.txt`.
- The Space will start in demo mode (no model download required).
Local Inference (with GGUF)
To run with a real model locally:
- Download a Gemma 3n GGUF model (e.g. from https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF).
- Set `AI_MODEL` to the local path or repo.
- Unset `DEMO_MODE`.
- Run:
pip install -r requirements.txt
uvicorn gemma_gguf_backend:app --host 0.0.0.0 --port 8000
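For a quick local sanity check outside the FastAPI service, llama-cpp-python can load the GGUF file directly; the file name below is a placeholder for whichever quantization you downloaded.
# Sketch: load a local Gemma 3n GGUF file and run a single chat turn.
from llama_cpp import Llama

llm = Llama(model_path="gemma-3n-E4B-it-Q4_K_XL.gguf", n_ctx=4096)  # placeholder file name
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(result["choices"][0]["message"]["content"])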
License
Apache 2.0