# Fine-tuning Gemma 3n E4B on MacBook M1 (Apple Silicon) with Unsloth

This project supports local fine-tuning of the Gemma 3n E4B model using Unsloth, PEFT/LoRA, and export to GGUF Q4_K_XL for efficient inference. The workflow is optimized for Apple Silicon (M1/M2/M3) and avoids CUDA/bitsandbytes dependencies.

## Prerequisites

- Python 3.10+
- macOS with Apple Silicon (M1/M2/M3)
- PyTorch with MPS backend (install via `pip install torch`)
- All dependencies in `requirements.txt` (install with `pip install -r requirements.txt`)

## Training Script Usage

Run the training script with your dataset (JSON/JSONL or Hugging Face format):

```bash
python training/train_gemma_unsloth.py \
  --job-id myjob \
  --output-dir training_runs/myjob \
  --dataset sample_data/train.jsonl \
  --prompt-field prompt --response-field response \
  --epochs 1 --batch-size 1 --gradient-accumulation 8 \
  --use-fp16 \
  --grpo --cpt \
  --export-gguf --gguf-out training_runs/myjob/adapter-gguf-q4_k_xl
```

**Flags:**

- `--grpo`: Enable GRPO (if supported by Unsloth)
- `--cpt`: Enable CPT (if supported by Unsloth)
- `--export-gguf`: Export to GGUF Q4_K_XL after training
- `--gguf-out`: Path to save GGUF file

**Notes:**

- On Mac, bitsandbytes/xformers are disabled automatically.
- Training is slower than on CUDA GPUs; use small batch sizes and gradient accumulation.
- If Unsloth's GGUF export is unavailable, follow the printed instructions to use llama.cpp's `convert-hf-to-gguf.py`.

## Troubleshooting

- If you see errors about missing CUDA or bitsandbytes, ensure you are running on Apple Silicon and have the latest Unsloth/Transformers.
- For memory errors, reduce `--batch-size` or `--cutoff-len`.
- For best results, use datasets formatted to match the official Gemma 3n chat template.
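The example command above reads prompt/response pairs from a JSONL file. As a reference for the expected shape, here is a minimal sketch that writes a tiny dataset compatible with the `--prompt-field prompt --response-field response` flags; the file path and field names simply mirror the example command and are not fixed by the script.

```python
import json
import os

# Two toy records in the prompt/response layout consumed by
# `--prompt-field prompt --response-field response` in the command above.
records = [
    {
        "prompt": "Summarize: Apple Silicon Macs run PyTorch through the MPS backend.",
        "response": "On Apple Silicon, PyTorch uses the MPS backend for GPU acceleration.",
    },
    {
        "prompt": "What format does the export step produce?",
        "response": "The adapter is exported to GGUF with Q4_K_XL quantization.",
    },
]

# Write one JSON object per line (JSONL), matching the --dataset path used above.
os.makedirs("sample_data", exist_ok=True)
with open("sample_data/train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```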
## Example: Manual GGUF Export with llama.cpp

If the script prints a message about manual conversion, run:

```bash
python convert-hf-to-gguf.py --outtype q4_k_xl --outfile training_runs/myjob/adapter-gguf-q4_k_xl training_runs/myjob/adapter
```

## References

- [Unsloth Documentation](https://unsloth.ai/)
- [Gemma 3n E4B Model Card](https://huggingface.co/unsloth/gemma-3n-E4B-it)
- [llama.cpp GGUF Export Guide](https://github.com/ggerganov/llama.cpp)

---
title: Multimodal AI Backend Service
emoji: πŸš€
colorFrom: yellow
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
---

# firstAI - Multimodal AI Backend πŸš€

A powerful AI backend service with **multimodal capabilities** and **advanced deployment support** - supporting both text generation and image analysis using transformers pipelines.

## πŸŽ‰ Features

### πŸ€– Configurable AI Models

- **Default Text Model**: Microsoft DialoGPT-medium (deployment-friendly)
- **Advanced Models**: Support for quantized models (Unsloth, 4-bit, GGUF)
- **Environment Configuration**: Runtime model selection via environment variables
- **Quantization Support**: Automatic 4-bit quantization with fallback mechanisms

### πŸ–ΌοΈ Multimodal Support

- Process text-only messages
- Analyze images from URLs
- Combined image + text conversations
- OpenAI Vision API compatible format

### πŸš€ Production Ready

- **Enhanced Deployment**: Multi-level fallback for quantized models
- **Environment Flexibility**: Works in constrained deployment environments
- **Error Resilience**: Comprehensive error handling with graceful degradation
- FastAPI backend with automatic docs
- Health checks and monitoring
- PyTorch with MPS acceleration (Apple Silicon)

### πŸ”§ Model Configuration

Configure models via environment variables:

```bash
# Set custom text model (optional)
export AI_MODEL="microsoft/DialoGPT-medium"

# Set custom vision model (optional)
export VISION_MODEL="Salesforce/blip-image-captioning-base"

# For private models (optional)
export HF_TOKEN="your_huggingface_token"
```

**Supported Model Types:**

- Standard models: `microsoft/DialoGPT-medium`, `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`
- Quantized models: `unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit`
- GGUF models: `unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF`

## πŸš€ Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Start the Service

```bash
python backend_service.py
```

### 3. Test Multimodal Capabilities

```bash
python test_final.py
```

The service will start on **http://localhost:8001** with both text and vision models loaded.

## πŸ’‘ Usage Examples

### Text-Only Chat

```bash
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

### Image Analysis

```bash
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "image", "url": "https://example.com/image.jpg" }
        ]
      }
    ]
  }'
```

### Multimodal (Image + Text)

```bash
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "image", "url": "https://example.com/image.jpg" },
          { "type": "text", "text": "What do you see in this image?" }
        ]
      }
    ]
  }'
```
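The same multimodal request can be made from Python. This is a minimal sketch using the `requests` package against the payload format shown above; it assumes the service is running locally on port 8001 as in the Quick Start, the image URL is a placeholder, and the response is assumed to follow the OpenAI chat-completions shape the service advertises.

```python
import requests

# Multimodal (image + text) request using the OpenAI-Vision-style payload shown above.
# Assumes backend_service.py is running locally on port 8001; the image URL is a placeholder.
payload = {
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://example.com/image.jpg"},
                {"type": "text", "text": "What do you see in this image?"},
            ],
        }
    ],
}

resp = requests.post("http://localhost:8001/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
# Assumes an OpenAI-style chat-completions response body.
print(resp.json()["choices"][0]["message"]["content"])
```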
## πŸ”§ Technical Details

### Architecture

- **FastAPI** web framework
- **Transformers** pipeline for AI models
- **PyTorch** backend with GPU/MPS support
- **Pydantic** for request/response validation

### Models

- **Text**: microsoft/DialoGPT-medium
- **Vision**: Salesforce/blip-image-captioning-base

### API Endpoints

- `GET /` - Service information
- `GET /health` - Health check
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completions (text/multimodal)
- `GET /docs` - Interactive API documentation

## πŸš€ Deployment

### Environment Variables

```bash
# Optional: Custom models
export AI_MODEL="microsoft/DialoGPT-medium"
export VISION_MODEL="Salesforce/blip-image-captioning-base"
export HF_TOKEN="your_token_here"  # For private models
```

### Production Deployment

The service includes enhanced deployment capabilities:

- **Quantized Model Support**: Automatic handling of 4-bit and GGUF models
- **Fallback Mechanisms**: Multi-level fallback for constrained environments
- **Error Resilience**: Graceful degradation when quantization libraries are unavailable

### Docker Deployment

```bash
# Build and run with Docker
docker build -t firstai .
docker run -p 8000:8000 firstai
```

### Testing Deployment

```bash
# Test quantization detection and fallbacks
python test_deployment_fallbacks.py

# Test health endpoint
curl http://localhost:8000/health
```

For comprehensive deployment guidance, see `DEPLOYMENT_ENHANCEMENTS.md`.

## πŸ§ͺ Testing

Run the comprehensive test suite:

```bash
python test_final.py
```

Test individual components:

```bash
python test_multimodal.py   # Basic multimodal tests
python test_pipeline.py     # Pipeline compatibility
```

## πŸ“¦ Dependencies

Key packages:

- `fastapi` - Web framework
- `transformers` - AI model pipelines
- `torch` - PyTorch backend
- `Pillow` - Image processing
- `accelerate` - Model acceleration
- `requests` - HTTP client

## 🎯 Integration Complete

This project successfully integrates:

βœ… **Transformers image-text-to-text pipeline**
βœ… **OpenAI Vision API compatibility**
βœ… **Multimodal message processing**
βœ… **Production-ready FastAPI service**

See `MULTIMODAL_INTEGRATION_COMPLETE.md` for detailed integration documentation.

---
title: AI Backend Service
emoji: πŸš€
colorFrom: yellow
colorTo: purple
sdk: fastapi
sdk_version: 0.100.0
app_file: backend_service.py
pinned: false
---

# AI Backend Service πŸš€

**Status: βœ… CONVERSION COMPLETE!**

Successfully converted from a non-functioning Gradio HuggingFace app to a production-ready FastAPI backend service with OpenAI-compatible API endpoints.

## Quick Start

### 1. Setup Environment

```bash
# Activate the virtual environment
source gradio_env/bin/activate

# Install dependencies (already done)
pip install -r requirements.txt
```

### 2. Start the Backend Service

```bash
python backend_service.py --port 8000 --reload
```
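Once the backend is running, a quick reachability check from Python confirms it is up before moving on to the test scripts in the next step. This is a minimal sketch assuming the `requests` package is installed and the default port 8000 used above.

```python
import requests

# Minimal reachability check against the running backend's /health endpoint.
# Assumes the service was started on the default port 8000 as in step 2.
resp = requests.get("http://localhost:8000/health", timeout=10)
resp.raise_for_status()
print(resp.json())
```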
### 3. Test the API

```bash
# Run comprehensive tests
python test_api.py

# Or try usage examples
python usage_examples.py
```

## API Endpoints

| Endpoint               | Method | Description                         |
| ---------------------- | ------ | ----------------------------------- |
| `/`                    | GET    | Service information                 |
| `/health`              | GET    | Health check                        |
| `/v1/models`           | GET    | List available models               |
| `/v1/chat/completions` | POST   | Chat completion (OpenAI compatible) |
| `/v1/completions`      | POST   | Text completion                     |

## Example Usage

### Chat Completion

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 150,
    "temperature": 0.7
  }'
```

### Streaming Chat

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Tell me a joke"}
    ],
    "stream": true
  }'
```
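The streaming endpoint can also be consumed from Python. The sketch below uses `requests` with `stream=True` and assumes the service emits OpenAI-style server-sent-event lines (`data: {...}` terminated by `data: [DONE]`), which is the usual shape for OpenAI-compatible streaming but is not spelled out in this README.

```python
import json
import requests

# Stream a chat completion from the local backend (see the curl example above).
payload = {
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream": True,
}

with requests.post(
    "http://localhost:8000/v1/chat/completions", json=payload, stream=True, timeout=120
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":  # assumed OpenAI-style end-of-stream marker
            break
        chunk = json.loads(data)
        # Print incremental content as it arrives (assumed OpenAI-style delta chunks).
        print(chunk["choices"][0].get("delta", {}).get("content", ""), end="", flush=True)
```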
## Files

- **`app.py`** - Original Gradio ChatInterface (still functional)
- **`backend_service.py`** - New FastAPI backend service ⭐
- **`test_api.py`** - Comprehensive API testing
- **`usage_examples.py`** - Simple usage examples
- **`requirements.txt`** - Updated dependencies
- **`CONVERSION_COMPLETE.md`** - Detailed conversion documentation

## Features

βœ… **OpenAI-Compatible API** - Drop-in replacement for OpenAI API
βœ… **Async FastAPI** - High-performance async architecture
βœ… **Streaming Support** - Real-time response streaming
βœ… **Error Handling** - Robust error handling with fallbacks
βœ… **Production Ready** - CORS, logging, health checks
βœ… **Docker Ready** - Easy containerization
βœ… **Auto-reload** - Development-friendly auto-reload
βœ… **Type Safety** - Full type hints with Pydantic validation

## Service URLs

- **Backend Service**: http://localhost:8000
- **API Documentation**: http://localhost:8000/docs
- **OpenAPI Spec**: http://localhost:8000/openapi.json

## Model Information

- **Current Model**: `microsoft/DialoGPT-medium`
- **Type**: Conversational AI model
- **Provider**: HuggingFace Inference API
- **Capabilities**: Text generation, chat completion

## Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Client Request    │───▢│   FastAPI Backend    │───▢│   HuggingFace API   β”‚
β”‚  (OpenAI format)    β”‚    β”‚  (backend_service)   β”‚    β”‚  (DialoGPT-medium)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                       β”‚
                                       β–Ό
                            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                            β”‚   OpenAI Response    β”‚
                            β”‚  (JSON/Streaming)    β”‚
                            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

## Development

The service includes:

- **Auto-reload** for development
- **Comprehensive logging** for debugging
- **Type checking** for code quality
- **Test suite** for reliability
- **Error handling** for robustness

## Production Deployment

Ready for production with:

- **Environment variables** for configuration
- **Health check endpoints** for monitoring
- **CORS support** for web applications
- **Docker compatibility** for containerization
- **Structured logging** for observability

---

**πŸŽ‰ Conversion Status: COMPLETE!** Successfully transformed from broken Gradio app to production-ready AI backend service.

For detailed conversion documentation, see [`CONVERSION_COMPLETE.md`](CONVERSION_COMPLETE.md).

# Gemma 3n GGUF FastAPI Backend (Hugging Face Space)

This Space provides an OpenAI-compatible chat API for Gemma 3n GGUF models, powered by FastAPI.

**Note:** On Hugging Face Spaces, the backend runs in `DEMO_MODE` (no model loaded) for demonstration and endpoint testing. For real inference, run locally with a GGUF model and llama-cpp-python.

## Endpoints

- `/health` β€” Health check
- `/v1/chat/completions` β€” OpenAI-style chat completions (returns demo response)
- `/train/start` β€” Start a (demo) training job
- `/train/status/{job_id}` β€” Check training job status
- `/train/logs/{job_id}` β€” Get training logs

## Usage

1. **Clone this repo** or create a Hugging Face Space (type: FastAPI).
2. All dependencies are in `requirements.txt`.
3. The Space will start in demo mode (no model download required).

## Local Inference (with GGUF)

To run with a real model locally:

1. Download a Gemma 3n GGUF model (e.g. from https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF).
2. Set `AI_MODEL` to the local path or repo.
3. Unset `DEMO_MODE`.
4. Run:

```bash
pip install -r requirements.txt
uvicorn gemma_gguf_backend:app --host 0.0.0.0 --port 8000
```

## License

Apache 2.0