
Fine-tuning Gemma 3n E4B on MacBook M1 (Apple Silicon) with Unsloth

This project supports local fine-tuning of the Gemma 3n E4B model using Unsloth, PEFT/LoRA, and export to GGUF Q4_K_XL for efficient inference. The workflow is optimized for Apple Silicon (M1/M2/M3) and avoids CUDA/bitsandbytes dependencies.

Prerequisites

  • Python 3.10+
  • macOS with Apple Silicon (M1/M2/M3)
  • PyTorch with MPS backend (install via pip install torch); a quick availability check follows this list
  • All dependencies in requirements.txt (install with pip install -r requirements.txt)
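
To confirm the MPS backend mentioned above is usable before starting a run, a quick check (assuming a recent PyTorch build) is:

# Quick check that PyTorch can see the Apple Silicon (MPS) backend.
import torch

print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())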

Training Script Usage

Run the training script with your dataset (JSON/JSONL or Hugging Face format):

python training/train_gemma_unsloth.py \
  --job-id myjob \
  --output-dir training_runs/myjob \
  --dataset sample_data/train.jsonl \
  --prompt-field prompt --response-field response \
  --epochs 1 --batch-size 1 --gradient-accumulation 8 \
  --use-fp16 \
  --grpo --cpt \
  --export-gguf --gguf-out training_runs/myjob/adapter-gguf-q4_k_xl
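
Each dataset record must carry the fields named by --prompt-field and --response-field. A minimal sketch that writes a couple of records in the JSONL layout implied by the command above (field names prompt/response; the example texts are placeholders):

# Write a tiny JSONL training set with the prompt/response fields
# expected by --prompt-field/--response-field above.
import json
from pathlib import Path

records = [
    {"prompt": "What is the capital of France?", "response": "Paris."},
    {"prompt": "Summarize: The cat sat on the mat.", "response": "A cat sat on a mat."},
]

Path("sample_data").mkdir(exist_ok=True)
with open("sample_data/train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")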

Flags:

  • --grpo: Enable GRPO (Group Relative Policy Optimization), if supported by your Unsloth version
  • --cpt: Enable CPT (continued pretraining), if supported by your Unsloth version
  • --export-gguf: Export to GGUF Q4_K_XL after training
  • --gguf-out: Path to save GGUF file

Notes:

  • On Mac, bitsandbytes/xformers are disabled automatically.
  • Training is slower than on CUDA GPUs; use small batch sizes and gradient accumulation.
  • If Unsloth's GGUF export is unavailable, follow the printed instructions to use llama.cpp's convert-hf-to-gguf.py.

Troubleshooting

  • If you see errors about missing CUDA or bitsandbytes, ensure you are running on Apple Silicon and have the latest Unsloth/Transformers.
  • For memory errors, reduce --batch-size or --cutoff-len.
  • For best results, format your dataset to match the official Gemma 3n chat template; a sketch of the turn format follows this list.
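
Gemma-family models expect turn-delimited conversations. The sketch below renders a prompt/response pair with the <start_of_turn>/<end_of_turn> markers used by the Gemma chat template; it is illustrative only, and in practice you should apply the model tokenizer's own chat template (e.g. tokenizer.apply_chat_template in transformers) rather than hand-rolling the string.

# Illustrative only: render one example in Gemma-style turn markup.
def to_gemma_turns(prompt: str, response: str) -> str:
    return (
        "<start_of_turn>user\n" + prompt + "<end_of_turn>\n"
        "<start_of_turn>model\n" + response + "<end_of_turn>\n"
    )

print(to_gemma_turns("What is the capital of France?", "Paris."))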

Example: Manual GGUF Export with llama.cpp

If the script prints a message about manual conversion, run llama.cpp's converter against the saved adapter/model directory, for example:

python convert-hf-to-gguf.py --outtype q4_k_xl --outfile training_runs/myjob/adapter-gguf-q4_k_xl training_runs/myjob/adapter

Note that, depending on your llama.cpp version, convert-hf-to-gguf.py may not accept q4_k_xl directly; in that case convert to f16 first and quantize the resulting GGUF to a Q4_K variant with the llama-quantize tool.

title: Multimodal AI Backend Service
emoji: πŸš€
colorFrom: yellow
colorTo: purple
sdk: docker
app_port: 8000
pinned: false


firstAI - Multimodal AI Backend πŸš€

A powerful AI backend service with multimodal capabilities and advanced deployment support, handling both text generation and image analysis through transformers pipelines.

πŸŽ‰ Features

πŸ€– Configurable AI Models

  • Default Text Model: Microsoft DialoGPT-medium (deployment-friendly)
  • Advanced Models: Support for quantized models (Unsloth, 4-bit, GGUF)
  • Environment Configuration: Runtime model selection via environment variables
  • Quantization Support: Automatic 4-bit quantization with fallback mechanisms

πŸ–ΌοΈ Multimodal Support

  • Process text-only messages
  • Analyze images from URLs
  • Combined image + text conversations
  • OpenAI Vision API compatible format

πŸš€ Production Ready

  • Enhanced Deployment: Multi-level fallback for quantized models
  • Environment Flexibility: Works in constrained deployment environments
  • Error Resilience: Comprehensive error handling with graceful degradation
  • FastAPI backend with automatic docs
  • Health checks and monitoring
  • PyTorch with MPS acceleration (Apple Silicon)

πŸ”§ Model Configuration

Configure models via environment variables:

# Set custom text model (optional)
export AI_MODEL="microsoft/DialoGPT-medium"

# Set custom vision model (optional)
export VISION_MODEL="Salesforce/blip-image-captioning-base"

# For private models (optional)
export HF_TOKEN="your_huggingface_token"

Supported Model Types:

  • Standard models: microsoft/DialoGPT-medium, deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
  • Quantized models: unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit
  • GGUF models: unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF
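
How the environment-driven selection above is typically wired up: a hedged sketch (this repo's backend_service.py may differ) that reads AI_MODEL and falls back to the default text model if the requested, possibly quantized, model cannot be loaded:

# Hedged sketch of env-var-driven model selection with a transformers
# pipeline and a simple fallback; the real backend_service.py may differ.
import os
from transformers import pipeline

DEFAULT_TEXT_MODEL = "microsoft/DialoGPT-medium"

def load_text_pipeline():
    model_id = os.environ.get("AI_MODEL", DEFAULT_TEXT_MODEL)
    token = os.environ.get("HF_TOKEN")  # only needed for private models
    try:
        return pipeline("text-generation", model=model_id, token=token)
    except Exception:
        # Fall back to the deployment-friendly default if the requested
        # model cannot be loaded in this environment.
        return pipeline("text-generation", model=DEFAULT_TEXT_MODEL)

generator = load_text_pipeline()
print(generator("Hello!", max_new_tokens=30)[0]["generated_text"])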

πŸš€ Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Start the Service

python backend_service.py

3. Test Multimodal Capabilities

python test_final.py

The service will start on http://localhost:8001 with both text and vision models loaded.

πŸ’‘ Usage Examples

Text-Only Chat

curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Image Analysis

curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "url": "https://example.com/image.jpg"
          }
        ]
      }
    ]
  }'

Multimodal (Image + Text)

curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "url": "https://example.com/image.jpg"
          },
          {
            "type": "text",
            "text": "What do you see in this image?"
          }
        ]
      }
    ]
  }'
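
The same multimodal payload can be sent from Python with the requests package (already listed in the dependencies); the image URL is a placeholder:

# Send an image + text message to the local service using requests.
import requests

payload = {
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://example.com/image.jpg"},  # placeholder URL
                {"type": "text", "text": "What do you see in this image?"},
            ],
        }
    ],
}

resp = requests.post("http://localhost:8001/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json())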

πŸ”§ Technical Details

Architecture

  • FastAPI web framework
  • Transformers pipeline for AI models
  • PyTorch backend with GPU/MPS support
  • Pydantic for request/response validation

Models

  • Text: microsoft/DialoGPT-medium
  • Vision: Salesforce/blip-image-captioning-base

API Endpoints

  • GET / - Service information
  • GET /health - Health check
  • GET /v1/models - List available models
  • POST /v1/chat/completions - Chat completions (text/multimodal)
  • GET /docs - Interactive API documentation

πŸš€ Deployment

Environment Variables

# Optional: Custom models
export AI_MODEL="microsoft/DialoGPT-medium"
export VISION_MODEL="Salesforce/blip-image-captioning-base"
export HF_TOKEN="your_token_here"  # For private models

Production Deployment

The service includes enhanced deployment capabilities:

  • Quantized Model Support: Automatic handling of 4-bit and GGUF models
  • Fallback Mechanisms: Multi-level fallback for constrained environments
  • Error Resilience: Graceful degradation when quantization libraries are unavailable

Docker Deployment

# Build and run with Docker
docker build -t firstai .
docker run -p 8000:8000 firstai

Testing Deployment

# Test quantization detection and fallbacks
python test_deployment_fallbacks.py

# Test health endpoint
curl http://localhost:8000/health

For comprehensive deployment guidance, see DEPLOYMENT_ENHANCEMENTS.md.

πŸ§ͺ Testing

Run the comprehensive test suite:

python test_final.py

Test individual components:

python test_multimodal.py  # Basic multimodal tests
python test_pipeline.py    # Pipeline compatibility

πŸ“¦ Dependencies

Key packages:

  • fastapi - Web framework
  • transformers - AI model pipelines
  • torch - PyTorch backend
  • Pillow - Image processing
  • accelerate - Model acceleration
  • requests - HTTP client

🎯 Integration Complete

This project successfully integrates:

βœ… Transformers image-text-to-text pipeline
βœ… OpenAI Vision API compatibility
βœ… Multimodal message processing
βœ… Production-ready FastAPI service

See MULTIMODAL_INTEGRATION_COMPLETE.md for detailed integration documentation.

title: AI Backend Service
emoji: πŸš€
colorFrom: yellow
colorTo: purple
sdk: fastapi
sdk_version: 0.100.0
app_file: backend_service.py
pinned: false

AI Backend Service πŸš€

Status: βœ… CONVERSION COMPLETE!

Successfully converted from a non-functioning Gradio HuggingFace app to a production-ready FastAPI backend service with OpenAI-compatible API endpoints.

Quick Start

1. Setup Environment

# Activate the virtual environment
source gradio_env/bin/activate

# Install dependencies (already done)
pip install -r requirements.txt

2. Start the Backend Service

python backend_service.py --port 8000 --reload

3. Test the API

# Run comprehensive tests
python test_api.py

# Or try usage examples
python usage_examples.py

API Endpoints

  • GET / - Service information
  • GET /health - Health check
  • GET /v1/models - List available models
  • POST /v1/chat/completions - Chat completion (OpenAI compatible)
  • POST /v1/completions - Text completion

Example Usage

Chat Completion

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 150,
    "temperature": 0.7
  }'
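
Because the endpoints follow the OpenAI schema, the official openai Python client can also be pointed at the local service. A hedged sketch (install the openai package if it is not already in your environment; the API key value is unused by this backend):

# Use the OpenAI Python client against the local OpenAI-compatible backend.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="microsoft/DialoGPT-medium",
    messages=[{"role": "user", "content": "Hello! How are you?"}],
    max_tokens=150,
    temperature=0.7,
)
print(resp.choices[0].message.content)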

Streaming Chat

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Tell me a joke"}
    ],
    "stream": true
  }'
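
Consuming the stream from Python, assuming the service emits OpenAI-style server-sent events (data: ... lines terminated by data: [DONE]):

# Read an OpenAI-style streaming response chunk by chunk.
import json
import requests

payload = {
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream": True,
}

with requests.post("http://localhost:8000/v1/chat/completions", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for raw in resp.iter_lines():
        if not raw:
            continue
        line = raw.decode("utf-8")
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data.strip() == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        print(delta.get("content", ""), end="", flush=True)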

Files

  • app.py - Original Gradio ChatInterface (still functional)
  • backend_service.py - New FastAPI backend service ⭐
  • test_api.py - Comprehensive API testing
  • usage_examples.py - Simple usage examples
  • requirements.txt - Updated dependencies
  • CONVERSION_COMPLETE.md - Detailed conversion documentation

Features

βœ… OpenAI-Compatible API - Drop-in replacement for OpenAI API
βœ… Async FastAPI - High-performance async architecture
βœ… Streaming Support - Real-time response streaming
βœ… Error Handling - Robust error handling with fallbacks
βœ… Production Ready - CORS, logging, health checks
βœ… Docker Ready - Easy containerization
βœ… Auto-reload - Development-friendly auto-reload
βœ… Type Safety - Full type hints with Pydantic validation

Service URLs

Model Information

  • Current Model: microsoft/DialoGPT-medium
  • Type: Conversational AI model
  • Provider: HuggingFace Inference API
  • Capabilities: Text generation, chat completion

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Client Request    │───▢│   FastAPI Backend    │───▢│  HuggingFace API    β”‚
β”‚  (OpenAI format)    β”‚    β”‚  (backend_service)   β”‚    β”‚  (DialoGPT-medium)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                       β”‚
                                       β–Ό
                           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                           β”‚   OpenAI Response    β”‚
                           β”‚   (JSON/Streaming)   β”‚
                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Development

The service includes:

  • Auto-reload for development
  • Comprehensive logging for debugging
  • Type checking for code quality
  • Test suite for reliability
  • Error handling for robustness

Production Deployment

Ready for production with:

  • Environment variables for configuration
  • Health check endpoints for monitoring
  • CORS support for web applications
  • Docker compatibility for containerization
  • Structured logging for observability

πŸŽ‰ Conversion Status: COMPLETE!
Successfully transformed from a broken Gradio app into a production-ready AI backend service.

For detailed conversion documentation, see CONVERSION_COMPLETE.md.

Gemma 3n GGUF FastAPI Backend (Hugging Face Space)

This Space provides an OpenAI-compatible chat API for Gemma 3n GGUF models, powered by FastAPI.

Note: On Hugging Face Spaces, the backend runs in DEMO_MODE (no model loaded) for demonstration and endpoint testing. For real inference, run locally with a GGUF model and llama-cpp-python.

Endpoints

  • /health β€” Health check
  • /v1/chat/completions β€” OpenAI-style chat completions (returns demo response)
  • /train/start β€” Start a (demo) training job
  • /train/status/{job_id} β€” Check training job status
  • /train/logs/{job_id} β€” Get training logs
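
A quick smoke test of the demo endpoints from Python; the base URL is a placeholder for your own Space (or http://localhost:8000 when running locally), and the model name is ignored in demo mode:

# Hit the health and chat endpoints of the (demo-mode) backend.
import requests

BASE = "https://your-space.hf.space"  # placeholder: replace with your Space URL or http://localhost:8000

print(requests.get(f"{BASE}/health", timeout=30).json())

payload = {
    "model": "gemma-3n",  # placeholder model name; demo mode returns a canned response
    "messages": [{"role": "user", "content": "Hello!"}],
}
print(requests.post(f"{BASE}/v1/chat/completions", json=payload, timeout=60).json())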

Usage

  1. Clone this repo or create a Hugging Face Space (type: FastAPI).
  2. All dependencies are in requirements.txt.
  3. The Space will start in demo mode (no model download required).

Local Inference (with GGUF)

To run with a real model locally:

  1. Download a Gemma 3n GGUF model (e.g. from https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF).
  2. Set AI_MODEL to the local path or repo.
  3. Unset DEMO_MODE.
  4. Run:
    pip install -r requirements.txt
    uvicorn gemma_gguf_backend:app --host 0.0.0.0 --port 8000
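
Independent of the FastAPI wrapper, the GGUF model can be exercised directly with llama-cpp-python; a minimal sketch (the model path is a placeholder for the file downloaded in step 1):

# Load a local Gemma 3n GGUF file with llama-cpp-python and run one chat turn.
from llama_cpp import Llama

llm = Llama(
    model_path="models/gemma-3n-E4B-it.Q4_K_XL.gguf",  # placeholder path to your downloaded GGUF
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello! What can you do?"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])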
    

License

Apache 2.0