# Fine-tuning Gemma 3n E4B on MacBook M1 (Apple Silicon) with Unsloth

This project supports local fine-tuning of the Gemma 3n E4B model using Unsloth, PEFT/LoRA, and export to GGUF Q4_K_XL for efficient inference. The workflow is optimized for Apple Silicon (M1/M2/M3) and avoids CUDA/bitsandbytes dependencies.

## Prerequisites

- Python 3.10+
- macOS with Apple Silicon (M1/M2/M3)
- PyTorch with MPS backend (install via `pip install torch`)
- All dependencies in `requirements.txt` (install with `pip install -r requirements.txt`)

## Training Script Usage

Run the training script with your dataset (JSON/JSONL or Hugging Face format):

```bash
python training/train_gemma_unsloth.py \
  --job-id myjob \
  --output-dir training_runs/myjob \
  --dataset sample_data/train.jsonl \
  --prompt-field prompt --response-field response \
  --epochs 1 --batch-size 1 --gradient-accumulation 8 \
  --use-fp16 \
  --grpo --cpt \
  --export-gguf --gguf-out training_runs/myjob/adapter-gguf-q4_k_xl
```

**Flags:**

- `--grpo`: Enable GRPO (if supported by Unsloth)
- `--cpt`: Enable CPT (if supported by Unsloth)
- `--export-gguf`: Export to GGUF Q4_K_XL after training
- `--gguf-out`: Path to save GGUF file

**Notes:**

- On Mac, bitsandbytes/xformers are disabled automatically.
- Training is slower than on CUDA GPUs; use small batch sizes and gradient accumulation.
- If Unsloth's GGUF export is unavailable, follow the printed instructions to use llama.cpp's `convert-hf-to-gguf.py`.

## Troubleshooting

- If you see errors about missing CUDA or bitsandbytes, ensure you are running on Apple Silicon and have the latest Unsloth/Transformers.
- For memory errors, reduce `--batch-size` or `--cutoff-len`.
- For best results, use datasets formatted to match the official Gemma 3n chat template.
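The example command above reads prompt/response pairs from a JSONL file. As a reference for the expected shape, here is a minimal sketch that writes a tiny dataset compatible with the `--prompt-field prompt --response-field response` flags; the file path and field names simply mirror the example command and are not fixed by the script.

```python
import json
import os

# Two toy records in the prompt/response layout consumed by
# `--prompt-field prompt --response-field response` in the command above.
records = [
    {
        "prompt": "Summarize: Apple Silicon Macs run PyTorch through the MPS backend.",
        "response": "On Apple Silicon, PyTorch uses the MPS backend for GPU acceleration.",
    },
    {
        "prompt": "What format does the export step produce?",
        "response": "The adapter is exported to GGUF with Q4_K_XL quantization.",
    },
]

# Write one JSON object per line (JSONL), matching the --dataset path used above.
os.makedirs("sample_data", exist_ok=True)
with open("sample_data/train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```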
## Example: Manual GGUF Export with llama.cpp

If the script prints a message about manual conversion, run:

```bash
python convert-hf-to-gguf.py --outtype q4_k_xl --outfile training_runs/myjob/adapter-gguf-q4_k_xl training_runs/myjob/adapter
```

## References

- [Unsloth Documentation](https://unsloth.ai/)
- [Gemma 3n E4B Model Card](https://huggingface.co/unsloth/gemma-3n-E4B-it)
- [llama.cpp GGUF Export Guide](https://github.com/ggerganov/llama.cpp)

---
title: Multimodal AI Backend Service
emoji: πŸš€
colorFrom: yellow
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
---

# firstAI - Multimodal AI Backend πŸš€

A powerful AI backend service with **multimodal capabilities** and **advanced deployment support** - supporting both text generation and image analysis using transformers pipelines.

## πŸŽ‰ Features

### πŸ€– Configurable AI Models

- **Default Text Model**: Microsoft DialoGPT-medium (deployment-friendly)
- **Advanced Models**: Support for quantized models (Unsloth, 4-bit, GGUF)
- **Environment Configuration**: Runtime model selection via environment variables
- **Quantization Support**: Automatic 4-bit quantization with fallback mechanisms

### πŸ–ΌοΈ Multimodal Support

- Process text-only messages
- Analyze images from URLs
- Combined image + text conversations
- OpenAI Vision API compatible format

### πŸš€ Production Ready

- **Enhanced Deployment**: Multi-level fallback for quantized models
- **Environment Flexibility**: Works in constrained deployment environments
- **Error Resilience**: Comprehensive error handling with graceful degradation
- FastAPI backend with automatic docs
- Health checks and monitoring
- PyTorch with MPS acceleration (Apple Silicon)

### πŸ”§ Model Configuration

Configure models via environment variables:

```bash
# Set custom text model (optional)
export AI_MODEL="microsoft/DialoGPT-medium"

# Set custom vision model (optional)
export VISION_MODEL="Salesforce/blip-image-captioning-base"

# For private models (optional)
export HF_TOKEN="your_huggingface_token"
```

**Supported Model Types:**

- Standard models: `microsoft/DialoGPT-medium`, `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`
- Quantized models: `unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit`
- GGUF models: `unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF`

## πŸš€ Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Start the Service

```bash
python backend_service.py
```

### 3. Test Multimodal Capabilities

```bash
python test_final.py
```

The service will start on **http://localhost:8001** with both text and vision models loaded.

## πŸ’‘ Usage Examples

### Text-Only Chat

```bash
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

### Image Analysis

```bash
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "image", "url": "https://example.com/image.jpg" }
        ]
      }
    ]
  }'
```

### Multimodal (Image + Text)

```bash
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "image", "url": "https://example.com/image.jpg" },
          { "type": "text", "text": "What do you see in this image?" }
        ]
      }
    ]
  }'
```
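The same multimodal request can be made from Python. This is a minimal sketch using the `requests` package against the payload format shown above; it assumes the service is running locally on port 8001 as in the Quick Start, the image URL is a placeholder, and the response is assumed to follow the OpenAI chat-completions shape the service advertises.

```python
import requests

# Multimodal (image + text) request using the OpenAI-Vision-style payload shown above.
# Assumes backend_service.py is running locally on port 8001; the image URL is a placeholder.
payload = {
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://example.com/image.jpg"},
                {"type": "text", "text": "What do you see in this image?"},
            ],
        }
    ],
}

resp = requests.post("http://localhost:8001/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
# Assumes an OpenAI-style chat-completions response body.
print(resp.json()["choices"][0]["message"]["content"])
```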
## πŸ”§ Technical Details

### Architecture

- **FastAPI** web framework
- **Transformers** pipeline for AI models
- **PyTorch** backend with GPU/MPS support
- **Pydantic** for request/response validation

### Models

- **Text**: microsoft/DialoGPT-medium
- **Vision**: Salesforce/blip-image-captioning-base

### API Endpoints

- `GET /` - Service information
- `GET /health` - Health check
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completions (text/multimodal)
- `GET /docs` - Interactive API documentation

## πŸš€ Deployment

### Environment Variables

```bash
# Optional: Custom models
export AI_MODEL="microsoft/DialoGPT-medium"
export VISION_MODEL="Salesforce/blip-image-captioning-base"
export HF_TOKEN="your_token_here"  # For private models
```

### Production Deployment

The service includes enhanced deployment capabilities:

- **Quantized Model Support**: Automatic handling of 4-bit and GGUF models
- **Fallback Mechanisms**: Multi-level fallback for constrained environments
- **Error Resilience**: Graceful degradation when quantization libraries are unavailable

### Docker Deployment

```bash
# Build and run with Docker
docker build -t firstai .
docker run -p 8000:8000 firstai
```

### Testing Deployment

```bash
# Test quantization detection and fallbacks
python test_deployment_fallbacks.py

# Test health endpoint
curl http://localhost:8000/health
```

For comprehensive deployment guidance, see `DEPLOYMENT_ENHANCEMENTS.md`.

## πŸ§ͺ Testing

Run the comprehensive test suite:

```bash
python test_final.py
```

Test individual components:

```bash
python test_multimodal.py   # Basic multimodal tests
python test_pipeline.py     # Pipeline compatibility
```

## πŸ“¦ Dependencies

Key packages:

- `fastapi` - Web framework
- `transformers` - AI model pipelines
- `torch` - PyTorch backend
- `Pillow` - Image processing
- `accelerate` - Model acceleration
- `requests` - HTTP client

## 🎯 Integration Complete

This project successfully integrates:

βœ… **Transformers image-text-to-text pipeline**
βœ… **OpenAI Vision API compatibility**
βœ… **Multimodal message processing**
βœ… **Production-ready FastAPI service**

See `MULTIMODAL_INTEGRATION_COMPLETE.md` for detailed integration documentation.

---
title: AI Backend Service
emoji: πŸš€
colorFrom: yellow
colorTo: purple
sdk: fastapi
sdk_version: 0.100.0
app_file: backend_service.py
pinned: false
---

# AI Backend Service πŸš€

**Status: βœ… CONVERSION COMPLETE!**

Successfully converted from a non-functioning Gradio HuggingFace app to a production-ready FastAPI backend service with OpenAI-compatible API endpoints.

## Quick Start

### 1. Setup Environment

```bash
# Activate the virtual environment
source gradio_env/bin/activate

# Install dependencies (already done)
pip install -r requirements.txt
```

### 2. Start the Backend Service

```bash
python backend_service.py --port 8000 --reload
```
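Once the backend is running, a quick reachability check from Python confirms it is up before moving on to the test scripts in the next step. This is a minimal sketch assuming the `requests` package is installed and the default port 8000 used above.

```python
import requests

# Minimal reachability check against the running backend's /health endpoint.
# Assumes the service was started on the default port 8000 as in step 2.
resp = requests.get("http://localhost:8000/health", timeout=10)
resp.raise_for_status()
print(resp.json())
```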
### 3. Test the API

```bash
# Run comprehensive tests
python test_api.py

# Or try usage examples
python usage_examples.py
```

## API Endpoints

| Endpoint               | Method | Description                         |
| ---------------------- | ------ | ----------------------------------- |
| `/`                    | GET    | Service information                 |
| `/health`              | GET    | Health check                        |
| `/v1/models`           | GET    | List available models               |
| `/v1/chat/completions` | POST   | Chat completion (OpenAI compatible) |
| `/v1/completions`      | POST   | Text completion                     |

## Example Usage

### Chat Completion

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 150,
    "temperature": 0.7
  }'
```

### Streaming Chat

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Tell me a joke"}
    ],
    "stream": true
  }'
```
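The streaming endpoint can also be consumed from Python. The sketch below uses `requests` with `stream=True` and assumes the service emits OpenAI-style server-sent-event lines (`data: {...}` terminated by `data: [DONE]`), which is the usual shape for OpenAI-compatible streaming but is not spelled out in this README.

```python
import json
import requests

# Stream a chat completion from the local backend (see the curl example above).
payload = {
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream": True,
}

with requests.post(
    "http://localhost:8000/v1/chat/completions", json=payload, stream=True, timeout=120
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":  # assumed OpenAI-style end-of-stream marker
            break
        chunk = json.loads(data)
        # Print incremental content as it arrives (assumed OpenAI-style delta chunks).
        print(chunk["choices"][0].get("delta", {}).get("content", ""), end="", flush=True)
```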
## Files

- **`app.py`** - Original Gradio ChatInterface (still functional)
- **`backend_service.py`** - New FastAPI backend service ⭐
- **`test_api.py`** - Comprehensive API testing
- **`usage_examples.py`** - Simple usage examples
- **`requirements.txt`** - Updated dependencies
- **`CONVERSION_COMPLETE.md`** - Detailed conversion documentation

## Features

βœ… **OpenAI-Compatible API** - Drop-in replacement for OpenAI API
βœ… **Async FastAPI** - High-performance async architecture
βœ… **Streaming Support** - Real-time response streaming
βœ… **Error Handling** - Robust error handling with fallbacks
βœ… **Production Ready** - CORS, logging, health checks
βœ… **Docker Ready** - Easy containerization
βœ… **Auto-reload** - Development-friendly auto-reload
βœ… **Type Safety** - Full type hints with Pydantic validation

## Service URLs

- **Backend Service**: http://localhost:8000
- **API Documentation**: http://localhost:8000/docs
- **OpenAPI Spec**: http://localhost:8000/openapi.json

## Model Information

- **Current Model**: `microsoft/DialoGPT-medium`
- **Type**: Conversational AI model
- **Provider**: HuggingFace Inference API
- **Capabilities**: Text generation, chat completion

## Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Client Request    │───▢│   FastAPI Backend    │───▢│   HuggingFace API   β”‚
β”‚  (OpenAI format)    β”‚    β”‚  (backend_service)   β”‚    β”‚  (DialoGPT-medium)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                       β”‚
                                       β–Ό
                            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                            β”‚   OpenAI Response    β”‚
                            β”‚  (JSON/Streaming)    β”‚
                            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

## Development

The service includes:

- **Auto-reload** for development
- **Comprehensive logging** for debugging
- **Type checking** for code quality
- **Test suite** for reliability
- **Error handling** for robustness

## Production Deployment

Ready for production with:

- **Environment variables** for configuration
- **Health check endpoints** for monitoring
- **CORS support** for web applications
- **Docker compatibility** for containerization
- **Structured logging** for observability

---

**πŸŽ‰ Conversion Status: COMPLETE!** Successfully transformed from broken Gradio app to production-ready AI backend service.

For detailed conversion documentation, see [`CONVERSION_COMPLETE.md`](CONVERSION_COMPLETE.md).

# Gemma 3n GGUF FastAPI Backend (Hugging Face Space)

This Space provides an OpenAI-compatible chat API for Gemma 3n GGUF models, powered by FastAPI.

**Note:** On Hugging Face Spaces, the backend runs in `DEMO_MODE` (no model loaded) for demonstration and endpoint testing. For real inference, run locally with a GGUF model and llama-cpp-python.

## Endpoints

- `/health` β€” Health check
- `/v1/chat/completions` β€” OpenAI-style chat completions (returns demo response)
- `/train/start` β€” Start a (demo) training job
- `/train/status/{job_id}` β€” Check training job status
- `/train/logs/{job_id}` β€” Get training logs

## Usage

1. **Clone this repo** or create a Hugging Face Space (type: FastAPI).
2. All dependencies are in `requirements.txt`.
3. The Space will start in demo mode (no model download required).

## Local Inference (with GGUF)

To run with a real model locally:

1. Download a Gemma 3n GGUF model (e.g. from https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF).
2. Set `AI_MODEL` to the local path or repo.
3. Unset `DEMO_MODE`.
4. Run:

```bash
pip install -r requirements.txt
uvicorn gemma_gguf_backend:app --host 0.0.0.0 --port 8000
```

## License

Apache 2.0