# Hugging Face Spaces: FastAPI OpenAI-Compatible Backend
This project is ready to deploy as a Hugging Face Space using FastAPI and transformers (no vLLM, no llama-cpp/GGUF).
## Features
- OpenAI-compatible `/v1/chat/completions` endpoint
- Multimodal support (text + image, if the model supports it)
- Environment variable support via `.env`
- Hugging Face Spaces compatible (CPU or T4/RTX GPU)
## Usage (Local)
```bash
pip install -r requirements.txt
python -m uvicorn backend_service:app --host 0.0.0.0 --port 7860
```
## Usage (Hugging Face Spaces)
- Push this repo to your Hugging Face Space
- The Space will auto-launch the FastAPI backend
- Use the `/v1/chat/completions` endpoint from any OpenAI-compatible client, as in the sketch below
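Any OpenAI-compatible client can talk to the Space by pointing its base URL at `/v1`. A minimal sketch with the `openai` Python package (`<your-space>` is a placeholder, and this backend does not check the API key):
```python
from openai import OpenAI

# <your-space> is a placeholder for your Space's subdomain.
client = OpenAI(base_url="https://<your-space>.hf.space/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="google/gemma-3n-E4B-it",  # whichever model the backend serves
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```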
## Notes
- Only transformers models are supported (no GGUF/llama-cpp, no vLLM)
- Set your model in the `AI_MODEL` environment variable or edit `backend_service.py`
- For secrets, use the Hugging Face Spaces Secrets UI or a `.env` file
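A minimal `.env` sketch, assuming the variable names described above (the token value is a placeholder):
```bash
# .env (illustrative values only)
AI_MODEL=google/gemma-3n-E4B-it
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
```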
## Example curl
```bash
curl -X POST https://<your-space>.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3n-E4B-it", "messages": [{"role": "user", "content": "Hello!"}]}'
```
---
For more, see the Hugging Face Spaces docs: https://huggingface.co/docs/hub/spaces-sdks-docker
# Fallback Logic
If vLLM fails to start or respond, the backend automatically falls back to the legacy backend, along the lines of the sketch below.
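A minimal sketch of that health-check-style fallback, assuming a vLLM server on localhost and a `legacy_generate` stand-in (all names here are illustrative, not the actual implementation):
```python
import requests

VLLM_URL = "http://localhost:8000"  # illustrative vLLM server address

def vllm_available(timeout: float = 2.0) -> bool:
    """Probe vLLM's health endpoint; any error counts as 'down'."""
    try:
        return requests.get(f"{VLLM_URL}/health", timeout=timeout).ok
    except requests.RequestException:
        return False

def legacy_generate(prompt: str) -> str:
    # Stand-in for the transformers-based legacy backend (illustrative).
    return f"[legacy] {prompt}"

def generate(prompt: str) -> str:
    if vllm_available():
        # Forward to vLLM's OpenAI-compatible completions endpoint.
        r = requests.post(
            f"{VLLM_URL}/v1/completions",
            json={"model": "default", "prompt": prompt, "max_tokens": 128},
            timeout=60,
        )
        r.raise_for_status()
        return r.json()["choices"][0]["text"]
    return legacy_generate(prompt)
```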
# Fine-tuning Gemma 3n E4B on MacBook M1 (Apple Silicon) with Unsloth
This project supports local fine-tuning of the Gemma 3n E4B model using Unsloth, PEFT/LoRA, and export to GGUF Q4_K_XL for efficient inference. The workflow is optimized for Apple Silicon (M1/M2/M3) and avoids CUDA/bitsandbytes dependencies.
## Prerequisites
- Python 3.10+
- macOS with Apple Silicon (M1/M2/M3)
- PyTorch with MPS backend (install via `pip install torch`)
- All dependencies in `requirements.txt` (install with `pip install -r requirements.txt`)
## Training Script Usage
Run the training script with your dataset (JSON/JSONL or Hugging Face format):
```bash
python training/train_gemma_unsloth.py \
  --job-id myjob \
  --output-dir training_runs/myjob \
  --dataset sample_data/train.jsonl \
  --prompt-field prompt --response-field response \
  --epochs 1 --batch-size 1 --gradient-accumulation 8 \
  --use-fp16 \
  --grpo --cpt \
  --export-gguf --gguf-out training_runs/myjob/adapter-gguf-q4_k_xl
```
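With `--prompt-field prompt --response-field response`, each line of `sample_data/train.jsonl` is expected to be one JSON object with those two fields. An illustrative record (the content is a made-up example):
```json
{"prompt": "What is the capital of France?", "response": "Paris is the capital of France."}
```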
**Flags:**
- `--grpo`: Enable GRPO (Group Relative Policy Optimization), if supported by your Unsloth version
- `--cpt`: Enable CPT (continued pretraining), if supported by your Unsloth version
- `--export-gguf`: Export to GGUF Q4_K_XL after training
- `--gguf-out`: Path to save the GGUF file
**Notes:**
- On Mac, bitsandbytes/xformers are disabled automatically.
- Training is slower than on CUDA GPUs; use small batch sizes and gradient accumulation.
- If Unsloth's GGUF export is unavailable, follow the printed instructions to use llama.cpp's `convert_hf_to_gguf.py` (named `convert-hf-to-gguf.py` in older llama.cpp checkouts).
## Troubleshooting
- If you see errors about missing CUDA or bitsandbytes, ensure you are running on Apple Silicon and have the latest Unsloth/Transformers.
- For memory errors, reduce `--batch-size` or `--cutoff-len`.
- For best results, use datasets formatted to match the official Gemma 3n chat template.
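You can inspect that template by rendering it with the model's tokenizer, a minimal sketch using the standard `transformers` API (the model ID is the one from the references below):
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/gemma-3n-E4B-it")
messages = [{"role": "user", "content": "Hello!"}]
# Render the official chat template as plain text, with the assistant turn opened.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```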
## Example: Manual GGUF Export with llama.cpp
If the script prints a message about manual conversion, convert the trained model to GGUF and then quantize it. Note that llama.cpp's converter only emits high-precision types (f16/bf16/q8_0); K-quants are produced in a second step with `llama-quantize`, and the closest standard llama.cpp type to Q4_K_XL (an Unsloth naming) is `Q4_K_M`:
```bash
python convert_hf_to_gguf.py training_runs/myjob/adapter --outtype f16 \
  --outfile training_runs/myjob/adapter-f16.gguf
./llama-quantize training_runs/myjob/adapter-f16.gguf \
  training_runs/myjob/adapter-gguf-q4_k_m.gguf Q4_K_M
```
## References
- [Unsloth Documentation](https://unsloth.ai/)
- [Gemma 3n E4B Model Card](https://huggingface.co/unsloth/gemma-3n-E4B-it)
- [llama.cpp GGUF Export Guide](https://github.com/ggerganov/llama.cpp)
---
title: Multimodal AI Backend Service
emoji: 🚀
colorFrom: yellow
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
---
# firstAI - Multimodal AI Backend 🚀
A powerful AI backend service with **multimodal capabilities** and **advanced deployment support** - supporting both text generation and image analysis using transformers pipelines.
## 🚀 Features
### 🤖 Configurable AI Models
- **Default Text Model**: Microsoft DialoGPT-medium (deployment-friendly)
- **Advanced Models**: Support for quantized models (Unsloth, 4-bit, GGUF)
- **Environment Configuration**: Runtime model selection via environment variables
- **Quantization Support**: Automatic 4-bit quantization with fallback mechanisms
### 🖼️ Multimodal Support
- Process text-only messages
- Analyze images from URLs
- Combined image + text conversations
- OpenAI Vision API compatible format
### 🚀 Production Ready
- **Enhanced Deployment**: Multi-level fallback for quantized models
- **Environment Flexibility**: Works in constrained deployment environments
- **Error Resilience**: Comprehensive error handling with graceful degradation
- FastAPI backend with automatic docs
- Health checks and monitoring
- PyTorch with MPS acceleration (Apple Silicon)
### 🔧 Model Configuration
Configure models via environment variables:
```bash
# Set custom text model (optional)
export AI_MODEL="microsoft/DialoGPT-medium"
# Set custom vision model (optional)
export VISION_MODEL="Salesforce/blip-image-captioning-base"
# For private models (optional)
export HF_TOKEN="your_huggingface_token"
```
**Supported Model Types:**
- Standard models: `microsoft/DialoGPT-medium`, `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`
- Quantized models: `unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit`
- GGUF models: `unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF`
## 🚀 Quick Start
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
### 2. Start the Service
```bash
python backend_service.py
```
### 3. Test Multimodal Capabilities
```bash
python test_final.py
```
The service will start on **http://localhost:8001** with both text and vision models loaded.
## 💡 Usage Examples
### Text-Only Chat
```bash
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
### Image Analysis
```bash
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "url": "https://example.com/image.jpg"
          }
        ]
      }
    ]
  }'
```
### Multimodal (Image + Text)
```bash
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "url": "https://example.com/image.jpg"
          },
          {
            "type": "text",
            "text": "What do you see in this image?"
          }
        ]
      }
    ]
  }'
```
## 🔧 Technical Details
### Architecture
- **FastAPI** web framework
- **Transformers** pipeline for AI models
- **PyTorch** backend with GPU/MPS support
- **Pydantic** for request/response validation
### Models
- **Text**: microsoft/DialoGPT-medium
- **Vision**: Salesforce/blip-image-captioning-base
### API Endpoints
- `GET /` - Service information
- `GET /health` - Health check
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completions (text/multimodal)
- `GET /docs` - Interactive API documentation
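A quick smoke test of the read-only endpoints with `requests` (assumes the service is running on port 8001, as noted above):
```python
import requests

BASE = "http://localhost:8001"

# Health check: should report the service status.
print(requests.get(f"{BASE}/health").json())

# List the models the backend currently exposes.
print(requests.get(f"{BASE}/v1/models").json())
```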
## 🚀 Deployment
### Environment Variables
```bash
# Optional: Custom models
export AI_MODEL="microsoft/DialoGPT-medium"
export VISION_MODEL="Salesforce/blip-image-captioning-base"
export HF_TOKEN="your_token_here"  # For private models
```
### Production Deployment
The service includes enhanced deployment capabilities:
- **Quantized Model Support**: Automatic handling of 4-bit and GGUF models
- **Fallback Mechanisms**: Multi-level fallback for constrained environments
- **Error Resilience**: Graceful degradation when quantization libraries are unavailable
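A minimal sketch of that kind of multi-level fallback (illustrative only; `load_with_fallback` is a hypothetical helper, not the actual code in `backend_service.py`):
```python
import torch
from transformers import AutoModelForCausalLM

def load_with_fallback(model_id: str):
    """Try 4-bit quantized loading first, then fp16, then plain fp32."""
    try:
        from transformers import BitsAndBytesConfig  # needs bitsandbytes installed
        quant = BitsAndBytesConfig(load_in_4bit=True)
        return AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant)
    except Exception:
        pass  # quantization library missing or unsupported on this hardware
    try:
        return AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    except Exception:
        return AutoModelForCausalLM.from_pretrained(model_id)  # fp32 last resort
```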
### Docker Deployment
```bash
# Build and run with Docker
docker build -t firstai .
docker run -p 8000:8000 firstai
```
### Testing Deployment
```bash
# Test quantization detection and fallbacks
python test_deployment_fallbacks.py
# Test health endpoint
curl http://localhost:8000/health
```
For comprehensive deployment guidance, see `DEPLOYMENT_ENHANCEMENTS.md`.
## 🧪 Testing
Run the comprehensive test suite:
```bash
python test_final.py
```
Test individual components:
```bash
python test_multimodal.py   # Basic multimodal tests
python test_pipeline.py     # Pipeline compatibility
```
## 📦 Dependencies
Key packages:
- `fastapi` - Web framework
- `transformers` - AI model pipelines
- `torch` - PyTorch backend
- `Pillow` - Image processing
- `accelerate` - Model acceleration
- `requests` - HTTP client
## 🎯 Integration Complete
This project successfully integrates:
✅ **Transformers image-text-to-text pipeline**
✅ **OpenAI Vision API compatibility**
✅ **Multimodal message processing**
✅ **Production-ready FastAPI service**
See `MULTIMODAL_INTEGRATION_COMPLETE.md` for detailed integration documentation.
---
title: AI Backend Service
emoji: 🚀
colorFrom: yellow
colorTo: purple
sdk: fastapi
sdk_version: 0.100.0
app_file: backend_service.py
pinned: false
---
# AI Backend Service 🚀
**Status: ✅ CONVERSION COMPLETE!**
Successfully converted from a non-functioning Gradio HuggingFace app to a production-ready FastAPI backend service with OpenAI-compatible API endpoints.
## Quick Start
### 1. Setup Environment
```bash
# Activate the virtual environment
source gradio_env/bin/activate
# Install dependencies (already done)
pip install -r requirements.txt
```
### 2. Start the Backend Service
```bash
python backend_service.py --port 8000 --reload
```
### 3. Test the API
```bash
# Run comprehensive tests
python test_api.py
# Or try usage examples
python usage_examples.py
```
## API Endpoints
| Endpoint               | Method | Description                         |
| ---------------------- | ------ | ----------------------------------- |
| `/`                    | GET    | Service information                 |
| `/health`              | GET    | Health check                        |
| `/v1/models`           | GET    | List available models               |
| `/v1/chat/completions` | POST   | Chat completion (OpenAI compatible) |
| `/v1/completions`      | POST   | Text completion                     |
## Example Usage
### Chat Completion
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 150,
    "temperature": 0.7
  }'
```
### Streaming Chat
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Tell me a joke"}
    ],
    "stream": true
  }'
```
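From Python, the stream can be consumed line by line, assuming OpenAI-style server-sent events (`data: {...}` chunks terminated by `data: [DONE]`); a minimal sketch:
```python
import json
import requests

payload = {
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream": True,
}
with requests.post(
    "http://localhost:8000/v1/chat/completions", json=payload, stream=True
) as r:
    for line in r.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        delta = json.loads(chunk)["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)
```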
## Files
- **`app.py`** - Original Gradio ChatInterface (still functional)
- **`backend_service.py`** - New FastAPI backend service ✅
- **`test_api.py`** - Comprehensive API testing
- **`usage_examples.py`** - Simple usage examples
- **`requirements.txt`** - Updated dependencies
- **`CONVERSION_COMPLETE.md`** - Detailed conversion documentation
## Features
✅ **OpenAI-Compatible API** - Drop-in replacement for the OpenAI API
✅ **Async FastAPI** - High-performance async architecture
✅ **Streaming Support** - Real-time response streaming
✅ **Error Handling** - Robust error handling with fallbacks
✅ **Production Ready** - CORS, logging, health checks
✅ **Docker Ready** - Easy containerization
✅ **Auto-reload** - Development-friendly auto-reload
✅ **Type Safety** - Full type hints with Pydantic validation
## Service URLs
- **Backend Service**: http://localhost:8000
- **API Documentation**: http://localhost:8000/docs
- **OpenAPI Spec**: http://localhost:8000/openapi.json
## Model Information
- **Current Model**: `microsoft/DialoGPT-medium`
- **Type**: Conversational AI model
- **Provider**: HuggingFace Inference API
- **Capabilities**: Text generation, chat completion
## Architecture
```
┌─────────────────────┐     ┌─────────────────────┐     ┌─────────────────────┐
│   Client Request    │────▶│   FastAPI Backend   │────▶│   HuggingFace API   │
│   (OpenAI format)   │     │  (backend_service)  │     │  (DialoGPT-medium)  │
└─────────────────────┘     └─────────────────────┘     └─────────────────────┘
                                       │
                                       ▼
                            ┌─────────────────────┐
                            │   OpenAI Response   │
                            │   (JSON/Streaming)  │
                            └─────────────────────┘
```
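The middle hop in the diagram (backend to HuggingFace API) can be exercised directly with `huggingface_hub`; an illustrative sketch, not the backend's actual code:
```python
from huggingface_hub import InferenceClient

# Uses the HF Inference API; pass token=... for private models or higher limits.
client = InferenceClient(model="microsoft/DialoGPT-medium")
print(client.text_generation("Hello! How are you?", max_new_tokens=50))
```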
## Development
The service includes:
- **Auto-reload** for development
- **Comprehensive logging** for debugging
- **Type checking** for code quality
- **Test suite** for reliability
- **Error handling** for robustness
## Production Deployment
Ready for production with:
- **Environment variables** for configuration
- **Health check endpoints** for monitoring
- **CORS support** for web applications
- **Docker compatibility** for containerization
- **Structured logging** for observability
---
**🎉 Conversion Status: COMPLETE!**
Successfully transformed from a broken Gradio app to a production-ready AI backend service.
For detailed conversion documentation, see [`CONVERSION_COMPLETE.md`](CONVERSION_COMPLETE.md).
# Gemma 3n GGUF FastAPI Backend (Hugging Face Space)
This Space provides an OpenAI-compatible chat API for Gemma 3n GGUF models, powered by FastAPI.
**Note:** On Hugging Face Spaces, the backend runs in `DEMO_MODE` (no model loaded) for demonstration and endpoint testing. For real inference, run locally with a GGUF model and llama-cpp-python.
## Endpoints
- `/health` - Health check
- `/v1/chat/completions` - OpenAI-style chat completions (returns a demo response)
- `/train/start` - Start a (demo) training job
- `/train/status/{job_id}` - Check training job status
- `/train/logs/{job_id}` - Get training logs
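A quick check of the demo endpoints (`<your-space>` and the model name are placeholders; in `DEMO_MODE` the completion is canned):
```bash
curl https://<your-space>.hf.space/health
curl -X POST https://<your-space>.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-3n", "messages": [{"role": "user", "content": "Hello!"}]}'
```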
## Usage
1. **Clone this repo** or create a Hugging Face Space (type: FastAPI).
2. All dependencies are in `requirements.txt`.
3. The Space will start in demo mode (no model download required).
## Local Inference (with GGUF)
To run with a real model locally:
1. Download a Gemma 3n GGUF model (e.g. from https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF).
2. Set `AI_MODEL` to the local path or repo.
3. Unset `DEMO_MODE`.
4. Run:
```bash
pip install -r requirements.txt
export AI_MODEL=/path/to/gemma-3n-model.gguf  # step 2 (path is a placeholder)
unset DEMO_MODE                               # step 3
uvicorn gemma_gguf_backend:app --host 0.0.0.0 --port 8000
```
## License
Apache 2.0