# Hugging Face Inference Endpoint: Gemma-3n-E4B-it LoRA Adapter
This repository provides a LoRA adapter fine-tuned on top of a Hugging Face Transformers model (e.g., Gemma-3n-E4B-it) using PEFT. It is ready to be deployed as a Hugging Face Inference Endpoint.
## How to Deploy as an Endpoint
1. Upload the `adapter` directory (produced by training) to your Hugging Face Hub repository (see the upload sketch below).
   - The directory should contain `adapter_config.json`, `adapter_model.bin`, and the tokenizer files.
2. Add a `handler.py` file to define the endpoint logic.
3. Push to the Hugging Face Hub.
4. Deploy as an Inference Endpoint via the Hugging Face UI.
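
A minimal upload sketch using `huggingface_hub`, assuming the trained adapter lives in a local `adapter/` directory; the repo id `MY_USERNAME/gemma-3n-e4b-it-lora` is a hypothetical placeholder:

```python
# Upload the adapter directory and handler.py to the Hub.
# Assumes you are logged in via `huggingface-cli login` or have HF_TOKEN set.
from huggingface_hub import HfApi

repo_id = "MY_USERNAME/gemma-3n-e4b-it-lora"  # placeholder: your target repo

api = HfApi()
api.create_repo(repo_id, exist_ok=True)

# Keep the adapter/ prefix so it matches the path expected by handler.py
api.upload_folder(folder_path="adapter", repo_id=repo_id, path_in_repo="adapter")
api.upload_file(path_or_fileobj="handler.py", path_in_repo="handler.py", repo_id=repo_id)
```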
## Example `handler.py`

This file loads the base model and the LoRA adapter, and exposes a `__call__` method for inference.
```python
from typing import Dict, Any

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer


class EndpointHandler:
    def __init__(self, path="."):
        # Load the base model and tokenizer
        base_model_id = "<BASE_MODEL_ID>"  # e.g., "google/gemma-2b"
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
        base_model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True)

        # Load the LoRA adapter on top of the base model
        self.model = PeftModel.from_pretrained(base_model, f"{path}/adapter")
        self.model.eval()

        # Move the model to GPU if one is available
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Accept either {"inputs": "..."} or a raw prompt string
        prompt = data["inputs"] if isinstance(data, dict) else data
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=256)
        decoded = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return {"generated_text": decoded}
```
- Replace `<BASE_MODEL_ID>` with the correct base model (e.g., `google/gemma-2b`).
- The endpoint will accept a JSON payload with an `inputs` field containing the prompt.
## Notes

- Make sure your `requirements.txt` includes `transformers`, `peft`, and `torch` (see the example below).
- For large models, use an Inference Endpoint with a GPU.
- You can customize the handler for chat formatting, streaming, etc.
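
A minimal `requirements.txt` consistent with the handler above (pin versions as appropriate for your environment):

```
transformers
peft
torch
```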
## Quickstart

- Train your adapter with `train_gemma_unsloth.py`.
- Upload the `adapter` directory and `handler.py` to your Hugging Face repo.
- Deploy as an Inference Endpoint.
- Send requests to your endpoint!