LLaVA-Phi Model
LLaVA-Phi is a vision-language model that pairs Microsoft's Phi-1.5 language model with a CLIP vision encoder for image understanding.
Model Description
- Base Model: Microsoft Phi-1.5
- Vision Encoder: CLIP ViT-B/32
- Training: QLoRA fine-tuning
- Dataset: Instruct 150K
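To give a sense of scale for the components listed above, the following is a rough sketch of how many tokens a CLIP ViT-B/32 encoder emits for one image and how large a simple projector into Phi-1.5 would be. The card does not say how image features are connected to the language model, so the linear projector here is an assumption, and `clip_vit_tokens` is a hypothetical helper name.

```python
def clip_vit_tokens(image_size: int = 224, patch_size: int = 32) -> int:
    """Tokens emitted by a ViT encoder: one per patch plus one class token."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


CLIP_HIDDEN = 768   # CLIP ViT-B/32 vision hidden size
PHI_HIDDEN = 2048   # Phi-1.5 hidden size

tokens = clip_vit_tokens()
print(f"{tokens} image tokens of width {CLIP_HIDDEN}")

# Assumption: a single linear layer maps each image token to the
# language model width (bias terms ignored).
print(f"Linear projector weights: {CLIP_HIDDEN * PHI_HIDDEN:,}")
```

If the model feeds every image token to the language model, each image uses that many positions of the context window in addition to the text prompt.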
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
import torch
from PIL import Image

# Load the model, its tokenizer, and the CLIP image processor
model = AutoModelForCausalLM.from_pretrained("sagar007/Lava_phi")
tokenizer = AutoTokenizer.from_pretrained("sagar007/Lava_phi")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

# For text
def generate_text(prompt):
    inputs = tokenizer(f"human: {prompt}\ngpt:", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# For images
def process_image_and_prompt(image_path, prompt):
    # CLIP expects 3-channel input, so normalize the image mode to RGB
    image = Image.open(image_path).convert("RGB")
    image_tensor = processor(images=image, return_tensors="pt").pixel_values
    inputs = tokenizer(f"human: <image>\n{prompt}\ngpt:", return_tensors="pt")
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        images=image_tensor,
        max_new_tokens=128,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
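Causal language models typically return the prompt tokens followed by the completion, so the decoded string from the functions above is likely to include the prompt text. A minimal sketch of helpers that build the prompt and keep only the reply follows. `build_prompt` and `extract_reply` are hypothetical names, not part of the model's API, and they assume the `human:` / `gpt:` template shown above.

```python
def build_prompt(prompt: str, with_image: bool = False) -> str:
    """Format a user prompt in the human/gpt template used above."""
    image_tag = "<image>\n" if with_image else ""
    return f"human: {image_tag}{prompt}\ngpt:"


def extract_reply(decoded: str) -> str:
    """Return only the model's reply from a decoded generation."""
    # Keep the text after the first "gpt:" marker; if the marker is
    # missing, fall back to the whole string.
    _, marker, rest = decoded.partition("gpt:")
    reply = rest if marker else decoded
    # Assumption: a new "human:" turn marks the end of the reply.
    reply = reply.split("\nhuman:", 1)[0]
    return reply.strip()
```

For example, `extract_reply(generate_text("Describe a sunset."))` would drop the echoed prompt. The helper splits on the first `gpt:` marker, so a user prompt that itself contains `gpt:` would need different handling.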
Training Details
- Trained using QLoRA (Quantized Low-Rank Adaptation)
- 4-bit quantization for efficiency
- Gradient checkpointing enabled
- Mixed precision training (bfloat16)
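The efficiency gain from these choices can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not the training configuration: the LoRA rank is a hypothetical value (the card does not state one), and the memory figures ignore quantization constants, activations, and any layers kept in higher precision.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # LoRA learns B (d_out x rank) and A (rank x d_in); the base weight
    # stays frozen.
    return rank * (d_in + d_out)


def weight_bytes(n_params: float, bits: int) -> float:
    """Approximate storage for n_params weights at the given bit width."""
    return n_params * bits / 8


HIDDEN = 2048      # Phi-1.5 hidden size
N_PARAMS = 1.3e9   # approximate Phi-1.5 parameter count
RANK = 16          # hypothetical LoRA rank, not stated in this card

full = HIDDEN * HIDDEN
added = lora_params(HIDDEN, HIDDEN, RANK)
print(f"LoRA adds {added:,} params to a {full:,}-param matrix ({added / full:.2%})")
print(f"16-bit weights: {weight_bytes(N_PARAMS, 16) / 1e9:.2f} GB")
print(f"4-bit weights:  {weight_bytes(N_PARAMS, 4) / 1e9:.2f} GB")
```

Under these assumptions, the adapters train only a small fraction of each adapted matrix, and the frozen base weights take about a quarter of their 16-bit size.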
License
MIT License
Citation
@software{llava_phi_2024,
  author = {sagar007},
  title = {LLaVA-Phi: Vision-Language Model},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/sagar007/Lava_phi}
}