Llama-3.2-11B-Vision-Instruct
This model is based on Meta's Llama-3.2-11B-Vision-Instruct and is fine-tuned for multimodal generation.
Model Description
This model is a vision-language model capable of generating text from a given image and text prompt. It's based on the Llama 3.2 architecture and has been instruction-tuned for improved performance on a variety of tasks, including:
- Image captioning: Generating descriptive captions for images.
- Visual question answering: Answering questions about the content of images.
- Image-based dialogue: Engaging in conversations based on visual input.
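These tasks share one input format and differ only in the text prompt. As a minimal sketch, the helper below builds the single-turn message list that `processor.apply_chat_template` consumes. `TASK_PROMPTS`, `build_messages`, and the prompt wording are illustrative assumptions, not part of the model or the transformers library.

```python
# Hypothetical prompt wording per task; adjust freely.
TASK_PROMPTS = {
    "caption": "Describe this image in one sentence.",
    "vqa": "Answer the question about this image: {question}",
}


def build_messages(task, question=""):
    """Return a single-turn chat message list with one image placeholder."""
    text = TASK_PROMPTS[task].format(question=question)
    return [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": text},
    ]}]


messages = build_messages("vqa", question="How many people are visible?")
```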
Intended Uses & Limitations
This model is intended for research purposes and should be used responsibly. It may generate incorrect or misleading information, and should not be used for making critical decisions.
Limitations:
- The model may not always accurately interpret the content of images.
- It may be biased towards certain types of images or concepts.
- It may generate inappropriate or offensive content.
How to Use
Here's an example of how to use this model in Python with the transformers and gradio libraries:
import gradio as gr
import torch
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Load the model and processor; device_map="auto" places the weights on the
# GPU when one is available and falls back to CPU otherwise
model_name = "ruslanmv/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = MllamaForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Generate a response for one text message and one PIL image
def predict(message, image):
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": message}
    ]}]
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    # The chat template already adds the special tokens, so skip them here
    inputs = processor(
        image, input_text, add_special_tokens=False, return_tensors="pt"
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=100)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)

# Gradio interface
with gr.Blocks() as demo:
    gr.Markdown("# Simple Multimodal Chatbot")
    with gr.Row():
        with gr.Column():  # Message input on the left
            text_input = gr.Textbox(label="Message")
            submit_button = gr.Button("Send")
        with gr.Column():  # Image input on the right
            image_input = gr.Image(type="pil", label="Upload an Image")
    chatbot = gr.Chatbot()  # Chatbot output at the bottom

    # Append the new (message, response) pair to the chat history
    def respond(message, image, history):
        response = predict(message, image)
        return history + [(message, response)]

    submit_button.click(
        fn=respond,
        inputs=[text_input, image_input, chatbot],
        outputs=chatbot
    )

demo.launch()
This code provides a simple Gradio interface for interacting with the model. You can upload an image and type a message, and the model will generate a response based on both inputs.
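The `respond` callback above passes only the latest message to the model, so the model never sees earlier turns. For image-based dialogue, one possible approach is to rebuild the full message list from the Gradio history on every turn. The sketch below assumes the history is a list of `(user_text, assistant_text)` pairs, as in the example, and that the chat template accepts multi-turn messages with the image attached to the first user turn. `history_to_messages` is a hypothetical helper name.

```python
def history_to_messages(history, new_message):
    """Convert (user, assistant) pairs plus a new user message into chat messages.

    The single image placeholder is attached to the first user turn only.
    """
    messages = []
    turns = list(history) + [(new_message, None)]
    for index, (user_text, assistant_text) in enumerate(turns):
        user_content = [{"type": "text", "text": user_text}]
        if index == 0:
            user_content.insert(0, {"type": "image"})
        messages.append({"role": "user", "content": user_content})
        if assistant_text:  # the newest turn has no reply yet
            messages.append({"role": "assistant", "content": [
                {"type": "text", "text": assistant_text},
            ]})
    return messages


messages = history_to_messages(
    [("What animal is this?", "It looks like a red fox.")],
    "What is it doing?",
)
```

The result could replace the `messages` list built inside `predict`, with the rest of that function left unchanged.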
More Information
For more details and examples, please visit ruslanmv.com.
License
This model is licensed under the Llama 3.2 Community License Agreement.