---
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
library_name: transformers
license: llama3.2
tags:
- lora
- llama
- vision-language
- peft
- fine-tuned
inference: false
pipeline_tag: image-text-to-text
---
# lavender-llama-3.2-11b-lora

LoRA fine-tuned model based on `meta-llama/Llama-3.2-11B-Vision-Instruct`
## Model Overview
This is a Lavender fine-tuned version of Llama-3.2-11B-Vision-Instruct. Lavender is a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. This model retains the core capabilities of Llama-3.2 while incorporating Stable Diffusion's visual expertise.
- **Base Model:** `meta-llama/Llama-3.2-11B-Vision-Instruct`
- **Fine-Tuned Model:** `lxasqjc/lavender-llama-3.2-11b-lora`
- **Lavender Paper:** [Diffusion Instruction Tuning (arXiv)](https://arxiv.org/abs/2502.06814)
- **Lavender Project Space:** Diffusion Instruction Tuning
- **GitHub Repository:** Diffusion Instruction Tuning
- **Parameter-Efficient Fine-Tuning (PEFT):** Uses LoRA (Low-Rank Adaptation), so only a small set of adapter weights is trained and released.
- **License:** Llama 3.2 Community License (see `LICENSE.txt`)
## How to Use This Model
This model contains only LoRA weights. To use it, you must first load the base model and then apply the LoRA adapter.
### Install Required Packages

```bash
pip install torch transformers accelerate peft
```
### Load and Use the LoRA Model

```python
from transformers import MllamaForConditionalGeneration, MllamaProcessor
from peft import PeftModel
import requests
import torch
from PIL import Image

# Define model paths
base_model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
lora_model_name = "lxasqjc/lavender-llama-3.2-11b-lora"

# Load base model and processor
base_model = MllamaForConditionalGeneration.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = MllamaProcessor.from_pretrained(base_model_name)

# Apply the LoRA adapter on top of the base model
lora_model = PeftModel.from_pretrained(base_model, lora_model_name)

# Fetch an example image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a multimodal chat prompt
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Preprocess the image-text pair and generate
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(lora_model.device)

output = lora_model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))
```
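Optionally, the adapter can be merged into the base weights so that generation runs without the PEFT wrapper. A minimal sketch using PEFT's `merge_and_unload()` (merging modifies the loaded base model in place; reload the base model if you later need the unmerged weights):

```python
# Fold the LoRA weights into the base model (optional).
# The result behaves like a plain MllamaForConditionalGeneration,
# so no PEFT wrapper is needed at generation time.
merged_model = lora_model.merge_and_unload()

output = merged_model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))
```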
## Applications
- **Vision-Language Tasks:** Describing images and answering visual questions (see the example below).
- **Instruction Following:** Improved instruction compliance compared to the base Llama-3.2 model.
- **Multimodal Reasoning:** Can jointly process and reason about images and text.
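To illustrate the visual question answering use case, the same chat template from the loading example above accepts a direct question about the image (the prompt text here is just an example):

```python
# Ask a visual question instead of the haiku prompt above
vqa_messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What animal is shown in this image, and what is it doing?"}
    ]}
]
vqa_text = processor.apply_chat_template(vqa_messages, add_generation_prompt=True)
vqa_inputs = processor(image, vqa_text, add_special_tokens=False, return_tensors="pt").to(lora_model.device)

output = lora_model.generate(**vqa_inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))
```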
## Training Details
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation)
- **Base Model:** `meta-llama/Llama-3.2-11B-Vision-Instruct`
- **PEFT Framework:** Hugging Face PEFT
- **Precision:** `bfloat16`
- **Hyperparameters:** Coming soon; in the meantime, the LoRA settings shipped with the adapter can be inspected as shown below.
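While the full hyperparameters are pending, the adapter's `adapter_config.json` records the LoRA rank, scaling factor, and target modules, and can be read with PEFT. A small sketch, assuming a recent `peft` version:

```python
from peft import PeftConfig

# Load the adapter configuration stored alongside the LoRA weights
config = PeftConfig.from_pretrained("lxasqjc/lavender-llama-3.2-11b-lora")

print(config.base_model_name_or_path)  # base model the adapter was trained on
print(config.r, config.lora_alpha)     # LoRA rank and scaling factor
print(config.target_modules)           # modules the adapter is attached to
```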
## Limitations & Considerations
- **Not a standalone model:** This repository contains only the LoRA adapter and requires the base Llama-3.2-11B-Vision-Instruct model.
- **Biases & Ethical Use:** Like all large models, it may exhibit biases present in the pretraining data.
- **Hardware Requirements:** Minimum 24 GB VRAM (A100 40GB recommended) for inference; a quantized-loading sketch for smaller GPUs follows below.
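If less memory is available, one option is to load the base model in 4-bit with bitsandbytes before attaching the adapter. This is a minimal, untested sketch (it assumes `bitsandbytes` is installed and has not been validated against this adapter; quantization may affect output quality):

```python
import torch
from transformers import BitsAndBytesConfig, MllamaForConditionalGeneration
from peft import PeftModel

# 4-bit NF4 quantization to reduce the memory footprint of the 11B base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the LoRA adapter on top of the quantized base model
lora_model = PeftModel.from_pretrained(base_model, "lxasqjc/lavender-llama-3.2-11b-lora")
```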
## References
- **Base Model:** [Meta Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
- **LoRA Paper:** [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
- **PEFT Documentation:** [Hugging Face PEFT](https://huggingface.co/docs/peft)
- **Project Space:** Diffusion Instruction Tuning
- **Paper:** [Diffusion Instruction Tuning (arXiv)](https://arxiv.org/abs/2502.06814)
## Citation

If you use this model or the Lavender method in your research, please cite:
```bibtex
@misc{jin2025diffusioninstructiontuning,
      title={Diffusion Instruction Tuning},
      author={Chen Jin and Ryutaro Tanno and Amrutha Saseendran and Tom Diethe and Philip Teare},
      year={2025},
      eprint={2502.06814},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.06814},
}
```